SANA-WM (Bidirectional)

SANA-WM is an efficient open-source world model trained natively for one-minute generation. The bidirectional checkpoint released here is a 2.6B-parameter image-to-video diffusion transformer that synthesises 720p, minute-scale videos with precise 6-DoF camera control, paired with the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Four core designs drive the architecture:

Hybrid Linear Attention — frame-wise Gated DeltaNet combined with softmax attention every Nth block for memory-efficient long-context modelling.
Dual-Branch Camera Control — independent main and camera branches enable precise per-frame trajectory adherence.
Two-Stage Generation Pipeline — a long-video refiner stitched on top of Stage-1 latents improves quality and temporal consistency.
Robust Annotation Pipeline — metric-scale 6-DoF camera poses extracted from public video corpora yield spatiotemporally consistent action supervision.
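
For orientation, the memory argument behind the hybrid attention design can be read off the generic gated delta rule from the Gated DeltaNet literature. The recurrence below is that generic form, not a description of this checkpoint: how SANA-WM applies it frame-wise, and the symbols $S_t$, $\alpha_t$, $\beta_t$, $q_t$, $k_t$, $v_t$, are assumptions made for illustration.

```latex
% Generic gated delta rule (illustrative; the frame-wise variant used by SANA-WM may differ).
% S_t: matrix-valued state, alpha_t in (0,1): decay gate, beta_t in (0,1): write strength.
\begin{aligned}
S_t &= \alpha_t \, S_{t-1} \left( I - \beta_t \, k_t k_t^{\top} \right) + \beta_t \, v_t k_t^{\top}, \\
o_t &= S_t \, q_t .
\end{aligned}
```

Because $S_t$ has a fixed size, the state does not grow with sequence length, whereas a softmax attention block has to keep keys and values for every earlier token. Interleaving softmax attention only every Nth block keeps that cost confined to a fraction of the layers.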

Paper: https://arxiv.org/abs/2605.15178

@article{zhu2026sanawm,
  title   = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  author  = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
  journal = {arXiv preprint arXiv:2605.15178},
  year    = {2026},
}

Repository layout

Component	Path in repo	Size
Sana DiT (Stage 1)	`dit/sana_wm_1600m_720p.safetensors`	10 GB
LTX-2 VAE (diffusers)	`vae/`	2 GB
LTX-2 refiner (Stage 2)	`refiner/refiner.safetensors`	41 GB
Gemma text encoder for the refiner	`refiner/text_encoder/`	46 GB
Inference config	`config.yaml`	—

The Sana text encoder (gemma-2-2b-it) is not bundled here — it is fetched on demand from the public Hugging Face mirror.

Usage

python inference_video_scripts/inference_sana_wm.py \
  --image      asset/sana_wm/demo_0.png \
  --prompt     asset/sana_wm/demo_0.txt \
  --action     "w-80,jw-40,w-40,lw-60,w-100" \
  --translation_speed 0.055 \
  --rotation_speed_deg 1.2 \
  --num_frames 321 \
  --output_dir results/demo

Weights are fetched from this repository on first use. Pass `--no_refiner` to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE instead. To run fully offline, point `--config`, `--model_path`, `--refiner_checkpoint`, and `--refiner_gemma_root` at local paths.
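
The `--action` string in the command above is a comma-separated list of segments. Its five counts sum to 320, one fewer than `--num_frames 321`, which suggests that each number is a per-segment frame count and that the input image supplies the remaining frame. That reading is an assumption, not something this README states, and the helper names `parse_action` and `total_steps` below are hypothetical. A minimal sketch for sanity-checking a string before launching a run:

```python
def parse_action(spec: str) -> list[tuple[str, int]]:
    """Split an action string such as 'w-80,jw-40' into (keys, count) pairs.

    Assumption: each segment is '<keys>-<count>', where <keys> is one or more
    WASD/IJKL letters and <count> is a non-negative integer.
    """
    segments = []
    for chunk in spec.split(","):
        keys, _, count = chunk.strip().rpartition("-")
        if not keys or not count.isdigit():
            raise ValueError(f"malformed segment: {chunk!r}")
        segments.append((keys, int(count)))
    return segments


def total_steps(spec: str) -> int:
    """Sum the per-segment counts of an action string."""
    return sum(count for _, count in parse_action(spec))


spec = "w-80,jw-40,w-40,lw-60,w-100"
num_frames = 321  # value passed to --num_frames in the usage example

# Assumed relationship: segment counts cover every frame after the first one.
print(parse_action(spec))
print(total_steps(spec), total_steps(spec) == num_frames - 1)
```

If the two numbers disagree, it is worth checking how the script pads or truncates the trajectory before relying on the result.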

Inputs

Argument	Format
`--image`	RGB image (any PIL-readable format) — used as the first frame.
`--prompt`	UTF-8 text file containing the conditioning prompt.
`--camera`	NumPy `.npy` of shape `(F, 4, 4)` — per-frame camera-to-world matrices.
`--action`	WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`.
`--intrinsics`	Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`.
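
When supplying `--intrinsics` by hand, it can help to check the implied field of view against the `[25°, 120°]` range quoted above. The sketch below uses the standard pinhole relation FOV = 2·atan(size / (2·focal)). Which axis the script actually checks is not stated here, so the function reports both; the 1280 x 704 (width x height) frame size and the names `fov_deg` and `fov_report` are assumptions for illustration.

```python
import math

FOV_RANGE_DEG = (25.0, 120.0)  # bounds quoted in the Inputs table


def fov_deg(focal_px: float, size_px: float) -> float:
    """Pinhole field of view in degrees along one image axis."""
    return math.degrees(2.0 * math.atan(size_px / (2.0 * focal_px)))


def fov_report(fx: float, fy: float, width: int = 1280, height: int = 704) -> dict:
    """Horizontal and vertical FOV plus a range check on each.

    Assumption: focal lengths are in pixels at the output resolution.
    """
    lo, hi = FOV_RANGE_DEG
    hfov = fov_deg(fx, width)
    vfov = fov_deg(fy, height)
    return {
        "hfov_deg": hfov,
        "vfov_deg": vfov,
        "hfov_in_range": lo <= hfov <= hi,
        "vfov_in_range": lo <= vfov <= hi,
    }


# A focal length equal to half the image width gives a 90 degree horizontal FOV.
print(fov_report(fx=640.0, fy=640.0))
```

A very long focal length, for example `fx=5000`, yields a horizontal FOV below 25° and would fall outside the stated range under this reading.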

The output frame size is fixed at 704 x 1280; input images are resized with their aspect ratio preserved and then center-cropped to that resolution.
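
To preview how much of an input image survives that preprocessing, the resize and crop geometry can be worked out without loading any pixels. This is a sketch of the usual "cover, then center-crop" arithmetic, assuming 704 x 1280 means height x width; the function name `cover_resize_and_crop` is hypothetical and the script's exact rounding may differ.

```python
def cover_resize_and_crop(src_w: int, src_h: int, dst_w: int = 1280, dst_h: int = 704):
    """Return the resized size and the center-crop box for a target frame.

    The image is scaled by the larger of the two axis ratios so that it fully
    covers the target, then the overflow is trimmed equally from both sides.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    # Clamp so rounding can never leave the resized image smaller than the target.
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


# A 1920x1080 input is scaled to 1280x720, then 8 rows are trimmed top and bottom.
print(cover_resize_and_crop(1920, 1080))
```

Inputs whose aspect ratio is far from 1280:704 lose correspondingly more content to the crop, so framing the subject near the center is the safer choice.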

License

Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE inherit the LTX-2 upstream license.
