LTX-2 Model Card

This model card covers the LTX-2 model, presented in the paper LTX-2: Efficient Joint Audio-Visual Foundation Model. The codebase is available here.

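LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio with a single model. It brings together the core building blocks of modern video generation, ships with open weights, and focuses on practical local execution.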

Model Checkpoints

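Name	Notes
ltx-2-19b-dev	The full model in bf16; flexible and trainable
ltx-2-19b-dev-fp8	The full model with fp8 quantization
ltx-2-19b-dev-fp4	The full model with nvfp4 quantization
ltx-2-19b-distilled	The distilled version of the full model; 8 inference steps, CFG=1
ltx-2-19b-distilled-lora-384	A LoRA version of the distilled model, applicable to the full model
ltx-2-spatial-upscaler-x2-1.0	An x2 spatial upscaler for LTX-2 latents, used in multi-stage (multiscale) pipelines for higher resolution
ltx-2-temporal-upscaler-x2-1.0	An x2 temporal upscaler for LTX-2 latents, used in multi-stage (multiscale) pipelines for higher FPS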

Model Details

Developed by: Lightricks
Model type: Diffusion-based audio-video foundation model
Language(s): English

Online Demo

LTX-2 can be accessed directly via the following links:

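Run Locally

Direct Use License

You may use the models (the full model, the distilled model, the upscalers, and any derivatives of the models) for purposes permitted under the license.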

ComfyUI

We recommend using the built-in LTXVideo nodes available in the ComfyUI Manager. For manual installation, please refer to our documentation site.

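PyTorch Codebase

The LTX-2 codebase is a monorepo with several packages: model definitions live in 'ltx-core', pipelines in 'ltx-pipelines', and training capabilities in 'ltx-trainer'. The codebase was tested with Python >=3.12 and CUDA version >12.7, and supports PyTorch ~= 2.7.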
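
The version requirements above can be checked before installing. The following is a minimal sketch, not part of the LTX-2 codebase: the helper names `parse_version` and `check_environment` are hypothetical, and the sketch assumes that "~= 2.7" follows the PEP 440 compatible-release meaning (at least 2.7, below 3.0) and that CUDA is compared on major.minor only.

```python
from itertools import takewhile


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a version string such as '2.7.1+cu128' into (2, 7, 1)."""
    core = text.split("+", 1)[0]  # drop a local build tag such as '+cu128'
    parts = []
    for piece in core.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(python: str, cuda: str, torch: str) -> list[str]:
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    if parse_version(python)[:2] < (3, 12):
        problems.append(f"Python {python} is older than 3.12")
    # Assumption: '>12.7' is read strictly and compared on major.minor only.
    if parse_version(cuda)[:2] <= (12, 7):
        problems.append(f"CUDA {cuda} is not newer than 12.7")
    # Assumption: '~= 2.7' means >= 2.7 within the 2.x series (PEP 440).
    torch_version = parse_version(torch)
    if not (torch_version[:1] == (2,) and torch_version[:2] >= (2, 7)):
        problems.append(f"PyTorch {torch} is outside the ~= 2.7 range")
    return problems


print(check_environment("3.12.3", "12.8", "2.7.1+cu128"))  # []
```

In a real environment the three strings could come from `platform.python_version()`, `torch.version.cuda`, and `torch.__version__`; note that `torch.version.cuda` is `None` on CPU-only builds, which this sketch does not handle.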

Installation

git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

# From the repository root
uv sync
source .venv/bin/activate

Inference

To use our model, please follow the instructions in the ltx-pipelines package.

Diffusers 🧨

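LTX-2 is supported in the Diffusers Python library for text-to-video and image-to-video generation. For more information on using LTX-2 with diffusers, see here.

Use with diffusers

For production-quality generation, we recommend a two-stage generation pipeline. The following is an example of two-stage inference for text-to-video: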

import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video

device = "cuda:0"
width = 768
height = 512

pipe = LTX2Pipeline.from_pretrained(
    "Lightricks/LTX-2", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload(device=device)

prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

# Stage 1 default (non-distilled) inference
frame_rate = 24.0
video_latent, audio_latent = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    sigmas=None,
    guidance_scale=4.0,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    "Lightricks/LTX-2",
    subfolder="latent_upsampler",
    torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

# Load Stage 2 distilled LoRA
pipe.load_lora_weights(
    "Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling is usually necessary to avoid OOM error when VAE decoding
pipe.vae.enable_tiling()
# Change scheduler to use Stage 2 distilled sigmas as is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler
# Stage 2 inference with distilled LoRA and sigmas
video, audio = pipe(
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0], # renoise with first sigma value https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L218
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_lora_distilled_sample.mp4",
)
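
As a quick sanity check on the settings used in the example, the clip length and the Stage 2 resolution can be worked out by hand. This is a minimal sketch with a hypothetical helper name (`describe_clip`); it assumes the latent upsampler loaded in the example is the x2 spatial upscaler, so width and height double while the frame count stays the same.

```python
def describe_clip(width, height, num_frames, frame_rate, spatial_scale=2):
    """Return (duration in seconds, upscaled width, upscaled height)."""
    duration_seconds = num_frames / frame_rate
    # Assumption: the spatial upscaler multiplies both dimensions by spatial_scale.
    return duration_seconds, width * spatial_scale, height * spatial_scale


duration, stage2_width, stage2_height = describe_clip(768, 512, 121, 24.0)
print(f"{duration:.2f} s at {stage2_width}x{stage2_height}")  # 5.04 s at 1536x1024
```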

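For more inference examples, including generation with the distilled checkpoint, see here.

General tips:

Width and height must be divisible by 32. The frame count must be a multiple of 8 plus 1 (that is, of the form 8k + 1, such as 121).
If the resolution is not divisible by 32, or the frame count is not of the form 8k + 1, pad the input with -1 and then crop the result to the desired resolution and number of frames.
For tips on writing effective prompts, please visit our prompting guide.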
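
The size constraints in the tips above can be computed ahead of time. The sketch below uses hypothetical helper names (they are not part of diffusers or the LTX-2 codebase) and only works out the padded settings to generate with and the target to crop back to; it does not perform the -1 padding of the input itself.

```python
def round_up_to_multiple(value: int, multiple: int = 32) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def round_up_num_frames(num_frames: int, step: int = 8) -> int:
    """Smallest frame count of the form step * k + 1 that is >= num_frames."""
    return -(-(num_frames - 1) // step) * step + 1


def plan_generation(width: int, height: int, num_frames: int) -> dict:
    """Return the padded settings to generate with and the target to crop to."""
    return {
        "generate": (
            round_up_to_multiple(width),
            round_up_to_multiple(height),
            round_up_num_frames(num_frames),
        ),
        "crop_to": (width, height, num_frames),
    }


plan = plan_generation(width=1280, height=720, num_frames=100)
print(plan["generate"])  # (1280, 736, 105)
print(plan["crop_to"])   # (1280, 720, 100)
```

Rounding up rather than down keeps the requested content fully inside the generated clip, so the final crop only discards padding.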

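Limitations

This model is not intended to provide factual information, nor is it able to.
As a statistical model, this checkpoint may amplify existing societal biases.
The model may fail to generate videos that match the prompt exactly.
Prompt adherence is heavily influenced by the prompting style.
The model may generate inappropriate or offensive content.
When generating audio without speech, the audio quality may be lower.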

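Train the Model

The base (dev) model is fully trainable.

By following the instructions in the LTX-2 Trainer README, it is straightforward to reproduce the LoRAs and IC-LoRAs that we publish with the model.

In many cases, training for motion, style, or likeness (sound + appearance) can take less than an hour.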

Citation

@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Guy Shiran and Itay Chachy and Jonathan Chetboun and Michael Finkelson and Michael Kupchick and Nir Zabari and Nitzan Guetta and Noa Kotler and Ofir Bibi and Ori Gordon and Poriya Panet and Roi Benita and Shahar Armon and Victor Kulikov and Yaron Inger and Yonatan Shiftan and Zeev Melumian and Zeev Farbman},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}