
CogVideoX-2B

📄 Read in Chinese | 🤗 Huggingface Space | 🌐 Github | 📜 arxiv

📍 Visit QingYing and the API Platform to experience commercial video generation models.

Demo Show

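A finely crafted wooden toy ship with intricately carved masts and sails glides smoothly across a plush blue carpet that resembles ocean waves. The hull is painted a rich brown and has small windows. The soft, textured carpet makes a perfect backdrop, like a vast expanse of sea. Other toys and children's items are scattered around the ship, creating a playful setting. The scene captures the innocence and imagination of childhood, with the toy ship's voyage symbolizing endless adventure in this whimsical indoor world.

The camera follows a white vintage SUV with a black roof rack as it speeds along a steep mountain road flanked by pine-covered slopes, its tires kicking up clouds of dust. Sunlight falls on the SUV and bathes the whole scene in a warm glow. The dirt road winds into the distance with no other vehicles in sight. Redwoods line both sides of the road, interspersed with patches of greenery. Seen from behind, the vehicle takes the curves with ease, as if on a rugged drive through rough terrain. The dirt road itself is surrounded by steep hills and mountains beneath a clear blue sky with a few wisps of white cloud.

A street artist, wearing a worn denim jacket and a brightly colored bandana, stands before a huge concrete wall in the city center, holding a can of spray paint and spraying a colorful bird onto the mottled wall.

Against the backdrop of a war-torn city, where ruins and crumbling walls tell a story of destruction, the camera focuses on a close-up of a young girl. Her face is smudged with dust, a silent witness to the chaos around her. Her eyes glisten with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to conflict.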

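Model Introduction

CogVideoX is the open-source version of the video generation model that originated from QingYing. The table below lists the video generation models we currently provide, along with their basic information.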

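Model Name	CogVideoX-2B (this repository)	CogVideoX-5B
Model Description	Entry-level model that balances compatibility. Low cost to run and to build on.	Larger model with higher video generation quality and better visual effects.
Inference Precision	FP16 (recommended), BF16, FP32, FP8, INT8; INT4 not supported	BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported
Single-GPU VRAM Consumption	SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB*	SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB*
Multi-GPU Inference VRAM Consumption	FP16: 10 GB* using diffusers	BF16: 15 GB* using diffusers
Inference Speed (steps = 50, FP/BF16)	Single A100: ~90 seconds; single H100: ~45 seconds	Single A100: ~180 seconds; single H100: ~90 seconds
Fine-tuning Precision	FP16	BF16
Fine-tuning VRAM Consumption (per GPU)	47 GB (batch size = 1, LoRA); 61 GB (batch size = 2, LoRA); 62 GB (batch size = 1, SFT)	63 GB (batch size = 1, LoRA); 80 GB (batch size = 2, LoRA); 75 GB (batch size = 1, SFT)
Prompt Language	English*	English*
Prompt Length Limit	226 tokens	226 tokens
Video Length	6 seconds	6 seconds
Frame Rate	8 frames per second	8 frames per second
Video Resolution	720 x 480; other resolutions are not supported (including for fine-tuning)	720 x 480; other resolutions are not supported (including for fine-tuning)
Positional Encoding	3d_sincos_pos_embed	3d_rope_pos_embed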
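
As a small worked example, the 6-second duration and 8 fps listed in the table can be related to the num_frames=49 value used in this card's sample code. The helper name and the "plus one initial frame" convention are assumptions made for illustration; they are not taken from the model's documentation.

```python
# Sketch: relate the table's duration and frame rate to a frame count.
# Assumption: one extra frame accounts for the initial frame, which matches
# the num_frames=49 value used in this card's sample code.

VIDEO_SECONDS = 6      # video length listed in the table
FRAMES_PER_SECOND = 8  # frame rate listed in the table


def expected_num_frames(seconds: int, fps: int) -> int:
    """Return seconds * fps plus one initial frame (assumed convention)."""
    return seconds * fps + 1


print(expected_num_frames(VIDEO_SECONDS, FRAMES_PER_SECOND))  # prints 49
```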

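Data Explanation

When testing with the diffusers library, all optimizations provided by the diffusers library were enabled. Actual VRAM/memory usage with this setup has not been tested on devices other than NVIDIA A100 / H100. In general, the setup can be adapted to any device with NVIDIA Ampere architecture or newer. If the optimizations are disabled, VRAM usage increases significantly, with peak VRAM usage about 3 times the values shown in the table, but inference becomes about 3-4 times faster. You can selectively disable some of the optimizations, including: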

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
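
One way to switch these optimizations on selectively is a small helper like the sketch below. The function name and its parameters are hypothetical; only the four pipeline methods come from the list above. The sketch also assumes the two CPU offload modes are treated as alternatives, which is a choice made here for illustration (the sample code in this card calls both).

```python
def apply_memory_optimizations(pipe, cpu_offload="model", vae_slicing=True, vae_tiling=True):
    """Enable a chosen subset of the memory optimizations on a pipeline.

    cpu_offload: "model", "sequential", or None (hypothetical option names).
    """
    if cpu_offload == "model":
        pipe.enable_model_cpu_offload()
    elif cpu_offload == "sequential":
        pipe.enable_sequential_cpu_offload()
    elif cpu_offload is not None:
        raise ValueError(f"unknown cpu_offload mode: {cpu_offload!r}")

    if vae_slicing:
        pipe.vae.enable_slicing()
    if vae_tiling:
        pipe.vae.enable_tiling()
    return pipe
```

For example, apply_memory_optimizations(pipe, cpu_offload=None) would keep the VAE optimizations but leave every module on the GPU, trading higher VRAM use for speed.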

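For multi-GPU inference, the enable_model_cpu_offload() optimization must be disabled.
Using an INT8 model significantly reduces inference speed. This trade-off lets GPUs with less VRAM run inference normally while keeping the loss in video quality minimal.
The 2B model was trained in FP16 precision and the 5B model in BF16 precision. We recommend running inference in the precision the model was trained with.
PytorchAO and Optimum-quanto can be used to quantize the text encoder, Transformer, and VAE modules to reduce the memory requirements of CogVideoX. This makes it possible to run the model on a free T4 Colab or on GPUs with less VRAM. TorchAO quantization is also fully compatible with torch.compile, which can significantly speed up inference. FP8 precision must be used on NVIDIA H100 or newer devices, and requires installing the torch, torchao, diffusers, and accelerate Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the VRAM optimizations described above. Without VRAM optimization, inference is about 10% faster. Only the diffusers version of the model supports quantization.
The model supports English input only; prompts in other languages can be translated into English by a large language model during prompt refinement.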
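
The precision recommendation in the notes above can be written down as a small lookup. This is a minimal sketch: the mapping and the function name are illustrative, and the values are returned as dtype names so the snippet runs without PyTorch installed.

```python
# Recommended inference precision per model, following the note that each
# model should be run in the precision it was trained with.
RECOMMENDED_DTYPE = {
    "THUDM/CogVideoX-2b": "float16",   # 2B was trained in FP16
    "THUDM/CogVideoX-5b": "bfloat16",  # 5B was trained in BF16
}


def recommended_dtype_name(model_id: str) -> str:
    """Return the recommended dtype name for a known model id (hypothetical helper)."""
    try:
        return RECOMMENDED_DTYPE[model_id]
    except KeyError:
        raise ValueError(f"no recommendation recorded for {model_id!r}") from None


print(recommended_dtype_name("THUDM/CogVideoX-2b"))  # prints float16
```

With PyTorch available, the name can be turned into a dtype via getattr(torch, recommended_dtype_name(model_id)) and passed as torch_dtype when loading the pipeline.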

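Note

Use SAT for inference and fine-tuning of the SAT version of the model. Visit our GitHub for more information.

Quick Start 🤗

This model can be deployed with the huggingface diffusers library. Follow the steps below to deploy it.

We recommend visiting our GitHub to check out the relevant prompt optimization and conversion methods for a better experience.

Install the required dependencies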

# diffusers>=0.30.1
# transformers>=4.44.0
# accelerate>=0.33.0 (suggest install from source)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
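
After installing, it can help to confirm that the installed versions meet the stated minimums. The following is a minimal sketch using only the standard library. The helper names are hypothetical, only three of the listed packages are included (extend the mapping as needed), and the version parser deliberately ignores pre-release suffixes such as .dev0, which is a simplification.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the comments in the install snippet.
MINIMUM_VERSIONS = {
    "diffusers": "0.30.1",
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def parse_version(text: str) -> tuple:
    """Parse the leading numeric components, e.g. '0.34.0.dev0' -> (0, 34, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric components only."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    labels = {True: "ok", False: "too old", None: "missing"}
    for name, ok in check_environment().items():
        print(f"{name}: {labels[ok]}")
```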

Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

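# Memory optimizations; each can be disabled to trade VRAM for speed.
pipe.enable_model_cpu_offload()       # keep idle sub-models on the CPU
pipe.enable_sequential_cpu_offload()  # finer-grained offload: lowest VRAM, slowest
pipe.vae.enable_slicing()             # decode the batch one sample at a time
pipe.vae.enable_tiling()              # decode in spatial tiles
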
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

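Quantized Inference

PytorchAO and Optimum-quanto can be used to quantize the text encoder, Transformer, and VAE modules to reduce the memory requirements of CogVideoX. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with less VRAM. TorchAO quantization is also fully compatible with torch.compile, which can significantly speed up inference.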

# To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
# Source and nightly installation is only required until next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

# Quantization scheme; int8_dynamic_activation_int8_weight is the imported alternative.
quantization = int8_weight_only

# Load each component of the 2B checkpoint and quantize it in place.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
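
To illustrate why INT8 weight quantization lowers memory use, the sketch below quantizes one row of weights to 8-bit integers plus a single scale, then restores it. This is a simplified pure-Python illustration of symmetric absmax quantization, not torchao's implementation; the function names are hypothetical. Each weight is stored in 1 byte instead of the 2 bytes used by FP16/BF16, at the cost of a rounding error of at most half a quantization step.

```python
def quantize_row_int8(row):
    """Symmetric absmax quantization of one weight row to int8 values plus a scale."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights from int8 values and the scale."""
    return [v * scale for v in q]


weights = [0.02, -0.51, 0.33, 0.127]
q, scale = quantize_row_int8(weights)
restored = dequantize_row(q, scale)

# The round-trip error per weight stays within half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)
```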

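In addition, when using PytorchAO, models can be serialized and stored in a quantized data type to save disk space.

Related examples and benchmarks can be found at the following links: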

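Explore the Model

Visit our github, where you will find:

More detailed technical details and code explanations.
Optimization and conversion of prompts.
Inference and fine-tuning of SAT version models, and even pre-release versions.
Project update logs and more opportunities for interaction.
The CogVideoX toolchain to help you make better use of the model.
INT8 model inference code support.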

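Model License

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module) is released under the CogVideoX LICENSE.

Citation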

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}