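CogVideoX is the open-source version of the video generation model originating from QingYing. The table below lists the video generation models we currently provide, along with their basic information.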
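| Model Name | CogVideoX-2B (this repository) | CogVideoX-5B |
|---|---|---|
| Model Description | Entry-level model that balances compatibility. Low cost to run and to build on. | Larger model with higher video generation quality and better visual effects. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| Single-GPU VRAM Usage | SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* |
| Multi-GPU Inference VRAM Usage | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers |
| Inference Speed (steps = 50, FP/BF16) | Single A100: ~90 seconds; single H100: ~45 seconds | Single A100: ~180 seconds; single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 |
| Fine-tuning VRAM Usage (per GPU) | 47 GB (batch size = 1, LoRA); 61 GB (batch size = 2, LoRA); 62 GB (batch size = 1, SFT) | 63 GB (batch size = 1, LoRA); 80 GB (batch size = 2, LoRA); 75 GB (batch size = 1, SFT) |
| Prompt Language | English* | English* |
| Prompt Length Limit | 226 tokens | 226 tokens |
| Video Length | 6 seconds | 6 seconds |
| Frame Rate | 8 frames per second | 8 frames per second |
| Video Resolution | 720 x 480; other resolutions are not supported (including for fine-tuning) | 720 x 480; other resolutions are not supported (including for fine-tuning) |
| Positional Encoding | 3d_sincos_pos_embed | 3d_rope_pos_embed |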
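
The video length and frame rate rows can be related to the `num_frames=49` argument used in the inference example. The following is a minimal sketch, assuming the frame count equals duration times frame rate plus one initial frame. That relationship is an assumption that happens to match 6 seconds at 8 frames per second, not something the table states, and `frame_count` is a hypothetical helper name.

```python
def frame_count(duration_s: int, fps: int) -> int:
    """Number of frames for a clip, assuming one initial frame plus duration * fps."""
    # Assumption: the extra frame is the starting frame of the clip.
    return duration_s * fps + 1


if __name__ == "__main__":
    # 6 seconds at 8 frames per second, as listed in the table.
    print(frame_count(6, 8))
```
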
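Data Explanation

- When testing with the diffusers library, all optimizations provided by the library were enabled. Actual VRAM/memory usage has not been tested on devices other than NVIDIA A100 / H100. In general, this scheme can be adapted to all devices with the NVIDIA Ampere architecture or newer. If the optimizations are disabled, VRAM usage increases significantly, with peak usage about 3 times the value shown in the table, but inference becomes about 3-4 times faster. You can selectively disable some optimizations, including `pipe.enable_model_cpu_offload()`, `pipe.enable_sequential_cpu_offload()`, `pipe.vae.enable_slicing()`, and `pipe.vae.enable_tiling()`.
- … the `enable_model_cpu_offload()` optimization.
- The 2B model was trained in FP16 precision and the 5B model in BF16 precision. We recommend running inference in the precision the model was trained with.
- TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision must be used on NVIDIA H100 or newer devices, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. CUDA 12.4 is recommended.
- The diffusers version of the model supports quantization.

Note

This model can be deployed with the Hugging Face diffusers library. You can follow the steps below to deploy it.

We recommend visiting our GitHub to review the related prompt optimization and conversion methods for a better experience.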
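
The optimizations named in the data explanation can be switched on or off individually to trade VRAM for speed. The sketch below shows one way to group those choices, assuming a pipeline object that exposes the four methods listed above. `MemoryOptions` and `apply_memory_options` are hypothetical names introduced for illustration and are not part of diffusers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryOptions:
    """Which of the four memory optimizations to enable (all on by default)."""
    model_cpu_offload: bool = True
    sequential_cpu_offload: bool = True
    vae_slicing: bool = True
    vae_tiling: bool = True


def apply_memory_options(pipe, options: MemoryOptions) -> List[str]:
    """Call the selected optimization methods on `pipe` and return their names."""
    applied = []
    if options.model_cpu_offload:
        pipe.enable_model_cpu_offload()
        applied.append("enable_model_cpu_offload")
    if options.sequential_cpu_offload:
        pipe.enable_sequential_cpu_offload()
        applied.append("enable_sequential_cpu_offload")
    if options.vae_slicing:
        pipe.vae.enable_slicing()
        applied.append("vae.enable_slicing")
    if options.vae_tiling:
        pipe.vae.enable_tiling()
        applied.append("vae.enable_tiling")
    return applied
```

With a loaded pipeline, `apply_memory_options(pipe, MemoryOptions(sequential_cpu_offload=False))` would keep model-level offloading and both VAE optimizations while skipping sequential offloading.
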
    # diffusers>=0.30.1
    # transformers>=4.44.0
    # accelerate>=0.33.0 (installing from source is suggested)
    # imageio-ffmpeg>=0.5.1
    pip install --upgrade transformers accelerate diffusers imageio-ffmpeg

Then run inference:

    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

    # Load the 2B pipeline in FP16, the precision recommended for this model.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b",
        torch_dtype=torch.float16,
    )

    # Memory optimizations; each can be disabled to trade VRAM for speed.
    pipe.enable_model_cpu_offload()
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    # Generate 49 frames with 50 denoising steps and a fixed seed.
    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=49,
        guidance_scale=6,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]

    # Write the frames to an MP4 file at 8 frames per second.
    export_to_video(video, "output.mp4", fps=8)

PytorchAO and Optimum-quanto can be used to quantize the text encoder, Transformer, and VAE modules to reduce the memory requirements of CogVideoX. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with less VRAM. Notably, TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed.
    # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
    # Source and nightly installation is only required until the next release.

    import torch
    from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
    from diffusers.utils import export_to_video
    from transformers import T5EncoderModel
    from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

    # Quantization scheme applied to every module below;
    # int8_dynamic_activation_int8_weight is imported as an alternative.
    quantization = int8_weight_only

    # Load each module from the same checkpoint as the pipeline and quantize it in place.
    text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
    quantize_(text_encoder, quantization())

    transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
    quantize_(transformer, quantization())

    vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
    quantize_(vae, quantization())

    # Create the pipeline from the quantized modules and run inference.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b",
        text_encoder=text_encoder,
        transformer=transformer,
        vae=vae,
        torch_dtype=torch.bfloat16,
    )
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()

    prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=49,
        guidance_scale=6,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]

    export_to_video(video, "output.mp4", fps=8)

In addition, when using PytorchAO, models can be serialized and stored in a quantized data type to save disk space.
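Related examples and benchmarks can be found at the following links:
Feel free to visit our GitHub, where you will find:
The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.
The CogVideoX-5B model (Transformers module) is released under the CogVideoX LICENSE.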
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}