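Open-domain text-to-video synthesis model

This model is based on a multi-stage text-to-video generation diffusion model. It takes a text description as input and returns a video that matches the description. Only English input is supported.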

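Model description

The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a diffusion model that maps text features to the video latent space, and a model that maps the video latent space to the video visual space. The overall model has about 1.7 billion parameters. The diffusion model uses a UNet3D structure and generates video through an iterative denoising process that starts from a video of pure Gaussian noise.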
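
The iterative denoising described above can be illustrated with a toy sketch. It is not the model's UNet3D or its actual sampler: the linear noise schedule, the `oracle_noise` stand-in (which cheats by looking at the target), and every name here are assumptions made for illustration. It only shows the shape of the loop: start from pure Gaussian noise and repeatedly remove the predicted noise with a deterministic DDIM-style update.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def oracle_noise(x_t, target, alpha_bar):
    """Stand-in for the denoising network: returns the exact noise in x_t.

    A real model predicts this from x_t, the timestep and the text features;
    this toy version cheats by reading the target latent directly.
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * x0) / s for x, x0 in zip(x_t, target)]


def ddim_denoise(target, num_steps=1000, seed=0):
    """Run a deterministic DDIM-style loop starting from pure Gaussian noise."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    # Start from pure Gaussian noise with the same size as the flattened latent.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(num_steps)):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = oracle_noise(x, target, ab_t)
        # Estimate the clean latent, then move to the previous noise level.
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                   for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x


# A tiny flattened "video latent": 4 frames x 3 values per frame (made-up data).
target_latent = [0.1 * i for i in range(12)]
result = ddim_denoise(target_latent)
```

Because the stand-in predictor is exact, the loop ends at the target latent. A trained network only approximates the noise, which is why real samplers need many well-chosen steps.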

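This model is intended for research purposes only. Please see the sections on model limitations and biases and on misuse, malicious use and excessive use.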

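Model details

Developed by: ModelScope
Model type: Diffusion-based text-to-video generation model
Language(s): English
License: CC-BY-NC-ND
Resources for more information: ModelScope GitHub repository, model summary.
Cite as: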

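Use cases

This model has a wide range of applications: it can run inference on arbitrary English text descriptions and generate matching videos.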
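
Since only English prompts are supported, it can help to reject obviously unsuitable prompts before spending GPU time. The check below is a rough heuristic and an assumption of this sketch, not something the model or diffusers performs: it only tests for non-empty ASCII text, so it will wrongly accept other languages written in plain Latin letters and wrongly reject English words with accents. `looks_like_english` is a hypothetical helper name.

```python
def looks_like_english(prompt: str) -> bool:
    """Rough heuristic: the prompt is non-empty and contains only ASCII characters."""
    text = prompt.strip()
    return bool(text) and text.isascii()


for candidate in ["Spiderman is surfing", "蜘蛛侠在冲浪", "   "]:
    status = "ok" if looks_like_english(candidate) else "rejected"
    print(f"{candidate!r}: {status}")
```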

Usage

First, install the required libraries:

$ pip install diffusers transformers accelerate torch

Now, generate a video:

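import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load the pipeline in half precision
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# offload idle sub-models to the CPU to reduce GPU memory use
pipe.enable_model_cpu_offload()

# generate the frames
prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
# note: some newer diffusers releases return a batch of videos here; if so, use .frames[0]

# write the frames to a video file and keep its path
video_path = export_to_video(video_frames)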

Here are some results:

An astronaut riding a horse.

Darth Vader surfing in waves.

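Long video generation

You can reduce memory usage by enabling attention and VAE slicing and by using Torch 2.0. This should allow you to generate videos up to 25 seconds long on less than 16GB of GPU VRAM.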

$ pip install git+https://github.com/huggingface/diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load pipeline
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# optimize for GPU memory
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# generate
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames

# convert to video
video_path = export_to_video(video_frames)
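
The 25-second figure in this section lines up with num_frames=200 if the video is played back at 8 frames per second. That frame rate is an assumption inferred from those two numbers, not a value stated in this card, and the rate actually written by export_to_video may depend on the installed diffusers version. The helpers below (hypothetical names) do the conversion in both directions.

```python
import math

ASSUMED_FPS = 8  # assumption: inferred from 200 frames covering about 25 seconds


def frames_for_duration(seconds: float, fps: int = ASSUMED_FPS) -> int:
    """Number of frames needed to cover `seconds` of video at `fps`."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return math.ceil(seconds * fps)


def duration_of(num_frames: int, fps: int = ASSUMED_FPS) -> float:
    """Playback length in seconds of `num_frames` frames at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


print(frames_for_duration(25))  # frames to request for a 25-second clip
print(duration_of(200))         # seconds covered by 200 frames
```

The result of frames_for_duration can be passed as num_frames to the pipeline call. Memory use grows with the number of frames, so longer clips need the memory optimizations shown above.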

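Viewing results

The code above stores the path of the saved video in video_path. The output is an mp4 file whose current encoding can be played with VLC media player; some other media players may not play it correctly.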
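
When export_to_video is called without an output path, as in the snippets above, the library chooses the location itself, and it may be a temporary one. A minimal sketch for copying the file somewhere stable, using only the standard library; `keep_video` and the `outputs` directory are hypothetical names introduced for this example.

```python
import shutil
from pathlib import Path


def keep_video(video_path: str, dest_dir: str = "outputs") -> Path:
    """Copy a generated video into dest_dir and return the new path."""
    src = Path(video_path)
    if not src.is_file():
        raise FileNotFoundError(f"no video found at {src}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as the modification time
    return Path(shutil.copy2(src, dest / src.name))

# Example (assumes video_path came from export_to_video):
# saved = keep_video(video_path)
# print(saved)
```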

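Model limitations and biases

The model is trained on public datasets such as Webvid, so the generated results may show biases related to the distribution of the training data.
The model cannot achieve perfect film- and television-quality generation.
The model cannot generate clear text.
The model is trained mainly on an English corpus and does not currently support other languages.
The model's performance on complex compositional generation tasks needs improvement.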

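Misuse, malicious use and excessive use

The model was not trained to realistically represent people or events, so using it to generate such content is beyond its capabilities.
Generating content that is demeaning or harmful to people or to their environment, culture, religion, etc. is prohibited.
Generating pornographic, violent or gory content is prohibited.
Generating erroneous or false information is prohibited.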

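Training data

The training data includes LAION5B, ImageNet, Webvid and other public datasets. After pre-training, images and videos are filtered using criteria such as aesthetic score, watermark score, and deduplication.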
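
The filtering step mentioned above can be pictured as a pass over scored samples. The sketch below is illustrative only: the field names, the thresholds, the score scales and the exact-match deduplication key are all assumptions, since this card does not say how the scores are computed or which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    uid: str
    caption: str
    aesthetic_score: float  # assumed scale: higher is better
    watermark_score: float  # assumed: probability of a visible watermark (0-1)
    content_hash: str       # assumed: precomputed hash used for deduplication


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds, dropping duplicates."""
    seen_hashes = set()
    kept = []
    for s in samples:
        if s.aesthetic_score < min_aesthetic:
            continue  # not visually appealing enough
        if s.watermark_score > max_watermark:
            continue  # likely watermarked
        if s.content_hash in seen_hashes:
            continue  # same content was already kept
        seen_hashes.add(s.content_hash)
        kept.append(s)
    return kept


# Made-up records, for illustration only.
samples = [
    Sample("a", "a dog running", 6.2, 0.1, "h1"),
    Sample("b", "a dog running", 6.0, 0.1, "h1"),    # duplicate of "a"
    Sample("c", "stock footage", 7.5, 0.9, "h2"),    # likely watermarked
    Sample("d", "blurry clip", 3.1, 0.0, "h3"),      # low aesthetic score
    Sample("e", "sunset over the sea", 5.5, 0.2, "h4"),
]
kept = filter_samples(samples)
print([s.uid for s in kept])
```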

(Parts of this model card were taken from here)

Citation

    @article{wang2023modelscope,
      title={Modelscope text-to-video technical report},
      author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
      journal={arXiv preprint arXiv:2308.06571},
      year={2023}
    }
    @InProceedings{VideoFusion,
        author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
        title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2023}
    }