Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Figure 3. The generated video from the prompt "Wonderwoman is skiing".
Tune-A-Video is a method for the one-shot video tuning task. This task involves finetuning a pretrained T2I (Text-to-Image) model (e.g., Stable Diffusion) on a single reference video so that it can generate an edited video from an edited text prompt. For example, given a reference video with the text prompt "A man is skiing", as shown below, the authors finetuned the Text-to-Image model on this reference video. The resulting Text-to-Video model then generates another video based on the text prompt "Spider Man is skiing on the beach, cartoon style".
Figure 1. The diagram of one-shot video tuning task. [1]
Figure 2. The diagram of Tune-A-Video. [1]
The figure above shows the method proposed in the Tune-A-Video paper. First, the authors designed a sparse spatio-temporal attention layer to model spatio-temporal relations efficiently. They replaced the first self-attention layer in the Transformer block inside the UNet with this custom spatio-temporal attention layer (ST-Attn). They also inserted a temporal attention layer after the FeedForward layer to further learn temporal relations. Second, they proposed using DDIM-inverted noise, computed from the latent features of the reference video, as the input to the UNet during inference. This improves the spatio-temporal consistency between the generated video and the reference video.
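To make the idea of sparse spatio-temporal attention more concrete, here is a minimal NumPy sketch. It assumes the sparse-causal form described in the paper [1], where the queries come from the current frame and the keys and values come only from the first frame and the preceding frame. The function name, the single-head layout, the tensor shapes, and the random projection weights are illustrative assumptions, not the MindOne implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(frames, w_q, w_k, w_v):
    """Single-head sketch of ST-Attn (hypothetical helper).

    frames: (num_frames, num_tokens, channels) latent tokens per frame.
    Returns an array of shape (num_frames, num_tokens, head_dim).
    """
    scale = 1.0 / np.sqrt(w_q.shape[1])
    outputs = []
    for i in range(frames.shape[0]):
        prev = max(i - 1, 0)  # assumption: the first frame uses itself as "previous"
        # Keys/values see only the first and the previous frame: (2 * num_tokens, channels)
        context = np.concatenate([frames[0], frames[prev]], axis=0)
        q = frames[i] @ w_q
        k = context @ w_k
        v = context @ w_v
        attn = softmax(q @ k.T * scale, axis=-1)  # (num_tokens, 2 * num_tokens)
        outputs.append(attn @ v)
    return np.stack(outputs)

rng = np.random.default_rng(0)
num_frames, num_tokens, channels, head_dim = 4, 6, 8, 8
frames = rng.standard_normal((num_frames, num_tokens, channels))
w_q, w_k, w_v = (rng.standard_normal((channels, head_dim)) for _ in range(3))

out = sparse_causal_attention(frames, w_q, w_k, w_v)
print(out.shape)  # (4, 6, 8)
```

In this sketch each frame builds an attention map over 2 x num_tokens keys rather than num_frames x num_tokens keys, which is where the savings over full spatio-temporal attention would come from.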
With the two techniques above, Tune-A-Video achieves high flexibility and fidelity in the text-based video editing task at a small computational cost.
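The DDIM inversion step can be sketched as follows. This is a toy illustration under stated assumptions: the noise schedule is a Stable-Diffusion-like scaled-linear schedule chosen here for illustration, the helper names are hypothetical, and `toy_eps_model` stands in for the finetuned UNet by returning a fixed noise estimate. With such an input-independent predictor the round trip is exact; with a real UNet, inversion is only approximate, and sampling would be conditioned on the edited prompt.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(latent, eps_model, alphas):
    """Walk from the clean latent towards noise (increasing noise level)."""
    x = latent
    for i in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i + 1])
    return x

def ddim_sample(noise, eps_model, alphas):
    """Walk from noise back to a clean latent (decreasing noise level)."""
    x = noise
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i - 1])
    return x

# Assumed schedule: scaled-linear betas over 1000 steps, subsampled to 50 levels.
betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
alphas = np.cumprod(1.0 - betas)[::20]

rng = np.random.default_rng(0)
video_latent = rng.standard_normal((4, 8, 8))  # (frames, height, width), toy size
fixed_eps = rng.standard_normal(video_latent.shape)

def toy_eps_model(x, step_index):
    # Placeholder for the UNet: ignores its inputs and returns a fixed estimate.
    return fixed_eps

inverted = ddim_invert(video_latent, toy_eps_model, alphas)  # used as the start noise
restored = ddim_sample(inverted, toy_eps_model, alphas)
print(np.allclose(restored, video_latent))
```

Starting the sampler from `inverted` rather than from fresh Gaussian noise is what ties the generated frames to the structure of the reference video.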
For information on how to train the model and run inference, please see the MindOne GitHub repository.
The model is intended for research purposes only. Possible research areas and tasks include
Excluded uses are described below.
The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.