Tune-A-Video

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Figure 3. The generated video from the prompt "Wonderwoman is skiing".

Introduction

Tune-A-Video is a method for the one-shot video tuning task. This task involves finetuning a pretrained T2I (Text-to-Image) model (e.g., Stable Diffusion) on a single reference video so that it can generate an edited video from an edited text prompt. For example, given a reference video with the text prompt "A man is skiing", as shown below, the authors finetune the Text-to-Image model on this reference video. The resulting Text-to-Video model can then generate another video from the text prompt "Spider Man is skiing on the beach, cartoon style".

Figure 1. The diagram of one-shot video tuning task. [1]

Figure 2. The diagram of Tune-A-Video. [1]

The figure above shows the method proposed in the Tune-A-Video paper. First, the authors designed a sparse spatial-temporal attention layer (ST-Attn) to model spatial-temporal relations efficiently. ST-Attn replaces the first self-attention layer in the Transformer blocks of the UNet. They also inserted a temporal attention layer after the FeedForward layer to further learn temporal relations. Second, they proposed using DDIM-inverted noise, obtained from the latent features of the reference video, as the input to the UNet during inference. This improves the spatial-temporal consistency between the generated video and the reference video.
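To make the sparse attention idea more concrete, here is a minimal NumPy sketch. It assumes, following the description in the paper [1], that each frame's queries attend only to keys and values taken from the first frame and the previous frame. The function names, tensor shapes, and random projection matrices are hypothetical and chosen for illustration only; this is not the MindOne implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_src, kv_src, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    q, k, v = q_src @ w_q, kv_src @ w_k, kv_src @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def sparse_st_attention(z, w_q, w_k, w_v):
    """z has shape (frames, tokens, dim): per-frame latent tokens of one video.

    Assumption: frame i attends to the first frame and to frame i - 1 only,
    instead of attending to every frame in the video.
    """
    outputs = []
    for i in range(z.shape[0]):
        prev = max(i - 1, 0)  # the first frame has no predecessor
        kv_src = np.concatenate([z[0], z[prev]], axis=0)  # (2 * tokens, dim)
        outputs.append(attention(z[i], kv_src, w_q, w_k, w_v))
    return np.stack(outputs)

# Toy usage with random features and random (hypothetical) projections.
rng = np.random.default_rng(0)
frames, tokens, dim = 4, 6, 8
z = rng.standard_normal((frames, tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
out = sparse_st_attention(z, w_q, w_k, w_v)
print(out.shape)  # (4, 6, 8)
```

In this sketch every frame attends to 2 × tokens keys regardless of the video length, whereas attention over all frames would need frames × tokens keys per query.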

With the two techniques above, Tune-A-Video achieves high flexibility and fidelity in text-based video editing at a small computational cost.
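The DDIM inversion step described above can be sketched as follows. This is a toy NumPy illustration of deterministic DDIM updates run in the noising direction, not the authors' code: `toy_eps_model` is a hypothetical stand-in for the UNet noise predictor, and the `alphas_cumprod` schedule and latent shape are made-up values.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    # Deterministic DDIM update between two noise levels (eta = 0).
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_model, alphas_cumprod):
    # Walk from the least noisy level (index 0) to the noisiest level.
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        x = ddim_step(x, eps_model(x, t), alphas_cumprod[t], alphas_cumprod[t + 1])
    return x

def ddim_sample(x_T, eps_model, alphas_cumprod):
    # Walk back from the noisiest level to the least noisy level.
    x = x_T
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas_cumprod[t], alphas_cumprod[t - 1])
    return x

rng = np.random.default_rng(0)
alphas_cumprod = np.linspace(0.999, 0.01, 50)   # hypothetical noise schedule
video_latent = rng.standard_normal((4, 8, 8))   # (frames, height, width), toy size
fixed_noise = rng.standard_normal(video_latent.shape)

def toy_eps_model(x, t):
    # Hypothetical stand-in for the UNet: always predicts the same noise.
    return fixed_noise

inverted = ddim_invert(video_latent, toy_eps_model, alphas_cumprod)
reconstructed = ddim_sample(inverted, toy_eps_model, alphas_cumprod)
print(np.abs(reconstructed - video_latent).max())
```

With this constant toy predictor, sampling from the inverted noise recovers the original latent up to floating-point error. With a real noise predictor the round trip is only approximate, but the inverted noise still carries the structure of the reference video, which is the property the method relies on at inference time.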

How to Get Started with the Model

For information on how to train the model and run inference with it, please see the MindOne GitHub repository.

Uses

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

Generation of artworks and use in design and other artistic processes.
Applications in educational or creative tools.
Research on generative models.
Safe deployment of models which have the potential to generate harmful content.
Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

Out-of-Scope Use

The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.

Limitations and Bias

Limitations

The model does not achieve perfect photorealism.
The model cannot render legible text.
The model struggles with more difficult tasks that involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
Faces and people in general may not be generated properly.
The autoencoding part of the model is lossy.

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.