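LLaVA-NeXT-Video Model Card

A Google Colab demo is also available for running LLaVA on a free-tier Google Colab instance.

Disclaimer: The team releasing LLaVA-NeXT-Video did not write a model card for this model, so this model card has been written by the Hugging Face team.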

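📄 Model details

Model type: LLaVA-NeXT-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. The model builds on LLaVA-NeXT and is tuned on a mix of video and image data to achieve better video understanding. Videos were sampled uniformly at 32 frames per clip. The model is currently the SOTA (state of the art) among open-source models on the VideoMME benchmark.

Base LLM: lmsys/vicuna-7b-v1.5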

[Figure: LLaVA-NeXT-Video architecture]

Model date: LLaVA-NeXT-Video-7B was trained in April 2024.

Paper or resources for more information: https://github.com/LLaVA-VL/LLaVA-NeXT

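📚 Training dataset

Image

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
500K academic-task-oriented VQA data mixture.
50K GPT-4V data mixture.
40K ShareGPT data.

Video

100K VideoChatGPT-Instruct.

📊 Evaluation dataset

A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.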

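🚀 How to use the model

First, make sure transformers >= 4.42.0 is installed. The model supports multi-visual and multi-prompt generation, meaning you can pass multiple images/videos in your prompt. Also make sure to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> or <video> at the position where you want to query the image/video.

Below is an example script that runs generation in float16 precision on a GPU device: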
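As a rough illustration of the prompt template described above, the following sketch assembles a single-turn prompt string by hand. `build_prompt` is a hypothetical helper, not part of transformers, and the exact whitespace is an assumption based on the `USER: xxx\nASSISTANT:` pattern. In practice `processor.apply_chat_template` is the safer route, since it produces the format the checkpoint expects.

```python
def build_prompt(question: str, num_images: int = 0, num_videos: int = 0) -> str:
    """Assemble a single-turn prompt following the USER/ASSISTANT pattern.

    Hypothetical helper for illustration only; the spacing is an assumption.
    """
    # One placeholder token per visual input, each on its own line before the question.
    placeholders = ["<image>"] * num_images + ["<video>"] * num_videos
    body = "\n".join(placeholders + [question])
    return f"USER: {body}\nASSISTANT:"


print(build_prompt("Why is this video funny?", num_videos=1))
```

The number of `<image>` and `<video>` placeholders has to match the number of images and videos passed to the processor.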

import av
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-34B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


# define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image", "video") 
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
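The frame index computation in the script can be reproduced without NumPy. The sketch below uses a hypothetical helper, `uniform_frame_indices`, that mirrors `np.arange(0, total_frames, total_frames / 8).astype(int)` and adds two guards the script leaves out: it rejects a frame count of zero (which PyAV can report when the container metadata does not include the count) and it never asks for more frames than the video has.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Return up to `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0:
        # The stream metadata may not contain a frame count; in that case the
        # caller has to count frames some other way before sampling.
        raise ValueError("total_frames must be positive")
    # Never request more frames than the video has, so indices stay unique.
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(uniform_frame_indices(100, 8))  # [0, 12, 25, 37, 50, 62, 75, 87]
print(uniform_frame_indices(5, 8))    # [0, 1, 2, 3, 4]
```

Since the model details above mention 32 frames per clip during training, passing `num_frames=32` may be worth trying for longer videos if memory allows.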

Inference with images as inputs

After loading the model as described above, use the following code to generate from an image:

import requests
from PIL import Image

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What are these?"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

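Inference with images and videos as inputs

To generate from both an image and a video in a single call, use the following code after loading the model as described above: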

conversation_1 = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What's the content of the image?"},
          {"type": "image"},
        ],
    }
]
conversation_2 = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Why is this video funny?"},
          {"type": "video"},
        ],
    },
]
prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)

# `raw_image` and `clip` come from the image and video examples above
inputs = processor(text=[prompt_1, prompt_2], images=raw_image, videos=clip, padding=True, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(out)
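The decoded strings typically contain the prompt followed by the model's reply. A minimal sketch for keeping only the reply is shown below; `extract_answer` is a hypothetical helper, it assumes the `USER: ... ASSISTANT:` template described earlier, and the sample string is made up for illustration rather than real model output.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last assistant marker, or the whole string if absent."""
    _, found, answer = decoded.rpartition(marker)
    return answer.strip() if found else decoded.strip()


# Illustrative string only, not real model output.
sample = "USER: \nWhy is this video funny?\nASSISTANT: The baby is reading a book."
print(extract_answer(sample))  # The baby is reading a book.
```

With the batch above this could be applied as `[extract_answer(text) for text in out]`.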

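Starting from transformers >= v4.48, you can also pass an image/video URL or local path in the conversation history and let the chat template handle the rest. For videos, you also need to specify how many num_frames to sample from the video; otherwise the whole video will be loaded. The chat template loads the image/video for you and returns the inputs as torch.Tensor, which you can pass directly to model.generate().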

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "video", "path": "my_video.mp4"},
            {"type": "text", "text": "What is shown in this image and video?"},
        ],
    },
]

inputs = processor.apply_chat_template(messages, num_frames=8, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
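Because a small structural slip in `messages` (a missing key or a wrong type) is easy to make, a quick sanity check before calling the chat template can help. The sketch below is a hypothetical helper, `check_messages`, and not part of transformers. It only covers the URL/path style shown above, where every image or video item carries its own `url` or `path`; the earlier examples, which pass images and videos to the processor separately, use bare `{"type": "video"}` items and would be flagged by it.

```python
VISUAL_TYPES = {"image", "video"}


def check_messages(messages: list[dict]) -> list[str]:
    """Return a list of problems found; an empty list means the structure looks fine."""
    problems = []
    for m_idx, message in enumerate(messages):
        if "role" not in message:
            problems.append(f"message {m_idx}: missing 'role'")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append(f"message {m_idx}: 'content' must be a list of dicts")
            continue
        for c_idx, item in enumerate(content):
            kind = item.get("type") if isinstance(item, dict) else None
            where = f"message {m_idx}, item {c_idx}"
            if kind == "text":
                if "text" not in item:
                    problems.append(f"{where}: text item needs a 'text' key")
            elif kind in VISUAL_TYPES:
                if not ("url" in item or "path" in item):
                    problems.append(f"{where}: {kind} item needs 'url' or 'path'")
            else:
                problems.append(f"{where}: unknown or missing type")
    return problems


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "video", "path": "my_video.mp4"},
            {"type": "text", "text": "What is shown in this image and video?"},
        ],
    },
]
print(check_messages(messages))  # []
```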

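Model optimization

4-bit quantization through the `bitsandbytes` library

First make sure bitsandbytes is installed (pip install bitsandbytes) and that you have access to a CUDA-compatible GPU device. Simply change the snippet above to: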

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
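To get a feel for what 4-bit loading saves, the following back-of-the-envelope sketch estimates the memory needed for the language model weights alone. The 7B parameter count is an assumed nominal figure for the base LLM named in this card, and the estimate ignores the vision tower, activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 7e9  # assumed nominal parameter count for a 7B language model
print(f"float16: {weight_memory_gib(params, 16):.1f} GiB")  # float16: 13.0 GiB
print(f"4-bit:   {weight_memory_gib(params, 4):.1f} GiB")   # 4-bit:   3.3 GiB
```

For a larger checkpoint, substitute its parameter count; the ratio between the two precisions stays at 4x.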

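Use Flash-Attention 2 to further speed up generation

First make sure flash-attn is installed. See the original Flash Attention repository for installation instructions. Simply change the snippet above to: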

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   attn_implementation="flash_attention_2"
).to(0)

🔒 License

✏️ Citation

If you find our paper and code helpful in your research, please cite:

@misc{zhang2024llavanextvideo,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

