Qwen2-VL-7B-Instruct

Introduction

We are excited to unveil Qwen2-VL, the latest iteration of the Qwen-VL model, representing nearly a year of innovation.

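What's New in Qwen2-VL?

Key Enhancements:

  • SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL delivers state-of-the-art results on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

  • Understanding videos of 20 minutes or more: Qwen2-VL can understand videos longer than 20 minutes, enabling high-quality video-based question answering, dialog, and content creation.

  • An agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated into devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.

  • Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now recognizes text in many other languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.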

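Model Architecture Updates:

  • Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens and offering a more human-like visual processing experience.

  • Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.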
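To make the M-ROPE idea more concrete, the sketch below assigns a (temporal, height, width) position triple to every token of a mixed sequence. It is a simplified illustration based on our reading of the Qwen2-VL paper, not the model's actual implementation: the function name `mrope_position_ids` and the segment tuples are hypothetical, and sizes are counted in tokens after patching.

```python
from typing import List, Tuple

# Hypothetical segment descriptions (not a real API):
#   ("text", n_tokens), ("image", height, width), ("video", frames, height, width)
Segment = Tuple


def mrope_position_ids(segments: List[Segment]) -> List[Tuple[int, int, int]]:
    """Return one (temporal, height, width) position triple per token."""
    positions: List[Tuple[int, int, int]] = []
    next_pos = 0  # first free position index for the next segment
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            n_tokens = seg[1]
            # Text: the three components share the same 1D index.
            for i in range(n_tokens):
                positions.append((next_pos + i, next_pos + i, next_pos + i))
            next_pos += n_tokens
            continue
        if kind == "image":
            frames, height, width = 1, seg[1], seg[2]
        elif kind == "video":
            frames, height, width = seg[1], seg[2], seg[3]
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        # Vision: temporal index is constant per frame, while the height and
        # width indices follow the token's row and column in the grid.
        for t in range(frames):
            for h in range(height):
                for w in range(width):
                    positions.append((next_pos + t, next_pos + h, next_pos + w))
        # The next segment starts one past the largest index used so far.
        next_pos += max(frames, height, width)
    return positions


if __name__ == "__main__":
    for triple in mrope_position_ids([("text", 2), ("image", 2, 2), ("text", 1)]):
        print(triple)
```

For a single image the temporal component stays fixed, which is how one scheme can cover text, images, and video at once.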

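We provide three models, with 2 billion, 7 billion, and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.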

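Evaluation

Image Benchmarks

| Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
| --- | --- | --- | --- | --- |
| MMMU (val) | 51.8 | 49.8 | 60 | 54.1 |
| DocVQA (test) | 91.6 | 90.8 | - | 94.5 |
| InfoVQA (test) | 74.8 | - | - | 76.5 |
| ChartQA (test) | 83.3 | - | - | 83.0 |
| TextVQA (val) | 77.4 | 80.1 | - | 84.3 |
| OCRBench | 794 | 852 | 785 | 845 |
| MTVQA | - | - | - | 26.3 |
| VCR (en easy) | - | 73.88 | 83.60 | 89.70 |
| VCR (zh easy) | - | 10.18 | 1.10 | 59.94 |
| RealWorldQA | 64.4 | - | - | 70.1 |
| MME (sum) | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
| MMBench-EN (test) | 81.7 | - | - | 83.0 |
| MMBench-CN (test) | 81.2 | - | - | 80.5 |
| MMBench-V1.1 (test) | 79.4 | 78.0 | 76.0 | 80.7 |
| MMT-Bench (test) | - | - | - | 63.7 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 |
| MathVision | - | - | - | 16.3 |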

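Video Benchmarks

| Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
| --- | --- | --- | --- | --- |
| MVBench | 66.4 | 56.7 | - | 67.0 |
| PerceptionTest (test) | - | 57.1 | - | 62.3 |
| EgoSchema (test) | - | 60.1 | - | 66.7 |
| Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |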

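Requirements

The code of Qwen2-VL is included in the latest Hugging Face transformers. We advise building from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error: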

KeyError: 'qwen2_vl'
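
Before loading the model, it can help to confirm that the installed transformers build knows the `qwen2_vl` model type. The following is a minimal sketch, assuming the error above comes from a lookup in the auto-config registry (`CONFIG_MAPPING`); the helper name `has_qwen2_vl_support` is ours and not part of any library.

```python
import importlib


def has_qwen2_vl_support() -> bool:
    """Return True if the installed transformers registers the 'qwen2_vl' model type."""
    try:
        auto_config = importlib.import_module(
            "transformers.models.auto.configuration_auto"
        )
    except ImportError:
        # transformers is not installed (or cannot be imported) at all.
        return False
    # CONFIG_MAPPING maps model_type strings to configuration classes.
    mapping = getattr(auto_config, "CONFIG_MAPPING", {})
    return "qwen2_vl" in mapping


if __name__ == "__main__":
    if has_qwen2_vl_support():
        print("transformers recognizes qwen2_vl")
    else:
        print("qwen2_vl not found; consider installing transformers from source")
```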

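Quickstart

We offer a toolkit that helps you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:

pip install qwen-vl-utils

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils: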

import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
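
The trimming step in the snippet above exists because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens. Here is a small self-contained illustration using plain Python lists instead of tensors; the token ids are made up and the helper name `trim_prompt_tokens` is hypothetical.

```python
from typing import List


def trim_prompt_tokens(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    prompt_ids = [[101, 102, 103]]            # made-up prompt token ids
    full_output = [[101, 102, 103, 7, 8, 9]]  # prompt followed by new tokens
    print(trim_prompt_tokens(prompt_ids, full_output))  # [[7, 8, 9]]
```

Decoding only the trimmed ids keeps the chat template and the user prompt out of the printed answer.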

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens (distinct loop names avoid shadowing output_ids)
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Messages containing a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
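
To give a feel for what the `"fps": 1.0` entry in the video messages above implies, the sketch below picks evenly spaced frame indices from a clip at a target sampling rate. It illustrates the general idea only; the sampling logic in qwen_vl_utils may differ (for example, it may constrain the number of frames), and `sample_frame_indices` is a hypothetical helper.

```python
from typing import List


def sample_frame_indices(
    total_frames: int, native_fps: float, target_fps: float = 1.0
) -> List[int]:
    """Pick evenly spaced frame indices so that roughly target_fps frames per second remain."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, native_fps and target_fps must be positive")
    # Number of source frames between two sampled frames; never below 1,
    # so we cannot sample more densely than the clip itself.
    step = max(1.0, native_fps / target_fps)
    n_samples = max(1, int(total_frames / step))
    return [min(total_frames - 1, int(round(i * step))) for i in range(n_samples)]


if __name__ == "__main__":
    # A 4-second clip recorded at 25 fps, sampled at 1 fps.
    print(sample_frame_indices(total_frames=100, native_fps=25.0, target_fps=1.0))
```

Lower `fps` values mean fewer frames and therefore fewer visual tokens, which matters for long videos.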

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

More Usage Tips

For input images, we support local files, base64, and URLs. For videos, we currently support only local files.

# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
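
The base64 form shown above can be produced from a local file with the standard library. The sketch below assumes the `data:image;base64,` prefix from the example; the helper names `image_to_data_uri` and `build_image_message` are hypothetical conveniences, not part of qwen_vl_utils.

```python
import base64
from pathlib import Path
from typing import Dict, List


def image_to_data_uri(path: str) -> str:
    """Read an image file and return it as a 'data:image;base64,...' string."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:image;base64,{encoded}"


def build_image_message(image_ref: str, prompt: str) -> List[Dict]:
    """Build a single-turn user message; image_ref may be a file:// path, URL, or data URI."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

For example, `build_image_message(image_to_data_uri("photo.jpg"), "Describe this image.")` yields a `messages` list in the same shape as the examples above, where `photo.jpg` stands in for your own file.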

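Image Resolution for Performance Boost

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to find the configuration that best fits their needs, for example a token count range of 256-1280, to balance speed and memory usage.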

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

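In addition, we provide two methods for fine-grained control over the size of images fed to the model:

  1. Define min_pixels and max_pixels: images are resized, preserving their aspect ratio, so that the total pixel count falls within the range of min_pixels and max_pixels.

  2. Specify exact dimensions: directly set resized_height and resized_width. These values are rounded to the nearest multiple of 28.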

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
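
The two sizing modes can be sketched in a few lines. This is an illustration under stated assumptions rather than the toolkit's actual resizing code: it assumes one visual token per 28x28 pixel block (inferred from the `256 * 28 * 28` budget above), and the helper names are hypothetical. Note that the 50176 used in the example equals 64 * 28 * 28, that is, a budget of 64 tokens.

```python
import math
from typing import Tuple

PATCH = 28  # sizes snap to multiples of 28; assumed one visual token per 28x28 block


def snap_to_patch(value: int) -> int:
    """Round a side length to the nearest multiple of 28 (at least one patch)."""
    return max(PATCH, round(value / PATCH) * PATCH)


def fit_to_pixel_budget(
    height: int,
    width: int,
    min_pixels: int = 256 * PATCH * PATCH,
    max_pixels: int = 1280 * PATCH * PATCH,
) -> Tuple[int, int]:
    """Rescale (height, width) into [min_pixels, max_pixels], roughly keeping aspect ratio."""
    h, w = snap_to_patch(height), snap_to_patch(width)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


def visual_token_count(height: int, width: int) -> int:
    """Estimate the number of visual tokens for an already resized image."""
    return (height // PATCH) * (width // PATCH)


if __name__ == "__main__":
    h, w = fit_to_pixel_budget(1080, 1920)
    print(h, w, visual_token_count(h, w))
```

With the default 256-1280 token budget, a 1080x1920 image is shrunk so that its estimated token count stays below 1280.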

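Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

  1. Lack of audio support: the current model does not comprehend audio information within videos.
  2. Data timeliness: our image dataset is updated until June 2023, so information after this date may not be covered.
  3. Limited recognition of individuals and intellectual property (IP): the model's ability to recognize specific individuals or IPs is limited, and it may not cover all well-known personalities or brands.
  4. Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution need improvement.
  5. Insufficient counting accuracy: particularly in complex scenes, object counting accuracy is low and needs further improvement.
  6. Weak spatial reasoning: especially in 3D spaces, the model's inference of positional relationships between objects is inadequate, and it struggles to judge relative positions precisely.

These limitations are ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.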

Citation

If you find our work helpful, feel free to cite us.

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}