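We are pleased to introduce Qwen2-VL, the latest iteration of the Qwen-VL model, representing nearly a year of innovation.

- Industry-leading understanding of images at multiple resolutions and aspect ratios: Qwen2-VL performs strongly on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
- Understanding of videos longer than 20 minutes: Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialogue, and content creation.
- An agent that can operate mobile devices, robots, and similar hardware: with strong complex reasoning and decision-making abilities, Qwen2-VL can be integrated into devices such as mobile phones and robots and operate them automatically based on the visual environment and text instructions.
- Multilingual support: to serve users worldwide, Qwen2-VL now recognizes text in images in many languages besides English and Chinese, including most European languages, Japanese, Korean, Arabic, and Vietnamese.
- Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens and offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into separate parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.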
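To make the idea behind M-ROPE more concrete, the sketch below assigns each token a (temporal, height, width) position triple. Text tokens get three identical components, while the patches of an image share one temporal index and vary along height and width. The function name, the segment format, and the exact indexing scheme (a shared start offset for each image, with the following text resuming after the largest index used) are assumptions made for illustration; this README does not specify them, and the released model may differ.

```python
from typing import List, Tuple


def mrope_position_ids(segments: List[tuple]) -> List[Tuple[int, int, int]]:
    """Return one (temporal, height, width) triple per token.

    `segments` is a hypothetical description of the input sequence:
      ("text", n)           -> n text tokens
      ("image", rows, cols) -> a rows x cols grid of visual tokens
    """
    ids: List[Tuple[int, int, int]] = []
    next_pos = 0  # first free position index
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            n = seg[1]
            # 1D text: all three components are the same running index.
            ids.extend((next_pos + i, next_pos + i, next_pos + i) for i in range(n))
            next_pos += n
        elif kind == "image":
            rows, cols = seg[1], seg[2]
            # 2D image: constant temporal index, row/column offsets for h and w.
            for r in range(rows):
                for c in range(cols):
                    ids.append((next_pos, next_pos + r, next_pos + c))
            # Assumption: text after the image resumes past the largest index used.
            next_pos += max(rows, cols)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


# Two text tokens, a 2x3 image grid, then one more text token.
for triple in mrope_position_ids([("text", 2), ("image", 2, 3), ("text", 1)]):
    print(triple)
```

A video would extend the same idea by also advancing the temporal component from frame to frame.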

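We provide models at three parameter scales: 2 billion, 7 billion, and 72 billion. This repository contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.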

Image benchmarks:

| Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
|---|---|---|---|---|
| MMMU (val) | 51.8 | 49.8 | 60 | 54.1 |
| DocVQA (test) | 91.6 | 90.8 | - | 94.5 |
| InfoVQA (test) | 74.8 | - | - | 76.5 |
| ChartQA (test) | 83.3 | - | - | 83.0 |
| TextVQA (val) | 77.4 | 80.1 | - | 84.3 |
| OCRBench | 794 | 852 | 785 | 845 |
| MTVQA | - | - | - | 26.3 |
| RealWorldQA | 64.4 | - | - | 70.1 |
| MME (sum) | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
| MMBench-EN (test) | 81.7 | - | - | 83.0 |
| MMBench-CN (test) | 81.2 | - | - | 80.5 |
| MMBench-V1.1 (test) | 79.4 | 78.0 | 76.0 | 80.7 |
| MMT-Bench (test) | - | - | - | 63.7 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 |
| MathVision | - | - | - | 16.3 |

Video benchmarks:

| Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
|---|---|---|---|---|
| MVBench | 66.4 | 56.7 | - | 67.0 |
| PerceptionTest (test) | - | 57.1 | - | 62.3 |
| EgoSchema (test) | - | 60.1 | - | 66.7 |
| Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |

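The code for Qwen2-VL is included in the latest Hugging Face Transformers. We recommend installing from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you may encounter the following error:

```
KeyError: 'qwen2_vl'
```

We provide a toolkit that helps you handle various types of visual input more conveniently, as if you were calling an API. It supports base64, URLs, and interleaved images and videos. You can install it with the following command:

```bash
pip install qwen-vl-utils
```

Here is a code example showing how to call the chat model with transformers and qwen_vl_utils: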

```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download

model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios (this variant also needs `import torch`).
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained(model_dir)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count
# range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the accelerator ("npu" here; change it to match where the model is loaded)
inputs = inputs.to("npu")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

The following variant does not use qwen_vl_utils; it loads the image with PIL and passes it to the processor directly:

```python
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from modelscope import snapshot_download

model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("npu")

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```

Messages containing multiple images and a text query:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("npu")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Messages containing a list of images as a video and a text query:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video and a text query
# (this assignment replaces the one above; keep whichever form you need)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("npu")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Sample messages for batch inference:

```python
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("npu")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```

For input images, we support local files, base64, and URLs. For videos, we currently support only local files.
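
For the base64 route, a local image first has to be encoded into a data URI. The sketch below shows one way to do that with the standard library. The helper names are hypothetical, and the `data:image;base64,` prefix is simply copied from the base64 example in this README rather than taken from a specification of what the toolkit accepts.

```python
import base64
from pathlib import Path


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 data URI (prefix as used in this README)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


def file_to_data_uri(path: str) -> str:
    """Read an image file from disk and encode it as a data URI."""
    return to_data_uri(Path(path).read_bytes())


def build_image_message(image_ref: str, prompt: str) -> list:
    """Build a single-turn user message with one image and one text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# JPEG files begin with the bytes FF D8 FF, which base64-encode to "/9j/".
print(to_data_uri(b"\xff\xd8\xff"))  # data:image;base64,/9j/
```

With a real file, `build_image_message(file_to_data_uri("/path/to/your/image.jpg"), "Describe this image.")` would produce a message in the same shape as the other examples in this README.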

```python
# You can insert a local file path, a URL, or a base64-encoded image
# at the desired position in the message content.

# Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```

The model supports inputs at a wide range of resolutions. By default, it uses the image's native resolution; higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum number of pixels to suit your needs, for example a token count range of 256-1280, to balance speed and memory usage.
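
The relationship between a token range and a pixel range is plain arithmetic. The sketch below assumes that one visual token corresponds to a 28 x 28 pixel area, which is inferred from the `256*28*28` style of the settings in this README; the helper names are hypothetical and the token estimate is only approximate.

```python
PIXELS_PER_TOKEN = 28 * 28  # assumption: one visual token per 28x28 pixel area


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token count into the equivalent pixel count."""
    return num_tokens * PIXELS_PER_TOKEN


def approx_tokens(height: int, width: int) -> int:
    """Roughly estimate the number of visual tokens for an image of this size."""
    return (height * width) // PIXELS_PER_TOKEN


# A 256-1280 token range expressed as min_pixels / max_pixels.
print(pixel_budget(256), pixel_budget(1280))  # 200704 1003520

# A 280 x 420 image covers 10 x 15 areas of 28 x 28 pixels.
print(approx_tokens(280, 420))  # 150
```

Under the same assumption, the default range of 4-16384 tokens per image corresponds to 3,136 to 12,845,056 pixels.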

```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_dir, min_pixels=min_pixels, max_pixels=max_pixels
)
```

In addition, we provide two methods for fine-grained control over the size of the images fed to the model:
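
- Define min_pixels and max_pixels: images are resized so that their pixel count falls within the range of min_pixels and max_pixels while their aspect ratio is maintained.
- Specify exact dimensions: directly set resized_height and resized_width. These values are rounded to the nearest multiple of 28.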
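
The two methods can be sketched in a few lines. This is a minimal illustration of the behaviour described above, not the toolkit's actual implementation: the function names are hypothetical, and the scaling strategy (scale both sides by the same factor, then snap down or up to a multiple of 28) is an assumption.

```python
import math

FACTOR = 28  # dimensions are snapped to multiples of 28, as stated above


def round_to_factor(value: int, factor: int = FACTOR) -> int:
    """Round to the nearest multiple of `factor`, never below one factor."""
    return max(factor, round(value / factor) * factor)


def fit_to_pixel_range(height: int, width: int, min_pixels: int, max_pixels: int,
                       factor: int = FACTOR) -> tuple:
    """Pick a (height, width) whose area lies in [min_pixels, max_pixels].

    Both sides are scaled by the same amount, so the aspect ratio is
    approximately preserved (exactly preserved only up to the snapping step).
    """
    h = round_to_factor(height, factor)
    w = round_to_factor(width, factor)
    if h * w > max_pixels:
        # Shrink, rounding down so the result cannot exceed max_pixels.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge, rounding up so the result cannot fall below min_pixels.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


# Method 2: exact dimensions are snapped to multiples of 28.
print(round_to_factor(280), round_to_factor(300))

# Method 1: a 1080 x 1920 frame squeezed into a 256-1280 token budget.
print(fit_to_pixel_range(1080, 1920, 256 * 28 * 28, 1280 * 28 * 28))
```

Note that Python's `round` sends exact halves to the nearest even integer, so a value exactly halfway between two multiples of 28 may snap down rather than up.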

```python
# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```

Although Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its known limitations. These limitations are directions for ongoing optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

If you find our work helpful, feel free to cite us.

```
@article{Qwen2-VL,
  title={Qwen2-VL},
  author={Qwen team},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```