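We are excited to introduce Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves strong results on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20+ minutes: Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video question answering, dialogue, and content creation.
Agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated into devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now supports recognizing text in many other languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.
We provide three models with 2 billion, 7 billion, and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.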

Image benchmarks:

| Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
|---|---|---|---|---|
| MMMU (val) | 51.8 | 49.8 | 60 | 54.1 |
| DocVQA (test) | 91.6 | 90.8 | - | 94.5 |
| InfoVQA (test) | 74.8 | - | - | 76.5 |
| ChartQA (test) | 83.3 | - | - | 83.0 |
| TextVQA (val) | 77.4 | 80.1 | - | 84.3 |
| OCRBench | 794 | 852 | 785 | 845 |
| MTVQA | - | - | - | 26.3 |
| VCR (en, easy) | - | 73.88 | 83.60 | 89.70 |
| VCR (zh, easy) | - | 10.18 | 1.10 | 59.94 |
| RealWorldQA | 64.4 | - | - | 70.1 |
| MME (sum) | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
| MMBench-EN (test) | 81.7 | - | - | 83.0 |
| MMBench-CN (test) | 81.2 | - | - | 80.5 |
| MMBench-V1.1 (test) | 79.4 | 78.0 | 76.0 | 80.7 |
| MMT-Bench (test) | - | - | - | 63.7 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 |
| MathVision | - | - | - | 16.3 |

Video benchmarks:

| Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
|---|---|---|---|---|
| MVBench | 66.4 | 56.7 | - | 67.0 |
| PerceptionTest (test) | - | 57.1 | - | 62.3 |
| EgoSchema (test) | - | 60.1 | - | 66.7 |
| Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |
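
The code for Qwen2-VL is included in the latest Hugging Face transformers. We advise installing from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error:
KeyError: 'qwen2_vl'
We offer a toolkit that helps you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils: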
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2-VL-7B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# Without qwen_vl_utils: load the image with PIL and pass it to the processor directly
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
output_ids[len(input_ids) :]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# Messages containing a list of images as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
For input images, we support local files, base64, and URLs. For videos, we currently support only local files.
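As a rough illustration of the base64 form, the sketch below turns raw image bytes into a `data:image;base64,...` string and wraps it in a message with the same shape as the examples in this section. The helper names `to_data_uri` and `image_message` are hypothetical and not part of qwen_vl_utils; only the message layout and the `data:image;base64,` prefix are taken from this document.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a data URI (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


def image_message(image_ref: str, prompt: str) -> list:
    """Build a single-turn user message holding one image and one text query.

    `image_ref` may be a file:// path, an http(s) URL, or a data URI.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# The first three bytes of a JPEG file stand in for a real image here.
jpeg_header = b"\xff\xd8\xff"
messages = image_message(to_data_uri(jpeg_header), "Describe this image.")
print(messages[0]["content"][0]["image"])
```

In practice the bytes would come from reading a real image file in binary mode; the three-byte header is used only so the sketch runs without any file on disk.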
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64 encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
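]
The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to find the configuration that best fits their needs, such as a token count range of 256-1280, to balance speed and memory usage.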
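The relationship between a token range and a pixel budget can be sketched with a little arithmetic. The snippet assumes that one visual token covers a 28 x 28 pixel area, which is inferred from the `256 * 28 * 28` style expressions used in this document; the function names are hypothetical.

```python
# Assumption: one visual token corresponds to a 28 x 28 pixel area.
PIXELS_PER_TOKEN = 28 * 28


def pixels_for_tokens(tokens: int) -> int:
    """Pixel budget that corresponds to a given number of visual tokens."""
    return tokens * PIXELS_PER_TOKEN


def tokens_for_pixels(pixels: int) -> int:
    """Number of whole visual tokens that fit into a pixel budget."""
    return pixels // PIXELS_PER_TOKEN


# A 256-1280 token range expressed as min_pixels / max_pixels.
min_pixels = pixels_for_tokens(256)
max_pixels = pixels_for_tokens(1280)
print(min_pixels, max_pixels)
```

Under the same assumption, a budget of 50176 pixels corresponds to 64 tokens.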
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
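)
In addition, we provide two methods for fine-grained control over the size of the images fed to the model:
Define min_pixels and max_pixels: images are resized, with their aspect ratio preserved, so that their pixel count falls within the range from min_pixels to max_pixels.
Specify exact dimensions: directly set resized_height and resized_width. These values are rounded to the nearest multiple of 28.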
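To make the two rules more concrete, here is a minimal sketch of how a target size could be derived: snap each side to a multiple of 28, then scale both sides by a common ratio if the pixel count falls outside the allowed range. This is an illustration under stated assumptions, not the actual resizing code used by the processor or by qwen_vl_utils; `round_to_factor` and `fit_image` are hypothetical names.

```python
import math
from typing import Tuple

FACTOR = 28  # sides are snapped to multiples of 28, as described above


def round_to_factor(value: float, factor: int = FACTOR) -> int:
    """Round to the nearest multiple of `factor`, never below one factor."""
    return max(factor, round(value / factor) * factor)


def fit_image(
    height: int,
    width: int,
    min_pixels: int,
    max_pixels: int,
    factor: int = FACTOR,
) -> Tuple[int, int]:
    """Return (height, width) snapped to multiples of `factor` and scaled,
    with the aspect ratio roughly preserved, into [min_pixels, max_pixels]."""
    new_h = round_to_factor(height, factor)
    new_w = round_to_factor(width, factor)
    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        new_h = max(factor, math.floor(height / scale / factor) * factor)
        new_w = max(factor, math.floor(width / scale / factor) * factor)
    elif new_h * new_w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor
    return new_h, new_w


# An image already inside the budget keeps its (snapped) size.
print(fit_image(280, 420, 4 * 28 * 28, 16384 * 28 * 28))
# A large image is shrunk to fit a 1280-token budget.
print(fit_image(2000, 3000, 256 * 28 * 28, 1280 * 28 * 28))
```

With min_pixels and max_pixels both set to 50176, a square 448 x 448 input would come out as 224 x 224 under this sketch.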
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
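]
Although Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its known limitations.
These limitations are ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
If you find our work helpful, feel free to cite us.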
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}