Qwen2-VL-7B-Instruct

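Introduction

We are pleased to introduce Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

What's New in Qwen2-VL?

Key Enhancements:

State-of-the-art understanding of images at various resolutions and aspect ratios: Qwen2-VL performs strongly on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20 minutes or longer: Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialogue, and content creation.
Agent that can operate mobile devices, robots, and more: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated into devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now also recognizes text in multiple languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model Architecture Updates:

Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which offers a more human-like visual processing experience.
Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.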
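
To make the M-ROPE description above more concrete, here is a minimal sketch of how three-component position IDs could be assigned to a mixed sequence. It is an illustration under stated assumptions, not the model's actual implementation: the function name mrope_position_ids and the segment format are hypothetical, and the rules used (text tokens share one index across all three components, vision tokens index frame, row, and column separately, and the next segment continues from the largest index used so far) are a simplified reading of the idea.

```python
from typing import List, Sequence, Tuple

PositionId = Tuple[int, int, int]  # (temporal, height, width)


def mrope_position_ids(segments: Sequence[tuple]) -> List[PositionId]:
    """Assign (temporal, height, width) position IDs to a mixed sequence.

    Hypothetical segment format (an assumption for this sketch):
      ("text", n_tokens)
      ("image", rows, cols)
      ("video", frames, rows, cols)
    """
    ids: List[PositionId] = []
    next_pos = 0
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            # 1D text: all three components carry the same running index.
            for _ in range(seg[1]):
                ids.append((next_pos, next_pos, next_pos))
                next_pos += 1
            continue
        if kind == "image":
            frames, rows, cols = 1, seg[1], seg[2]
        elif kind == "video":
            frames, rows, cols = seg[1], seg[2], seg[3]
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        start = next_pos
        # 2D/3D vision: frame, row, and column are indexed independently.
        for t in range(frames):
            for h in range(rows):
                for w in range(cols):
                    ids.append((start + t, start + h, start + w))
        # Assumption: the next segment starts after the largest index used.
        next_pos = start + max(frames, rows, cols)
    return ids


if __name__ == "__main__":
    for pid in mrope_position_ids([("text", 2), ("image", 2, 2), ("text", 1)]):
        print(pid)
```

In this sketch a still image is treated as a one-frame video, so its temporal component stays constant while the height and width components vary across the patch grid.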

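We provide models at three parameter scales: 2 billion, 7 billion, and 72 billion. This repository contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.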

Evaluation

Image Benchmarks

Benchmark	InternVL2-8B	MiniCPM-V 2.6	GPT-4o-mini	Qwen2-VL-7B
MMMU_val	51.8	49.8	60	54.1
DocVQA_test	91.6	90.8	-	94.5
InfoVQA_test	74.8	-	-	76.5
ChartQA_test	83.3	-	-	83.0
TextVQA_val	77.4	80.1	-	84.3
OCRBench	794	852	785	845
MTVQA	-	-	-	26.3
RealWorldQA	64.4	-	-	70.1
MME_sum	2210.3	2348.4	2003.4	2326.8
MMBench-EN_test	81.7	-	-	83.0
MMBench-CN_test	81.2	-	-	80.5
MMBench-V1.1_test	79.4	78.0	76.0	80.7
MMT-Bench_test	-	-	-	63.7
MMStar	61.5	57.5	54.8	60.7
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0
HallBench_avg	45.2	48.1	46.1	50.6
MathVista_testmini	58.3	60.6	52.4	58.2
MathVision	-	-	-	16.3

Video Benchmarks

Benchmark	InternVL2-8B	LLaVA-OneVision-7B	MiniCPM-V 2.6	Qwen2-VL-7B
MVBench	66.4	56.7	-	67.0
PerceptionTest_test	-	57.1	-	62.3
EgoSchema_test	-	60.1	-	66.7
Video-MME_{wo/w subs}	54.0/56.9	58.2/-	60.9/63.6	63.3/69.0

Requirements

The code of Qwen2-VL is included in the latest Hugging Face Transformers. We advise building from source with pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error:

KeyError: 'qwen2_vl'
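
Before loading the model, it can help to check whether the installed Transformers build already ships the Qwen2-VL code. The following is a minimal sketch; the helper name has_qwen2_vl is hypothetical, and it only tests for the presence of the transformers.models.qwen2_vl module, without validating versions.

```python
import importlib.util


def has_qwen2_vl() -> bool:
    """Return True if the installed transformers package contains the qwen2_vl model code."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    # find_spec returns None when the submodule does not exist in this build.
    return importlib.util.find_spec("transformers.models.qwen2_vl") is not None


if has_qwen2_vl():
    print("qwen2_vl is available in the installed transformers package.")
else:
    print("qwen2_vl not found; consider installing transformers from source.")
```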

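Quickstart

We provide a toolkit that helps you handle various types of visual input more conveniently, as if you were calling an API. It supports base64, URLs, and interleaved images and videos. Install it with the following command: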

pip install qwen-vl-utils
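
As an illustration of the base64 input mentioned above, the sketch below turns raw image bytes into a message entry. The helper names are hypothetical, and the "data:image;base64," prefix is an assumption about the string format that qwen_vl_utils accepts; check the toolkit's documentation before relying on it.

```python
import base64


def image_bytes_to_data_uri(data: bytes) -> str:
    """Encode raw image bytes as a base64 data URI (prefix format is an assumption)."""
    encoded = base64.b64encode(data).decode("ascii")
    return "data:image;base64," + encoded


def image_entry(data: bytes) -> dict:
    """Build an image content entry in the same shape as the message examples in this document."""
    return {"type": "image", "image": image_bytes_to_data_uri(data)}


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real image file.
    print(image_entry(b"abc"))
```

With a real file, the bytes would typically come from open(path, "rb").read().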

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# (Uncommenting this block also requires `import torch`.)
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained(model_dir)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
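
The min_pixels and max_pixels comments in the snippet above imply that one visual token corresponds to a 28x28 pixel area. Under that reading, the following sketch estimates how many visual tokens an image of a given size would produce. The function name estimate_visual_tokens and the exact rounding and rescaling rules are assumptions made for illustration; the real processor may resize differently.

```python
import math

PATCH = 28  # pixels per token side, inferred from min_pixels = 256*28*28


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * PATCH * PATCH,
    max_pixels: int = 16384 * PATCH * PATCH,
) -> int:
    """Estimate the visual token count for one image (illustrative sketch only)."""
    # Snap each side to a multiple of the patch size, at least one patch.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by a common factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge both sides by a common factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


if __name__ == "__main__":
    # With the 256-1280 token range suggested in the comments above:
    print(estimate_visual_tokens(2800, 2800, 256 * PATCH * PATCH, 1280 * PATCH * PATCH))
```

Lowering max_pixels reduces the token count for large images, which is how the setting trades detail for speed and memory.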

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from modelscope import snapshot_download

model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")
# Load the model in the checkpoint's dtype on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query
# (this assignment replaces the frame-list example above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
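
When the frames of a video are stored as separate image files, the frame-list message shown above can be built programmatically. The helper below is a small sketch with a hypothetical name, frames_to_video_message; it assumes the frame files sort correctly by name (for example, zero-padded indices) and only constructs the message, leaving inference unchanged.

```python
from pathlib import Path
from typing import Any, Dict, Iterable, List


def frames_to_video_message(
    frame_paths: Iterable[str], prompt: str, fps: float = 1.0
) -> List[Dict[str, Any]]:
    """Build a single-turn message whose video is given as a list of frame files."""
    # Sort by file name so frames appear in order; assumes sortable names.
    ordered = sorted(str(p) for p in frame_paths)
    # Convert each path to an absolute file:// URI, as in the examples above.
    uris = [Path(p).resolve().as_uri() for p in ordered]
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": uris, "fps": fps},
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    demo = frames_to_video_message(
        ["/path/to/frame2.jpg", "/path/to/frame1.jpg"], "Describe this video."
    )
    print(demo[0]["content"][0]["video"])
```

The returned list can be passed to processor.apply_chat_template and process_vision_info in the same way as the hand-written messages.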

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

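Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Known limitations include:

Lack of audio support: the current model does not understand audio information in videos.
Data timeliness: our image dataset is updated through June 2023, so information after this date may not be covered.
Limited recognition of individuals and intellectual property (IP): the model's ability to recognize specific individuals or IPs is limited, and it may not cover all well-known personalities or brands.
Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution need improvement.
Insufficient counting accuracy: particularly in complex scenes, object counting accuracy is low and needs further improvement.
Weak spatial reasoning: especially in 3D space, the model's inference of positional relationships between objects is inadequate, making it hard to judge their relative positions precisely.

These limitations serve as directions for ongoing optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.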

Citation

If you find our work helpful, feel free to cite us.

@article{Qwen2-VL,
  title={Qwen2-VL},
  author={Qwen team},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}

