InternVL2_5-78B-AWQ

模型名称	视觉部分	语言部分	HF 链接
InternVL2_5-1B	InternViT-300M-448px-V2_5	Qwen2.5-0.5B-Instruct	🤗 链接
InternVL2_5-2B	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 链接
InternVL2_5-4B	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 链接
InternVL2_5-8B	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 链接
InternVL2_5-26B	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 链接
InternVL2_5-38B	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 链接
InternVL2_5-78B	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 链接

模型架构

如下图所示，InternVL 2.5 沿用了前代产品 InternVL 1.5 和 2.0 的模型架构，遵循“ViT-MLP-LLM”范式。在这一版本中，我们将新的增量预训练 InternViT 与多种预训练 LLM（包括 InternLM 2.5 和 Qwen 2.5）通过随机初始化的 MLP 投影器进行整合。

image/png

与之前的版本一样，我们应用了像素重排操作，将视觉 token 的数量减少到原始数量的四分之一。此外，我们采用了与 InternVL 1.5 类似的动态分辨率策略，将图像分割为 448×448 像素的图块。从 InternVL 2.0 开始的关键区别在于，我们额外引入了对多图像和视频数据的支持。

训练策略

多模态数据的动态高分辨率处理

在 InternVL 2.0 和 2.5 中，我们扩展了动态高分辨率训练方法，增强了其处理多图像和视频数据集的能力。

image/png

对于单图像数据集，图块总数 n_max 分配给单张图像以获得最大分辨率。视觉 token 被包裹在 <img> 和 </img> 标签中。
对于多图像数据集，图块总数 n_max 在样本中的所有图像之间分配。每张图像都标有 Image-1 等辅助标签，并被包裹在 <img> 和 </img> 标签中。
对于视频，每一帧都调整为 448×448 大小。帧标有 Frame-1 等标签，并与图像类似地被包裹在 <img> 和 </img> 标签中。

单模型训练流程

InternVL 2.5 中单个模型的训练流程分为三个阶段，旨在增强模型的视觉感知和多模态能力。

image/png

阶段 1：MLP 预热。 在此阶段，仅训练 MLP 投影器，而视觉编码器和语言模型保持冻结。为了获得更好的性能，应用了动态高分辨率训练策略，尽管这会增加成本。此阶段确保了稳健的跨模态对齐，并为稳定的多模态训练做好准备。
阶段 1.5：ViT 增量学习（可选）。 此阶段允许使用与阶段 1 相同的数据对视觉编码器和 MLP 投影器进行增量训练。它增强了编码器处理多语言 OCR 和数学图表等罕见领域的能力。训练完成后，该编码器可以在不同的 LLM 之间重用而无需重新训练，因此除非引入新领域，否则此阶段为可选。
阶段 2：全模型指令微调。 整个模型在高质量的多模态指令数据集上进行训练。实施严格的数据质量控制以防止 LLM 性能下降，因为噪声数据可能导致输出重复或错误等问题。此阶段完成后，训练过程即告结束。

渐进式扩展策略

我们提出了一种渐进式扩展策略，以高效对齐视觉编码器与大语言模型（LLMs）。该方法首先使用较小的LLM（如20B）进行训练，优化基础视觉能力和跨模态对齐，然后将视觉编码器迁移到更大的LLM（如72B），无需重新训练。这种复用方式省去了大型模型的中间训练阶段。

image/png

与Qwen2-VL使用的1.4万亿tokens相比，InternVL2.5-78B仅使用1200亿tokens，不足前者的十分之一。该策略最大限度地减少了冗余，最大化了预训练组件的复用，并支持复杂视觉语言任务的高效训练。

训练增强

为提高模型的实际应用适应性和性能，我们引入了两项关键技术：

随机JPEG压缩：将质量水平在75到100之间的随机JPEG压缩作为一种数据增强技术。这模拟了来自互联网资源的图像退化，增强了模型对含噪图像的鲁棒性。
损失重加权：为平衡不同长度响应的NTP损失，我们采用了一种名为平方平均的重加权策略。该方法平衡了不同长度响应的贡献，减轻了对长响应或短响应的偏向。

数据组织

数据集配置

在InternVL 2.0和2.5中，训练数据的组织由几个关键参数控制，以优化训练期间数据集的平衡和分布。

image/png

数据增强：JPEG压缩是有条件应用的：对图像数据集启用以增强鲁棒性，对视频数据集禁用以保持一致的帧质量。
最大分块数量：参数n_max控制每个数据集的最大分块数。例如，多图像或高分辨率数据使用较高值（24–36），标准图像使用较低值（6–12），视频则使用1。
重复因子：重复因子r调整数据集的采样频率。小于1的值会降低数据集的权重，大于1的值则会增加其权重。这确保了不同任务间的均衡训练，防止过拟合或欠拟合。

数据过滤 pipeline

在开发过程中，我们发现LLM对数据噪声高度敏感，即使是微小的异常（如离群值或重复数据）也会导致推理时的异常行为。特别是在长文本生成或思维链（CoT）推理任务中，重复生成问题尤为有害。

image/png

为应对这一挑战并支持未来研究，我们设计了一个高效的数据过滤pipeline来移除低质量样本。

image/png

该pipeline包含两个模块，对于纯文本数据，使用三种关键策略：

基于LLM的质量评分：每个样本由预训练的LLM结合领域特定提示进行评分（0–10）。评分低于阈值（如7）的样本将被移除，以确保数据质量。
重复检测：使用基于LLM的提示标记重复样本并进行人工审核。评分低于更严格阈值（如3）的样本将被排除，以避免重复模式。
基于启发式规则的过滤：使用规则检测异常，如异常句子长度或重复行。标记的样本会经过人工验证，确保准确后再移除。

对于多模态数据，使用两种策略：

重复检测：对非学术数据集中的重复样本进行标记和人工审核，防止模式循环。高质量数据集可豁免此过程。
基于启发式规则的过滤：应用类似规则检测视觉异常，标记的数据需人工验证以保持完整性。

训练数据

如下图所示，从InternVL 1.5到2.0再到2.5，微调数据混合在规模、质量和多样性方面经历了迭代改进。有关训练数据的更多信息，请参考我们的技术报告。

image/png

多模态能力评估

多模态推理与数学能力

image/png

OCR、图表与文档理解

image/png

多图像与真实世界理解

image/png

综合多模态与幻觉评估

image/png

视觉定位

image/png

多模态多语言理解

image/png

视频理解

image/png

语言能力评估

训练 InternVL 2.0 模型导致纯语言能力有所下降。InternVL 2.5 通过收集更多高质量开源数据并过滤低质量数据来解决这一问题，实现了纯语言性能的更好保留。

image/png

部署

LMDeploy

LMDeploy 是一个用于压缩、部署和服务 LLMs 与 VLMs 的工具包。

pip install lmdeploy>=0.6.4

LMDeploy 将多模态视觉语言模型（VLM）复杂的推理过程抽象为易于使用的流水线，类似于大语言模型（LLM）推理流水线。

“Hello, world”示例

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
response = pipe(('describe this image', image))
print(response.text)

如果执行此用例时出现 ImportError，请根据提示安装所需的依赖包。

多图推理

处理多张图像时，可将所有图像放入一个列表中。请注意，多张图像会导致输入 token 数量增加，因此通常需要增大上下文窗口的大小。

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

批量提示词推理

使用批量提示词进行推理非常简单，只需将它们放在列表结构中即可：

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

多轮对话

使用该流水线进行多轮对话有两种方式。一种是按照 OpenAI 的格式构建消息，并使用上述介绍的方法；另一种是使用 pipeline.chat 接口。

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

服务

LMDeploy 的 api_server 能够通过单条命令轻松将模型打包为服务。所提供的 RESTful API 兼容 OpenAI 的接口。以下是服务启动示例：

lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-AWQ --server-port 23333 --tp 4

要使用 OpenAI 风格的接口，您需要安装 OpenAI：

pip install openai

然后，使用以下代码进行 API 调用：

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

许可协议

本项目基于 MIT 许可协议发布。本项目使用预训练模型 Qwen2.5-72B-Instruct 作为组件，该模型基于 Qwen 许可协议授权。

引用

如果您在研究中发现本项目有帮助，请考虑引用：

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{gao2024mini,
  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={arXiv preprint arXiv:2410.16261},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}

InternVL2_5-78B-AWQ

[

$📂 GitHub$

](https://github.com/OpenGVLab/InternVL) [

$📜 InternVL 1.0$

](https://huggingface.co/papers/2312.14238) [

$📜 InternVL 1.5$

](https://huggingface.co/papers/2404.16821) [

$📜 Mini-InternVL$

](https://arxiv.org/abs/2410.16261) [

$📜 InternVL 2.5$

](https://huggingface.co/papers/2412.05271)

[

$🆕 博客$

](https://internvl.github.io/blog/) [

$🗨️ 对话演示$

](https://internvl.opengvlab.com/) [

$🤗 HF 演示$

](https://huggingface.co/spaces/OpenGVLab/InternVL) [

$🚀 快速开始$

](#quick-start) [

$📖 文档$

](https://internvl.readthedocs.io/en/latest/)

简介

image/png

InternVL 2.5 系列

下表为您概述 InternVL 2.5 系列的相关信息。

模型名称	视觉部分	语言部分	HF 链接
InternVL2_5-1B	InternViT-300M-448px-V2_5	Qwen2.5-0.5B-Instruct	🤗 链接
InternVL2_5-2B	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 链接
InternVL2_5-4B	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 链接
InternVL2_5-8B	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 链接
InternVL2_5-26B	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 链接
InternVL2_5-38B	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 链接
InternVL2_5-78B	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 链接

模型架构

image/png

训练策略

多模态数据的动态高分辨率处理

在 InternVL 2.0 和 2.5 中，我们扩展了动态高分辨率训练方法，增强了其处理多图像和视频数据集的能力。

image/png

对于单图像数据集，图块总数 n_max 分配给单张图像以获得最大分辨率。视觉 token 被包裹在 <img> 和 </img> 标签中。
对于多图像数据集，图块总数 n_max 在样本中的所有图像之间分配。每张图像都标有 Image-1 等辅助标签，并被包裹在 <img> 和 </img> 标签中。
对于视频，每一帧都调整为 448×448 大小。帧标有 Frame-1 等标签，并与图像类似地被包裹在 <img> 和 </img> 标签中。

单模型训练流程

InternVL 2.5 中单个模型的训练流程分为三个阶段，旨在增强模型的视觉感知和多模态能力。

image/png

阶段 1：MLP 预热。 在此阶段，仅训练 MLP 投影器，而视觉编码器和语言模型保持冻结。为了获得更好的性能，应用了动态高分辨率训练策略，尽管这会增加成本。此阶段确保了稳健的跨模态对齐，并为稳定的多模态训练做好准备。
阶段 1.5：ViT 增量学习（可选）。 此阶段允许使用与阶段 1 相同的数据对视觉编码器和 MLP 投影器进行增量训练。它增强了编码器处理多语言 OCR 和数学图表等罕见领域的能力。训练完成后，该编码器可以在不同的 LLM 之间重用而无需重新训练，因此除非引入新领域，否则此阶段为可选。
阶段 2：全模型指令微调。 整个模型在高质量的多模态指令数据集上进行训练。实施严格的数据质量控制以防止 LLM 性能下降，因为噪声数据可能导致输出重复或错误等问题。此阶段完成后，训练过程即告结束。

渐进式扩展策略

image/png

训练增强

为提高模型的实际应用适应性和性能，我们引入了两项关键技术：

随机JPEG压缩：将质量水平在75到100之间的随机JPEG压缩作为一种数据增强技术。这模拟了来自互联网资源的图像退化，增强了模型对含噪图像的鲁棒性。
损失重加权：为平衡不同长度响应的NTP损失，我们采用了一种名为平方平均的重加权策略。该方法平衡了不同长度响应的贡献，减轻了对长响应或短响应的偏向。

数据组织

数据集配置

在InternVL 2.0和2.5中，训练数据的组织由几个关键参数控制，以优化训练期间数据集的平衡和分布。

image/png

数据增强：JPEG压缩是有条件应用的：对图像数据集启用以增强鲁棒性，对视频数据集禁用以保持一致的帧质量。
最大分块数量：参数n_max控制每个数据集的最大分块数。例如，多图像或高分辨率数据使用较高值（24–36），标准图像使用较低值（6–12），视频则使用1。
重复因子：重复因子r调整数据集的采样频率。小于1的值会降低数据集的权重，大于1的值则会增加其权重。这确保了不同任务间的均衡训练，防止过拟合或欠拟合。

数据过滤 pipeline

image/png

为应对这一挑战并支持未来研究，我们设计了一个高效的数据过滤pipeline来移除低质量样本。

image/png

该pipeline包含两个模块，对于纯文本数据，使用三种关键策略：

基于LLM的质量评分：每个样本由预训练的LLM结合领域特定提示进行评分（0–10）。评分低于阈值（如7）的样本将被移除，以确保数据质量。
重复检测：使用基于LLM的提示标记重复样本并进行人工审核。评分低于更严格阈值（如3）的样本将被排除，以避免重复模式。
基于启发式规则的过滤：使用规则检测异常，如异常句子长度或重复行。标记的样本会经过人工验证，确保准确后再移除。

对于多模态数据，使用两种策略：

重复检测：对非学术数据集中的重复样本进行标记和人工审核，防止模式循环。高质量数据集可豁免此过程。
基于启发式规则的过滤：应用类似规则检测视觉异常，标记的数据需人工验证以保持完整性。

训练数据

如下图所示，从InternVL 1.5到2.0再到2.5，微调数据混合在规模、质量和多样性方面经历了迭代改进。有关训练数据的更多信息，请参考我们的技术报告。

image/png

多模态能力评估

多模态推理与数学能力

image/png

OCR、图表与文档理解

image/png

多图像与真实世界理解

image/png

综合多模态与幻觉评估

image/png

视觉定位

image/png

多模态多语言理解

image/png

视频理解

image/png

语言能力评估

image/png

部署

LMDeploy

LMDeploy 是一个用于压缩、部署和服务 LLMs 与 VLMs 的工具包。

pip install lmdeploy>=0.6.4

LMDeploy 将多模态视觉语言模型（VLM）复杂的推理过程抽象为易于使用的流水线，类似于大语言模型（LLM）推理流水线。

“Hello, world”示例

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
response = pipe(('describe this image', image))
print(response.text)

如果执行此用例时出现 ImportError，请根据提示安装所需的依赖包。

多图推理

处理多张图像时，可将所有图像放入一个列表中。请注意，多张图像会导致输入 token 数量增加，因此通常需要增大上下文窗口的大小。

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

批量提示词推理

使用批量提示词进行推理非常简单，只需将它们放在列表结构中即可：

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

多轮对话

使用该流水线进行多轮对话有两种方式。一种是按照 OpenAI 的格式构建消息，并使用上述介绍的方法；另一种是使用 pipeline.chat 接口。

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2_5-78B-AWQ'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

服务

LMDeploy 的 api_server 能够通过单条命令轻松将模型打包为服务。所提供的 RESTful API 兼容 OpenAI 的接口。以下是服务启动示例：

lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-AWQ --server-port 23333 --tp 4

要使用 OpenAI 风格的接口，您需要安装 OpenAI：

pip install openai

然后，使用以下代码进行 API 调用：

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

许可协议

本项目基于 MIT 许可协议发布。本项目使用预训练模型 Qwen2.5-72B-Instruct 作为组件，该模型基于 Qwen 许可协议授权。

引用

如果您在研究中发现本项目有帮助，请考虑引用：

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{gao2024mini,
  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={arXiv preprint arXiv:2410.16261},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}