MiMo-VL-7B-SFT-2508: MiMo-VL Technical Report

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiMo-VL Technical Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

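🔥🔥🔥 MiMo-VL 2508 Update

We are pleased to announce improved versions of the MiMo-VL models (MiMo-VL-7B-RL-2508 and MiMo-VL-7B-SFT-2508), with better performance across multiple benchmarks, stronger thinking control, and a better user experience.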

📈 Performance Improvements

MiMo-VL-7B-RL-2508 improves consistently on both image and video benchmarks, reaching 70.6 on MMMU and 70.8 on VideoMME.

Full evaluation results are given below.

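🤔 Thinking Control

A new thinking-control feature lets users turn off the model's reasoning mode with the no_think parameter:

- Thinking mode (default): shows the full reasoning process; control success rate of 100%.
- Non-thinking mode: responds directly without showing the reasoning process; control success rate of 99.84%.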

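❤️ User Experience

Our internal VLM Arena rating shows a clear improvement in real-world use:

- Current model (MiMo-VL-7B-RL-2508): 1131.2
- Previous version (MiMo-VL-7B-RL): 1093.9

These updates make the vision-language model more capable, more flexible, and more reliable in both academic evaluation and practical applications.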
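
To put the 37.3-point gap between the two Arena ratings in perspective, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the internal VLM Arena follows the standard Elo logistic model with a 400-point scale, which the report does not state; `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model (an assumption here)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


new_rating = 1131.2  # MiMo-VL-7B-RL-2508
old_rating = 1093.9  # MiMo-VL-7B-RL

p = expected_win_rate(new_rating, old_rating)
print(f"Expected win rate of the new model over the previous one: {p:.3f}")
```

Under that assumption, the newer model would be expected to win roughly 55% of pairwise comparisons against the previous release.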

😉 Case Study: What are the appealing features of this car?

Xiaomi SU7

MiMo-VL-7B-RL-2508's answer:


To determine the appealing features of the Xiaomi SU7, we analyze visual elements and context from the video:
1. Aesthetic Design
- The car boasts a sleek, aerodynamic exterior with smooth, modern lines.
- Distinctive lighting (sharp headlights and a continuous, glowing taillight strip) enhances its futuristic and sporty look.
- Vibrant color and polished finish make it visually striking.
2. Performance & Speed
- Multiple dynamic shots of the car driving at high speeds on diverse roads (desert, forest, city) emphasize agility and power.
- Close-ups of spinning wheels and the engine (with lightning-like effects) suggest advanced engineering and high-performance capabilities (likely electric or tech-driven).
3. Versatility
- The car is shown in varied environments (desert at dusk, forested roads, urban streets), highlighting its adaptability to different terrains and use cases.
4. Technological Innovation
- The engine close-up with electric-like lightning effects implies cutting-edge technology, positioning the car as a tech-forward vehicle (aligning with Xiaomi’s tech brand identity).
5. Brand Appeal
- Prominent “Xiaomi SU7” branding leverages Xiaomi’s reputation in consumer tech, adding credibility and allure for tech-savvy audiences.


In summary, the Xiaomi SU7 appeals through its striking design, high-performance capabilities, versatility, technological innovation, and strong brand association.

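Full Evaluation Results

Model Recommendations

Both versions of the MiMo-VL-7B-2508 model are now open-sourced on Hugging Face:

🤗 MiMo-VL-7B-RL-2508
- Recommended for most users.
🤗 MiMo-VL-7B-SFT-2508
- Intended as a base for users' own SFT and RL training. Compared with the previous SFT version, this model shows better RL stability.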

Deployment Parameters

temperature=0.3, top_p=0.95
The system prompt is already set in chat_template.json; no additional configuration is needed.

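Thinking Control

Users can control the thinking mode by appending /no_think to the query:

Thinking-mode query (default):
"What is the answer to the question in the image?"
Non-thinking-mode query:
"Identify the text in the image. /no_think"

❗️ Important: /no_think must be the last part of the user message. No user content, such as an image or video, may follow it.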
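
As an illustration of the rule above, here is a minimal sketch of a helper that switches a query between the two modes. The function name `with_thinking_mode` and the `thinking` flag are hypothetical conveniences, not part of the released code; the only behaviour taken from this document is that /no_think is appended at the very end of the user text.

```python
NO_THINK_TAG = "/no_think"


def with_thinking_mode(query: str, thinking: bool = True) -> str:
    """Return the query text, ending in /no_think only when thinking is disabled."""
    text = query.rstrip()
    # Drop an existing trailing tag first so the helper never duplicates it.
    if text.endswith(NO_THINK_TAG):
        text = text[: -len(NO_THINK_TAG)].rstrip()
    return text if thinking else f"{text} {NO_THINK_TAG}"


print(with_thinking_mode("What is the answer to the question in the image?"))
print(with_thinking_mode("Identify the text in the image.", thinking=False))
```

Because the tag has to close the user message, the returned string should be used as the final text item of that message.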

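Visual Input Placement

For prompts that contain a single image or video, always place the visual media before the text. For example:

✅ Correct example: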

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text",  "text": "Describe the image. /no_think"},
        ],
    }
]

❌ Incorrect example:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text",  "text": "Describe the image. /no_think"},
            {"type": "image", "image": image_path},
        ],
    }
]
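
The placement rule can also be enforced in code. The following is a sketch only: `build_user_message` and `follows_placement_rules` are hypothetical helpers, the file name is a placeholder, and the message layout simply mirrors the two examples above.

```python
from typing import Any, Dict, List

VISUAL_TYPES = {"image", "video"}


def build_user_message(text: str, media: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a user message with every visual item first and the text last."""
    return {"role": "user", "content": [*media, {"type": "text", "text": text}]}


def follows_placement_rules(message: Dict[str, Any]) -> bool:
    """Return True if no image or video item appears after a text item."""
    seen_text = False
    for item in message["content"]:
        if item["type"] == "text":
            seen_text = True
        elif item["type"] in VISUAL_TYPES and seen_text:
            return False
    return True


good = build_user_message(
    "Describe the image. /no_think",
    [{"type": "image", "image": "example.jpg"}],  # placeholder path
)
bad = {"role": "user", "content": list(reversed(good["content"]))}

print(follows_placement_rules(good), follows_placement_rules(bad))
```

Putting the text last also keeps /no_think at the end of the user message, as the thinking-control note requires.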

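I. Introduction

This report presents our work on building MiMo-VL-7B, a compact yet powerful vision-language model (VLM). MiMo-VL-7B has three components: (1) a native-resolution ViT encoder that preserves fine-grained visual detail; (2) an MLP projector for efficient cross-modal alignment; and (3) our MiMo-7B language model, which is specifically optimized for complex reasoning tasks.

MiMo-VL-7B is developed in two sequential training stages: (1) a four-stage pre-training phase, comprising projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning (SFT), which yields the MiMo-VL-7B-SFT model; and (2) a post-training phase in which we introduce Mixed On-policy Reinforcement Learning (MORL), a framework that integrates diverse reward signals covering perception accuracy, visual grounding precision, logical reasoning, and human/AI preferences, which yields the MiMo-VL-7B-RL model.

We open-source the MiMo-VL-7B series, including checkpoints for both the SFT and RL models. We believe this report and the released models will offer useful insights for developing strong reasoning VLMs and will benefit the broader community.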

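🛤️ Findings Along the Way

Incorporating high-quality, broad-coverage reasoning data during pre-training is crucial to model performance
- We built high-quality reasoning data by identifying diverse queries, using large reasoning models to generate responses with long chain-of-thought (CoT), and applying rejection sampling to ensure quality.
- Rather than treating this data as supplementary fine-tuning data, we integrated a substantial volume of synthetic reasoning data directly into the later pre-training stages. Experiments show that extended training at this stage keeps improving performance without saturation.
Mixed On-policy Reinforcement Learning (MORL) further improves performance, but stable, simultaneous improvement across multiple dimensions remains challenging
- We applied RL to a range of capabilities, including reasoning, perception, grounding, and human preference alignment, across text, image, and video modalities. This mixed training approach unlocks more of the model's potential, but interference across data domains remains an open problem.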
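
The rejection-sampling step mentioned in the first finding can be sketched in general terms as follows. This is not the authors' pipeline: the generator and acceptance check are toy stand-ins invented for illustration, and all names (`rejection_sample`, `toy_generate`, `toy_accept`) are hypothetical. The sketch only shows the pattern of drawing several candidate responses per query and keeping those that pass a quality check.

```python
import random
from typing import Callable, List


def rejection_sample(
    query: str,
    generate: Callable[[str], str],
    accept: Callable[[str, str], bool],
    num_candidates: int = 8,
) -> List[str]:
    """Draw several candidate responses and keep only those that pass the check."""
    candidates = [generate(query) for _ in range(num_candidates)]
    return [c for c in candidates if accept(query, c)]


# Toy stand-ins (hypothetical): a noisy "reasoning model" and an answer checker.
rng = random.Random(0)


def toy_generate(query: str) -> str:
    a, b = (int(x) for x in query.split("+"))
    guess = a + b + rng.choice([-1, 0, 0, 1])  # the toy model is sometimes off by one
    return f"Add {a} and {b} step by step. Answer: {guess}"


def toy_accept(query: str, response: str) -> bool:
    a, b = (int(x) for x in query.split("+"))
    return response.endswith(f"Answer: {a + b}")


kept = rejection_sample("2+3", toy_generate, toy_accept)
print(f"kept {len(kept)} of 8 candidates")
```

In a real data pipeline the acceptance check would be a verifier or quality filter suited to the task rather than an exact-answer match.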

II. Model Details

Models are available at Hugging Face Collections: MiMo-VL and ModelScope Collections: MiMo-VL.

Model	Description	Download (Hugging Face)	Download (ModelScope)
MiMo-VL-7B-SFT	VLM with strong reasoning potential after four-stage pre-training	🤗 XiaomiMiMo/MiMo-VL-7B-SFT	🤖️ XiaomiMiMo/MiMo-VL-7B-SFT
MiMo-VL-7B-RL	RL model that outperforms existing open-source models	🤗 XiaomiMiMo/MiMo-VL-7B-RL	🤖️ XiaomiMiMo/MiMo-VL-7B-RL

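III. Evaluation Results

General Capabilities

On general vision-language understanding, the MiMo-VL-7B models achieve state-of-the-art results among open-source models.

Reasoning Tasks

On multimodal reasoning, both the SFT and RL models significantly outperform all compared open-source baselines across the benchmarks.

[!IMPORTANT] Results marked with * were obtained with our evaluation framework. Tasks marked with ${\dagger}$ were evaluated by GPT-4o.

GUI Tasks

MiMo-VL-7B-RL shows strong GUI understanding and grounding. As a general-purpose VLM, MiMo-VL matches or exceeds GUI-specialized models.

Elo Rating

On our in-house evaluation dataset with GPT-4o as the judge, MiMo-VL-7B-RL achieves the highest Elo rating among all evaluated open-source vision-language models, ranking first among models ranging from 7B to 72B parameters.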

IV. Deployment

The MiMo-VL-7B series is fully compatible with the Qwen2_5_VLForConditionalGeneration architecture for deployment and inference.
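
Given that compatibility, a checkpoint can presumably be loaded through the Qwen2.5-VL classes in Hugging Face Transformers. The sketch below is a minimal example under stated assumptions, not an official recipe: it assumes a recent `transformers` release that ships `Qwen2_5_VLForConditionalGeneration`, plus `torch`, `accelerate` (for `device_map="auto"`), and `Pillow`. The function name `describe_image` and the `max_new_tokens` default are hypothetical; the repository name and sampling values are taken from this document.

```python
MODEL_ID = "XiaomiMiMo/MiMo-VL-7B-RL"  # repository name from the model table
SAMPLING = {"do_sample": True, "temperature": 0.3, "top_p": 0.95}  # deployment parameters


def describe_image(image_path: str, prompt: str, max_new_tokens: int = 512) -> str:
    """Run one image + text query through the model and return the decoded reply."""
    # Imported lazily so the module can be loaded without the heavy dependencies.
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # Visual input first, text last, as required by the placement rule.
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, **SAMPLING)
    # Strip the prompt tokens so only the newly generated reply is decoded.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

A call such as `describe_image("example.jpg", "Describe the image. /no_think")` would then request a direct answer without the reasoning trace; the file name is a placeholder.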

V. Citation

@misc{coreteam2025mimovltechnicalreport,
      title={MiMo-VL Technical Report}, 
      author={LLM-Core-Team Xiaomi},
      year={2025},
      eprint={2506.03569},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.03569}, 
}

VI. Contact

If you have any questions, please contact us at mimo@xiaomi.com or open an issue.
