We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) with advanced multimodal reasoning, long-context understanding, and strong agent capabilities, while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).
Kimi-VL performs strongly across challenging domains: as a general-purpose VLM, it excels at multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models. It also demonstrates remarkable capability on a wide range of demanding vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.
In comparative evaluations, it competes effectively with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, and surpasses GPT-4o in several specialized domains.
Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and clear perception: with an extended 128K context window, it can process long and diverse inputs, scoring 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc; its native-resolution vision encoder, MoonViT, further enables understanding of ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while keeping computational cost low for common visual inputs and general tasks.
Building on this, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capability, scoring 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista, while maintaining a compact footprint of 2.8B activated LLM parameters, setting a new standard for efficient yet capable multimodal thinking models.
The model adopts a Mixture-of-Experts (MoE) language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as illustrated in the figure below.
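To make the data flow concrete, here is a minimal, purely illustrative PyTorch sketch of how the three components fit together; all module names, dimensions, and expert counts below are placeholders rather than the actual Kimi-VL configuration:

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Toy mixture-of-experts FFN: a router sends each token to its top-k experts,
    so only a fraction of the total parameters is activated per token."""

    def __init__(self, dim: int = 1024, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, dim)
        weights, expert_idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


class KimiVLSketch(nn.Module):
    """Data flow only: a MoonViT-style vision encoder feeds an MLP projector,
    whose outputs join the text tokens consumed by the MoE language decoder."""

    def __init__(self, vision_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for MoonViT
        self.projector = nn.Sequential(  # MLP projector: vision features -> LM embedding space
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.moe_decoder_block = MoEFeedForward(lm_dim)  # stand-in for one MoE decoder block

    def forward(self, patch_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(patch_features))
        tokens = torch.cat([visual_tokens, text_embeds], dim=0)  # visual tokens precede text tokens
        return self.moe_decoder_block(tokens)
```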
🤗 For general multimodal perception and understanding, OCR, long video and long document processing, video perception, and agent use cases, we recommend Kimi-VL-A3B-Instruct for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider Kimi-VL-A3B-Thinking.
| Model | Total Params | Activated Params | Context Length | Download |
|---|---|---|---|---|
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |
> [!NOTE]
> Recommended parameter settings:
> - For Thinking models, we suggest `Temperature = 0.8`.
> - For Instruct models, we suggest `Temperature = 0.2`.
> - Non-thinking (Instruct) models may also use greedy sampling (`Temperature = 0.0`), consistent with our evaluation setup.
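For reference, these settings map directly onto the standard sampling arguments of `generate` in transformers; the sketch below assumes `model` and `inputs` are prepared exactly as in the inference example later in this README:

```python
def generate_with_recommended_settings(model, inputs, thinking: bool):
    """Sketch: apply the recommended temperatures for Thinking vs. Instruct models."""
    if thinking:
        # Thinking models: Temperature = 0.8 (max_new_tokens here is an illustrative value)
        return model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.8)
    # Instruct models: Temperature = 0.2; use do_sample=False for greedy decoding
    # (Temperature = 0.0), which matches our evaluation setup.
    return model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
```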
As an efficient model, Kimi-VL handles a wide variety of tasks robustly (fine-grained perception, math, college-level problems, OCR, agent tasks, and more) and supports a broad range of input forms (single image, multi-image, video, long document, and more).
A brief comparison with existing 10B-scale dense VLMs and DeepSeek-VL2 (A4.5B):
Full performance comparison (with GPT-4o included for reference):
| Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
|---|---|---|---|---|---|---|---|
| Architecture | - | - | Dense | Dense | Dense | MoE | MoE |
| # Activated Params (LLM+VT) | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| # Total Params | - | - | 8B | 11B | 12B | 28B | 16B |
| **College-level** | | | | | | | |
| MMMU-Val (Pass@1) | 69.1 | 60.0 | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| VideoMMMU (Pass@1) | 61.2 | - | 47.4 | 41.8 | 57.2 | 44.4 | 52.6 |
| MMVU-Val (Pass@1) | 67.4 | 61.6 | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| **General** | | | | | | | |
| MMBench-EN-v1.1 (Acc) | 83.1 | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | 83.1 |
| MMStar (Acc) | 64.7 | 54.8 | 63.9 | 49.8 | 56.1 | 55.5 | 61.3 |
| MMVet (Pass@1) | 69.1 | 66.9 | 67.1 | 57.6 | 64.9 | 60.0 | 66.7 |
| RealWorldQA (Acc) | 75.4 | 67.1 | 68.5 | 63.3 | 59.1 | 68.4 | 68.1 |
| AI2D (Acc) | 84.6 | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | 84.9 |
| **Multi-image** | | | | | | | |
| BLINK (Acc) | 68.0 | 53.6 | 56.4 | 39.8 | 50.3 | - | 57.3 |
| **Math** | | | | | | | |
| MathVista (Pass@1) | 63.8 | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | 68.7 |
| MathVision (Pass@1) | 30.4 | - | 25.1 | 13.6 | 32.1 | 17.3 | 21.4 |
| **OCR** | | | | | | | |
| InfoVQA (Acc) | 80.7 | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | 83.2 |
| OCRBench (Acc) | 815 | 785 | 864 | 753 | 702 | 811 | 867 |
| **OS Agent** | | | | | | | |
| ScreenSpot-V2 (Acc) | 18.1 | 6.9 | 84.2 | - | - | - | 92.8 |
| ScreenSpot-Pro (Acc) | 0.8 | - | 29.0 | - | - | - | 34.5 |
| OSWorld (Pass@1) | 5.03 | - | 2.5 | - | - | - | 8.22 |
| WindowsAgentArena (Pass@1) | 9.4 | 2.7 | 3.4 | - | - | - | 10.4 |
| **Long Document** | | | | | | | |
| MMLongBench-Doc (Acc) | 42.8 | 29.0 | 29.6 | 13.8 | 21.3 | - | 35.1 |
| **Long Video** | | | | | | | |
| Video-MME (w/o sub.) | 71.9 | 64.8 | 65.1 | 46.0 | 58.2 | - | 67.8 |
| Video-MME (w/ sub.) | 77.2 | 68.9 | 71.6 | 49.5 | 62.1 | - | 72.6 |
| MLVU-MCQ (Acc) | 64.6 | 48.1 | 70.2 | 44.4 | 52.3 | - | 74.2 |
| LongVideoBench (val) | 66.7 | 58.2 | 56.0 | 45.5 | 51.5 | - | 64.5 |
| **Video Perception** | | | | | | | |
| EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | 78.5 |
| VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | 37.4 |
| TOMATO | 37.7 | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | 31.7 |
> [!NOTE]
> Recommended prompt for OS agent tasks (the expected output is a coordinate point):
> `Please observe the screenshot, please locate the following elements with action and point. <instruction> [your instruction]`
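As a minimal sketch, this prompt can be wired into the same chat-template flow used in the inference example below; the screenshot path and instruction text here are placeholders:

```python
instruction = "Click the Settings icon"  # placeholder instruction
agent_prompt = (
    "Please observe the screenshot, please locate the following elements "
    f"with action and point. <instruction> {instruction}"
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./screenshot.png"},  # placeholder screenshot path
            {"type": "text", "text": agent_prompt},
        ],
    }
]
# From here, apply_chat_template / processor / generate proceed exactly as in the
# inference example below; the model is expected to answer with a coordinate point.
```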
The following shows how to run inference with our model using the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
{"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```

We have submitted merge request #16387 to vLLM. Until the MR is merged, you are welcome to deploy Kimi-VL using the branch corresponding to that vLLM MR.
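Once a vLLM build containing that branch is installed, serving is expected to follow vLLM's generic multimodal chat flow; the snippet below is only a sketch based on vLLM's public `LLM.chat` API, not a confirmed recipe from the MR, and the image URL is a placeholder:

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM build that already includes Kimi-VL support (e.g. the MR branch).
llm = LLM(model="moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
sampling = SamplingParams(temperature=0.2, max_tokens=512)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},  # placeholder image
            {"type": "text", "text": "What is the dome building in the picture?"},
        ],
    }
]

outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```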
```bibtex
@misc{kimiteam2025kimivltechnicalreport,
title={{Kimi-VL} Technical Report},
author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
year={2025},
eprint={2504.07491},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.07491},
}
```