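> [!Note]
> This is an improved version of Kimi-VL-A3B-Thinking. We recommend using this updated model instead of the previous release.

> [!Note]
> Please visit our technical blog for the recommended inference recipe for this model: Kimi-VL-A3B-Thinking-2506: A Quick Navigation

This is an updated version of Kimi-VL-A3B-Thinking with enhanced capabilities across the benchmarks reported below.

Comparison with efficient models and the two previous versions of Kimi-VL (GPT-4o results are for reference only and are shown in italics):
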
| Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
|---|---|---|---|---|---|---|
| **General Multimodal** | | | | | | |
| MMBench-EN-v1.1 (Acc) | *83.1* | 83.2 | 74.6 | 82.9 | 76.0 | 84.4 |
| RealWorldQA (Acc) | *75.4* | 68.5 | 59.1 | 68.1 | 64.0 | 70.0 |
| OCRBench (Acc) | *815* | 864 | 702 | 864 | 864 | 869 |
| MMStar (Acc) | *64.7* | 63.0 | 56.1 | 61.7 | 64.2 | 70.4 |
| MMVet (Acc) | *69.1* | 67.1 | 64.9 | 66.7 | 69.5 | 78.1 |
| **Reasoning** | | | | | | |
| MMMU (val, Pass@1) | *69.1* | 58.6 | 59.6 | 57.0 | 61.7 | 64.0 |
| MMMU-Pro (Pass@1) | *51.7* | 38.1 | 32.1 | 36.0 | 43.2 | 46.3 |
| **Math** | | | | | | |
| MATH-Vision (Pass@1) | *30.4* | 25.0 | 32.1 | 21.7 | 36.8 | 56.9 |
| MathVista_MINI (Pass@1) | *63.8* | 68.0 | 56.1 | 68.6 | 71.7 | 80.1 |
| **Video** | | | | | | |
| VideoMMMU (Pass@1) | *61.2* | 47.4 | 57.0 | 52.1 | 55.5 | 65.2 |
| MMVU (Pass@1) | *67.4* | 50.1 | 57.0 | 52.7 | 53.0 | 57.5 |
| Video-MME (w/ sub.) | *77.2* | 71.6 | 62.1 | 72.7 | 66.0 | 71.9 |
| **Agent Grounding** | | | | | | |
| ScreenSpot-Pro (Acc) | *0.8* | 29.0 | - | 35.4 | - | 52.8 |
| ScreenSpot-V2 (Acc) | *18.1* | 84.2 | - | 92.8 | - | 91.4 |
| OSWorld-G (Acc) | - | 31.5 | - | 41.6 | - | 52.5 |
| **Long Document** | | | | | | |
| MMLongBench-DOC (Acc) | *42.8* | 29.6 | 21.3 | 35.1 | 32.5 | 42.1 |

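To make the change between the two Thinking releases easier to read, the short sketch below computes per-benchmark gains from a handful of rows in the table above. The scores are copied from the table; the names `SCORES` and `gains` are hypothetical and exist only for this illustration.

```python
# Scores copied from the table above:
# (Kimi-VL-A3B-Thinking, Kimi-VL-A3B-Thinking-2506).
SCORES = {
    "MMBench-EN-v1.1": (76.0, 84.4),
    "MMStar": (64.2, 70.4),
    "MMVet": (69.5, 78.1),
    "MATH-Vision": (36.8, 56.9),
    "MathVista_MINI": (71.7, 80.1),
    "VideoMMMU": (55.5, 65.2),
    "MMLongBench-DOC": (32.5, 42.1),
}


def gains(scores):
    """Return {benchmark: new - old}, rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


# Print the benchmarks ordered from largest to smallest gain.
for name, delta in sorted(gains(SCORES).items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} +{delta:.1f}")
```

Among the rows listed here, MATH-Vision shows the largest gain, at 20.1 points.
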
Comparison with 30B-70B open-source models:

| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Qwen2.5-VL-72B | Gemma3-27B-IT |
|---|---|---|---|---|
| **General Multimodal** | | | | |
| MMBench-EN-v1.1 (Acc) | 84.4 | - | 88.3 | 78.9 |
| RealWorldQA (Acc) | 70.0 | - | 75.7 | 62.5 |
| OCRBench (Acc) | 869 | - | 885 | 753 |
| MMStar (Acc) | 70.4 | 69.5 | 70.8 | 63.1 |
| MMVet (Acc) | 78.1 | - | 74.0 | 71.0 |
| **Reasoning** | | | | |
| MMMU (val, Pass@1) | 64.0 | 70.0 | 70.2 | 64.9 |
| MMMU-Pro (Pass@1) | 46.3 | 49.5 | 51.1 | - |
| MATH-Vision (Pass@1) | 56.9 | 38.4 | 38.1 | 35.4 |
| MathVista_MINI (Pass@1) | 80.1 | 74.7 | 74.8 | 59.8 |
| **Video** | | | | |
| VideoMMMU (Pass@1) | 65.2 | - | 60.2 | 61.8 |
| MMVU (Pass@1) | 57.5 | - | 62.9 | 61.3 |
| Video-MME (w/ sub.) | 71.9 | 70.5/77.9 | 73.3/79.1 | - |
| **Agent Grounding** | | | | |
| ScreenSpot-Pro (Acc) | 52.8 | 39.4 | 43.6 | - |
| ScreenSpot-V2 (Acc) | 91.4 | - | - | - |
| OSWorld-G (Acc) | 52.5 | 46.5 | - | - |
| **Long Document** | | | | |
| MMLongBench-DOC (Acc) | 42.1 | - | 38.8 | - |

Text-only results, compared with 30B-scale non-thinking vision-language models:

| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Gemma3-27B-IT |
|---|---|---|---|
| MMLU | 82.0 | 78.4 | 76.9 |
| MMLU-Pro | 68.5 | 68.8 | 67.5 |
| MATH | 91.8 | 82.2 | 89.0 |
| GPQA-Diamond | 42.3 | 46.0 | 46.0 |
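
Because the model can generate up to 32K tokens in a single response, we recommend running inference with vLLM, which already supports the Kimi-VL series.

    MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation

> [!Note]
> flash-attn must be installed explicitly to avoid CUDA out-of-memory errors.
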
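
Before loading the model, it can help to confirm that the packages from the install command are actually present. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `check_environment` and the `REQUIRED` mapping are hypothetical, and only `vllm` carries a pinned version because that is the only pin given in the install command.

```python
from importlib import metadata

# Packages named in the install command; only vllm is pinned there.
REQUIRED = {"vllm": "0.9.1", "blobfile": None, "flash-attn": None}


def check_environment(required=REQUIRED):
    """Return a dict mapping each package name to a short status string."""
    report = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if pinned is not None and installed != pinned:
            report[name] = f"installed {installed}, expected {pinned}"
        else:
            report[name] = f"ok ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

This only inspects installed package metadata. It does not import the packages or verify that flash-attn was built against the local CUDA toolkit.
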
    import requests
    from PIL import Image
    from transformers import AutoProcessor
    from vllm import LLM, SamplingParams

    model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"

    llm = LLM(
        model_path,
        trust_remote_code=True,
        max_num_seqs=8,
        max_model_len=131072,
        limit_mm_per_prompt={"image": 256},
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


    def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
        """Split the model output into (thinking, summary) using the think markers."""
        if bot in text and eot not in text:
            # Thinking was cut off before the closing marker, so there is no summary.
            return "", ""
        if eot in text:
            return text[text.index(bot) + len(bot):text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
        return "", text


    OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

    # Download the demo image.
    url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)

    messages = [
        {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

    outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
    generated_text = outputs[0].outputs[0].text

    thinking, summary = extract_thinking_and_summary(generated_text)
    print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))

This section describes how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.

    import requests
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor


    def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
        """Split the model output into (thinking, summary) using the think markers."""
        if bot in text and eot not in text:
            # Thinking was cut off before the closing marker, so there is no summary.
            return "", ""
        if eot in text:
            return text[text.index(bot) + len(bot):text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
        return "", text


    OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

    url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
    model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_paths = [url]
    # The demo image is a URL, so download it before handing it to PIL.
    images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path} for image_path in image_paths
            ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=32768, temperature=0.8)
    # Keep only the newly generated tokens by dropping the prompt prefix.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(response)

@misc{kimiteam2025kimivltechnicalreport,
title={{Kimi-VL} Technical Report},
author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
year={2025},
eprint={2504.07491},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.07491},
}
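
Both inference examples above split the model output on the `◁think▷` and `◁/think▷` markers. The sketch below exercises that splitting on hand-written strings, so it can be checked without loading the model. The name `split_thinking` and the sample strings are hypothetical. Two behaviours are assumptions rather than something the examples prescribe: when generation stops before the closing marker, the partial reasoning is returned with an empty summary, and a closing marker without an opening one is tolerated.

```python
from typing import Tuple

BOT = "◁think▷"   # marker that opens the reasoning trace
EOT = "◁/think▷"  # marker that closes it


def split_thinking(text: str, bot: str = BOT, eot: str = EOT) -> Tuple[str, str]:
    """Split raw model output into (thinking, summary)."""
    if eot in text:
        head, _, summary = text.partition(eot)
        # Drop everything up to and including the opening marker, if present.
        thinking = head.split(bot, 1)[-1]
        return thinking.strip(), summary.strip()
    if bot in text:
        # Generation stopped before the closing marker: no summary yet.
        return text.split(bot, 1)[-1].strip(), ""
    # No markers at all: treat the whole output as the summary.
    return "", text.strip()


if __name__ == "__main__":
    samples = [
        "◁think▷step one◁/think▷answer",  # complete response
        "◁think▷step one",                # truncated while thinking
        "answer",                          # no reasoning trace
    ]
    for sample in samples:
        print(split_thinking(sample))
```

Whatever the input, the function returns a two-element tuple, so a caller can always unpack the result into `thinking, summary`.
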