MoonshotAI/Kimi-VL-A3B-Thinking-2506

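[!Note] This is an improved version of Kimi-VL-A3B-Thinking. We recommend using this updated model instead of the previous one.

[!Note] Please visit our tech blog for the recommended inference recipe for this model: Kimi-VL-A3B-Thinking-2506: A Quick Navigation

📄 Tech Report  |  📄 Github  |  💬 Chat Web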

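1. Introduction

This is an updated version of Kimi-VL-A3B-Thinking, with the following improved abilities:

  • Smarter thinking with fewer tokens: the 2506 version reaches higher accuracy on multimodal reasoning benchmarks: MathVision 56.9 (+20.1), MathVista 80.1 (+8.4), MMMU-Pro 46.3 (+3.3), and MMMU 64.0 (+2.1), while needing a 20% shorter thinking length on average.
  • Clearer vision with thinking: unlike the previous version, which specialized in thinking tasks, the 2506 version also achieves equal or better ability on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), and MMVet (78.4), matching or surpassing our non-thinking model (Kimi-VL-A3B-Instruct).
  • Extended to video scenarios: the 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2) while retaining good ability on general video understanding (71.9 on Video-MME, matching Kimi-VL-A3B-Instruct).
  • Extended to higher resolution: the 2506 version supports 3.2 million total pixels in a single image, 4 times that of the previous version. This brings notable gains on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, and 52.5 on OSWorld-G (full set with refusal).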
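
The 3.2-million-pixel limit mentioned in the last bullet can be turned into a quick pre-check on input images. The sketch below is only an illustration: the helper name `fit_within_pixel_budget` is hypothetical, and whether the model's processor already resizes oversized images on its own is not stated here, so treat this as an optional client-side check rather than a required step.

```python
import math

# "3.2 million total pixels in a single image", taken from the text above.
MAX_PIXELS = 3_200_000


def fit_within_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down uniformly so that width * height <= max_pixels.

    Hypothetical helper for illustration; images already within budget are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_within_pixel_budget(1920, 1080))  # about 2.07M pixels, within budget
print(fit_within_pixel_budget(4000, 3000))  # 12M pixels, scaled down
```

Rounding down on both sides keeps the resized image at or below the budget; the resulting size could then be passed to an image library's resize call before inference.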

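2. Performance

Comparison with efficient models and the two previous versions of Kimi-VL (GPT-4o results are for reference only and are shown in italics):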

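| Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
|---|---|---|---|---|---|---|
| General Multimodal | | | | | | |
| MMBench-EN-v1.1 (Acc) | *83.1* | 83.2 | 74.6 | 82.9 | 76.0 | 84.4 |
| RealWorldQA (Acc) | *75.4* | 68.5 | 59.1 | 68.1 | 64.0 | 70.0 |
| OCRBench (Acc) | *815* | 864 | 702 | 864 | 864 | 869 |
| MMStar (Acc) | *64.7* | 63.0 | 56.1 | 61.7 | 64.2 | 70.4 |
| MMVet (Acc) | *69.1* | 67.1 | 64.9 | 66.7 | 69.5 | 78.1 |
| Reasoning | | | | | | |
| MMMU (val, Pass@1) | *69.1* | 58.6 | 59.6 | 57.0 | 61.7 | 64.0 |
| MMMU-Pro (Pass@1) | *51.7* | 38.1 | 32.1 | 36.0 | 43.2 | 46.3 |
| Math | | | | | | |
| MATH-Vision (Pass@1) | *30.4* | 25.0 | 32.1 | 21.7 | 36.8 | 56.9 |
| MathVista_MINI (Pass@1) | *63.8* | 68.0 | 56.1 | 68.6 | 71.7 | 80.1 |
| Video | | | | | | |
| VideoMMMU (Pass@1) | *61.2* | 47.4 | 57.0 | 52.1 | 55.5 | 65.2 |
| MMVU (Pass@1) | *67.4* | 50.1 | 57.0 | 52.7 | 53.0 | 57.5 |
| Video-MME (w/ sub.) | *77.2* | 71.6 | 62.1 | 72.7 | 66.0 | 71.9 |
| Agent Grounding | | | | | | |
| ScreenSpot-Pro (Acc) | *0.8* | 29.0 | - | 35.4 | - | 52.8 |
| ScreenSpot-V2 (Acc) | *18.1* | 84.2 | - | 92.8 | - | 91.4 |
| OSWorld-G (Acc) | - | 31.5 | - | 41.6 | - | 52.5 |
| Long Document | | | | | | |
| MMLongBench-DOC (Acc) | *42.8* | 29.6 | 21.3 | 35.1 | 32.5 | 42.1 |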

Comparison with 30B-70B open-source models:

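| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Qwen2.5-VL-72B | Gemma3-27B-IT |
|---|---|---|---|---|
| General Multimodal | | | | |
| MMBench-EN-v1.1 (Acc) | 84.4 | - | 88.3 | 78.9 |
| RealWorldQA (Acc) | 70.0 | - | 75.7 | 62.5 |
| OCRBench (Acc) | 869 | - | 885 | 753 |
| MMStar (Acc) | 70.4 | 69.5 | 70.8 | 63.1 |
| MMVet (Acc) | 78.1 | - | 74.0 | 71.0 |
| Reasoning | | | | |
| MMMU (val, Pass@1) | 64.0 | 70.0 | 70.2 | 64.9 |
| MMMU-Pro (Pass@1) | 46.3 | 49.5 | 51.1 | - |
| MATH-Vision (Pass@1) | 56.9 | 38.4 | 38.1 | 35.4 |
| MathVista_MINI (Pass@1) | 80.1 | 74.7 | 74.8 | 59.8 |
| Video | | | | |
| VideoMMMU (Pass@1) | 65.2 | - | 60.2 | 61.8 |
| MMVU (Pass@1) | 57.5 | - | 62.9 | 61.3 |
| Video-MME (w/ sub.) | 71.9 | 70.5/77.9 | 73.3/79.1 | - |
| Agent Grounding | | | | |
| ScreenSpot-Pro (Acc) | 52.8 | 39.4 | 43.6 | - |
| ScreenSpot-V2 (Acc) | 91.4 | - | - | - |
| OSWorld-G (Acc) | 52.5 | 46.5 | - | - |
| Long Document | | | | |
| MMLongBench-DOC (Acc) | 42.1 | - | 38.8 | - |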

Pure-text results, compared with 30B-class non-thinking VLMs:

| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Gemma3-27B-IT |
|---|---|---|---|
| MMLU | 82.0 | 78.4 | 76.9 |
| MMLU-Pro | 68.5 | 68.8 | 67.5 |
| MATH | 91.8 | 82.2 | 89.0 |
| GPQA-Diamond | 42.3 | 46.0 | 46.0 |

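3. Usage

3.1. Inference with vLLM (recommended)

Because this model can generate long outputs of up to 32K tokens, we recommend running inference with vLLM, which already supports the Kimi-VL series.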

MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation

[!Note] It is important to install flash-attn explicitly to avoid CUDA out-of-memory errors.
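
Before loading the model, it can be worth confirming that the packages from the pip command above are actually present, since a missing flash-attn would only surface later as a memory error. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are simply those used in the install command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as they appear in the pip command above.
for name in ("vllm", "flash-attn", "blobfile"):
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```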

from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256}
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


import requests
from PIL import Image

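def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the model output into (thinking, summary)."""
    if bot in text and eot not in text:
        # Thinking was never closed (e.g. generation hit max_tokens): nothing usable.
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by treating everything before eot as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text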

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url,stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
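
The example above sends one image, while `limit_mm_per_prompt={"image": 256}` allows many more per prompt. A minimal sketch of building a multi-image user message follows; the helper name `build_user_message` is hypothetical, and it assumes the chat template accepts several image entries in one message, in the same shape as the single-image message above.

```python
def build_user_message(num_images: int, question: str, max_images: int = 256) -> list[dict]:
    """Build a chat `messages` list with one image placeholder per image, followed by the question.

    Hypothetical helper; max_images mirrors limit_mm_per_prompt={"image": 256} from the example above.
    """
    if not 0 <= num_images <= max_images:
        raise ValueError(f"num_images must be between 0 and {max_images}")
    # Same placeholder shape as the single-image example: the pixels are passed separately.
    content = [{"type": "image", "image": ""} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_user_message(3, "Which of these cats looks oldest?")
print(messages)
```

The result would go through `processor.apply_chat_template` as before. Passing a list of PIL images as `multi_modal_data["image"]` is our understanding of how vLLM takes multi-image input and should be checked against the installed version.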

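3.2. Inference with 🤗 Hugging Face Transformers

This section describes how to use our model at inference time with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.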

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the model output into (thinking, summary)."""
    if bot in text and eot not in text:
        # Thinking was never closed (e.g. generation hit max_new_tokens): nothing usable.
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker by treating everything before eot as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        return text[start:text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_paths = [url]
# Image.open cannot read a URL directly, so fetch each image over HTTP first.
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
# do_sample=True is required for temperature to take effect in generate().
generated_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
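
To see how the `◁think▷` and `◁/think▷` markers partition a response without loading the model, the split can be tried on plain strings. This is a standalone sketch: the function name `split_thinking` and the sample strings are made up for illustration, and unlike the helper above it keeps partial thinking when the closing marker is missing.

```python
def split_thinking(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Return (thinking, summary) for a decoded response.

    Illustrative variant: always returns a 2-tuple, also when a marker is missing.
    """
    start = text.find(bot)
    # Everything after the opening marker, or the whole text if there is none.
    body = text[start + len(bot):] if start != -1 else text
    end = body.find(eot)
    if end == -1:
        if start != -1:
            return body.strip(), ""  # thinking was opened but never closed
        return "", text.strip()      # no markers at all: treat everything as the summary
    return body[:end].strip(), body[end + len(eot):].strip()


# Made-up sample strings, not real model output.
for sample in (
    "◁think▷ looks fluffy ◁/think▷ Persian",
    "◁think▷ still reasoning",
    "Persian",
):
    thinking, summary = split_thinking(sample)
    print(repr(thinking), repr(summary))
```

Returning a 2-tuple in every branch keeps call sites such as `thinking, summary = ...` safe even when generation stops before the closing marker.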

4. Citation

@misc{kimiteam2025kimivltechnicalreport,
      title={{Kimi-VL} Technical Report}, 
      author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
      year={2025},
      eprint={2504.07491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07491}, 
}