GitHub | CookBook | Technical Report | Demo
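MiniCPM-V 4.5 is the latest and most capable model in the MiniCPM-V series. It is built on Qwen3-8B and SigLIP2-400M, with 8B parameters in total. Compared with previous MiniCPM-V and MiniCPM-o models, it delivers a significant performance improvement and introduces new practical features. Notable features of MiniCPM-V 4.5 include: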
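🔥 State-of-the-art vision-language capability. MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models such as Qwen2.5-VL 72B, in vision-language capability, making it the most performant MLLM under 30B parameters.
🎬 Efficient high-FPS and long video understanding. Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames can be jointly compressed into 64 video tokens (most MLLMs would normally use 1,536 tokens). The model can therefore perceive more video frames without increasing LLM inference cost. This brings efficient, state-of-the-art high-FPS (up to 10 FPS) video understanding and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, and other benchmarks.
⚙️ Controllable hybrid fast/deep thinking. MiniCPM-V 4.5 supports two modes: a fast thinking mode for efficient, frequent use with competitive performance, and a deep thinking mode for more complex problem solving. To balance efficiency and performance across user scenarios, the two modes can be switched in a highly controllable way.
💪 Strong OCR, document parsing, and more. Based on the LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344), using only a quarter of the visual tokens that most MLLMs need. The model achieves leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5. Among general MLLMs, its PDF document parsing reaches state-of-the-art performance on OmniDocBench. Based on the latest RLAIF-V and VisCPM techniques, it shows trustworthy behavior, outperforming GPT-4o-latest on MMHal-Bench, and supports more than 30 languages.
💫 Easy usage. MiniCPM-V 4.5 can be used in a variety of ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices; (2) quantized models in int4, GGUF, and AWQ formats, in 16 sizes; (3) SGLang and vLLM support for high-throughput and memory-efficient inference; (4) fine-tuning on new domains and tasks with Transformers and LLaMA-Factory; (5) a quick local WebUI demo; (6) an optimized local iOS app for iPhone and iPad; (7) an online web demo on a server. See the Cookbook for full usage instructions!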
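To make the resolution limit concrete, the following minimal sketch checks whether an image fits a pixel budget and, if it does not, computes an aspect-preserving downscale. It assumes the "1.8 million pixels" limit equals the 1344x1344 example quoted above. The helper names `fits_budget` and `downscale_to_budget` are hypothetical, and the model's actual preprocessing may differ.

```python
import math

# Assumption: the pixel limit equals the 1344x1344 example (1,806,336 pixels).
PIXEL_BUDGET = 1344 * 1344


def fits_budget(width: int, height: int, budget: int = PIXEL_BUDGET) -> bool:
    """Return True if the image area does not exceed the pixel budget."""
    return width * height <= budget


def downscale_to_budget(width: int, height: int, budget: int = PIXEL_BUDGET):
    """Return a size with (approximately) the same aspect ratio that fits the budget."""
    if fits_budget(width, height, budget):
        return width, height
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(budget / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


for w, h in [(1344, 1344), (2688, 672), (1920, 1080)]:
    print((w, h), fits_budget(w, h), downscale_to_budget(w, h))
```

Note that a wide 2688x672 image has the same area as 1344x1344, so under this assumption it fits without resizing.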
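Architecture: unified 3D-Resampler for high-density video compression. MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the trade-off between performance and efficiency in video understanding. By grouping up to 6 consecutive video frames and jointly compressing them into just 64 tokens (the same number of tokens the MiniCPM-V series uses for a single image), MiniCPM-V 4.5 achieves a 96x compression rate for video tokens. The model can therefore process more video frames at no additional LLM compute cost, which enables high-FPS video and long video understanding. The architecture encodes images, multi-image inputs, and videos in a unified way, so capabilities and knowledge transfer seamlessly between them.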
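The compression figures can be checked with a few lines of arithmetic. The sketch below assumes a vision-encoder patch size of 14 pixels, which is not stated in this document but is consistent with the quoted 96x rate. The helper `video_tokens` is hypothetical and assumes frames are always packed in groups of 6.

```python
import math

# Assumption (not stated in this document): 14-pixel ViT patches.
PATCH_SIZE = 14
FRAME_SIDE = 448          # frame resolution quoted in the text
FRAMES_PER_GROUP = 6      # frames compressed jointly
TOKENS_PER_GROUP = 64     # tokens produced per group
TYPICAL_TOKENS = 1536     # tokens most MLLMs would use for the same 6 frames

patches_per_frame = (FRAME_SIDE // PATCH_SIZE) ** 2        # 32 * 32 = 1024
patches_per_group = patches_per_frame * FRAMES_PER_GROUP   # 6144
compression_rate = patches_per_group / TOKENS_PER_GROUP    # 96.0
llm_token_saving = TYPICAL_TOKENS / TOKENS_PER_GROUP       # 24.0


def video_tokens(num_frames: int, frames_per_group: int = FRAMES_PER_GROUP) -> int:
    """Hypothetical helper: LLM tokens needed when frames are packed in groups."""
    return math.ceil(num_frames / frames_per_group) * TOKENS_PER_GROUP


print(compression_rate, llm_token_saving, video_tokens(180))
```

Under these assumptions, the 96x figure is measured against raw visual patches, while the LLM sees 24 times fewer tokens than the 1,536-token baseline.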
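Pre-training: unified learning of OCR and knowledge from documents. Existing MLLMs learn OCR capability and document knowledge through separate, isolated training approaches. We observe that the essential difference between the two is the visibility of the text in the image. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to switch adaptively between accurate text recognition (when the text is visible) and knowledge reasoning from multimodal context (when the text is heavily obscured). This removes the dependence on error-prone document parsers when learning knowledge from documents, and prevents the hallucinations caused by over-augmented OCR data, yielding top-tier OCR and multimodal knowledge performance with minimal engineering overhead.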
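As a rough illustration of the corruption idea (not the authors' data pipeline), the sketch below blends random noise into one rectangular text region at a chosen noise level. The function name `corrupt_region`, the uniform-noise blend, and the box format are all assumptions made for this example.

```python
import numpy as np


def corrupt_region(image, box, noise_level, rng):
    """Blend uniform noise into image[y0:y1, x0:x1].

    noise_level=0.0 leaves the text fully visible (an OCR-style target);
    noise_level=1.0 replaces it with noise (the text must be inferred from context).
    """
    x0, y0, x1, y1 = box
    out = image.astype(np.float32).copy()
    region = out[y0:y1, x0:x1]
    noise = rng.uniform(0, 255, size=region.shape)
    out[y0:y1, x0:x1] = (1.0 - noise_level) * region + noise_level * noise
    return out.clip(0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
page = np.full((32, 128), 255, dtype=np.uint8)   # blank grayscale "document"
page[12:20, 16:112] = 0                          # stand-in for a line of text
text_box = (16, 12, 112, 20)                     # (x0, y0, x1, y1)

# Sample a different noise level per training example.
for level in rng.uniform(0.0, 1.0, size=3):
    corrupted = corrupt_region(page, text_box, float(level), rng)
    print(round(float(level), 2), corrupted[12:20, 16:112].mean().round(1))
```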
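Post-training: hybrid fast/deep thinking with multimodal RL. MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model optimizes both modes jointly, which significantly improves fast-mode performance without compromising deep-mode capability. Combined with RLPR and RLAIF-V, it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.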
OpenCompass
| Model | Size | Avg. Score ↑ | Total Inference Time ↓ |
|---|---|---|---|
| GLM-4.1V-9B-Thinking | 10.3B | 76.6 | 17.5h |
| MiMo-VL-7B-RL | 8.3B | 76.4 | 11h |
| MiniCPM-V 4.5 | 8.7B | 77.0 | 7.5h |
Video-MME
| Model | Size | Avg. Score ↑ | Total Inference Time ↓ | GPU Memory ↓ |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 8.3B | 71.6 | 3h | 60G |
| GLM-4.1V-9B-Thinking | 10.3B | 73.6 | 2.63h | 32G |
| MiniCPM-V 4.5 | 8.7B | 73.5 | 0.26h | 28G |
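Both Video-MME and OpenCompass were evaluated with inference on 8×A100 GPUs. The inference time reported for Video-MME covers the full model-side computation. For a fair comparison, it excludes the external cost of video frame extraction, which depends on the frame extraction tool used.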
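To read the efficiency numbers as ratios, the short sketch below divides each model's total inference time by that of MiniCPM-V 4.5. The hours are copied from the two tables above; the helper name `relative_time` is hypothetical.

```python
# Total inference time in hours, copied from the tables above.
opencompass_hours = {
    "GLM-4.1V-9B-Thinking": 17.5,
    "MiMo-VL-7B-RL": 11.0,
    "MiniCPM-V 4.5": 7.5,
}
video_mme_hours = {
    "Qwen2.5-VL-7B-Instruct": 3.0,
    "GLM-4.1V-9B-Thinking": 2.63,
    "MiniCPM-V 4.5": 0.26,
}


def relative_time(hours, reference="MiniCPM-V 4.5"):
    """Return each model's inference time as a multiple of the reference model's."""
    ref = hours[reference]
    return {name: round(t / ref, 2) for name, t in hours.items()}


print(relative_time(opencompass_hours))
print(relative_time(video_mme_hours))
```

On these figures, the other models take roughly 1.5x to 2.3x as long on OpenCompass and more than 10x as long on Video-MME.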
We have deployed MiniCPM-V 4.5 on an iPad M4 with the iOS demo app. The demo videos are raw, unedited screen recordings.
| Category | Framework | Cookbook Link | Upstream PR | Supported Since (Branch) | Supported Since (Release) |
|---|---|---|---|---|---|
| Edge (on-device) | Llama.cpp | Llama.cpp Doc | #15575 (2025-08-26) | master (2025-08-26) | b6282 |
| Edge (on-device) | Ollama | Ollama Doc | #12078 (2025-08-26) | Merging | Waiting for official release |
| Serving (cloud) | vLLM | vLLM Doc | #23586 (2025-08-26) | main (2025-08-27) | v0.10.2 |
| Serving (cloud) | SGLang | SGLang Doc | #9610 (2025-08-26) | Merging | Waiting for official release |
| Fine-tuning | LLaMA-Factory | LLaMA-Factory Doc | #9022 (2025-08-26) | main (2025-08-26) | Waiting for official release |
| Quantization | GGUF | GGUF Doc | - | - | - |
| Quantization | BNB | BNB Doc | - | - | - |
| Quantization | AWQ | AWQ Doc | - | - | - |
| Demos | Gradio Demo | Gradio Demo Doc | - | - | - |
Note: If you would like us to prioritize support for another open-source framework, please let us know via this short form.
To enable thinking mode, pass `enable_thinking=True` to the chat function.
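The usage examples in this section consume the reply either as a stream of text chunks (`stream=True`) or as a single value that is printed directly. As a small convenience sketch, assuming the chat call returns either a plain string or an iterable of string chunks, a helper like the hypothetical `collect_answer` below normalizes both cases into one string.

```python
from typing import Iterable, Union


def collect_answer(answer: Union[str, Iterable[str]], echo: bool = False) -> str:
    """Return the full reply text for a plain string or a stream of text chunks.

    Assumption: streamed replies yield `str` chunks that concatenate to the answer.
    """
    if isinstance(answer, str):
        if echo:
            print(answer)
        return answer
    parts = []
    for chunk in answer:
        if echo:
            print(chunk, end="", flush=True)  # show text as it arrives
        parts.append(chunk)
    return "".join(parts)


# Stand-ins for a non-streamed and a streamed reply.
print(collect_answer("karst topography"))
print(collect_answer(iter(["karst ", "topo", "graphy"])))
```

The collected string can then be appended to the message history as the assistant turn before the next round.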
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking = False  # If `enable_thinking=True`, the thinking mode is enabled.
stream = True  # With `stream=True`, the answer is consumed chunk by chunk, as in the loops below.

# First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking,
    stream=stream
)

# Accumulate the streamed chunks while printing them as they arrive.
generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [generated_text]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    stream=stream
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')
```

You will get the following output:
```
# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
```

The 3D-Resampler compresses multiple frames into 64 tokens by introducing `temporal_ids`. To use it for video, organize your video data into two corresponding sequences: `frames: List[Image]` and `temporal_ids: List[List[int]]`.
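To make the shape of `temporal_ids` concrete, here is a minimal sketch that turns frame timestamps into integer ids on a 0.1-second grid and groups them by packing size. It is a simplification: it uses plain rounding rather than the nearest-neighbour lookup in the full video example, and the helper names `to_temporal_ids` and `group` are hypothetical.

```python
TIME_SCALE = 0.1  # seconds per temporal id step, matching the video example's constant


def to_temporal_ids(timestamps, time_scale=TIME_SCALE):
    """Map timestamps (in seconds) to integer ids on a fixed time grid.

    Simplification: plain rounding instead of a nearest-neighbour search.
    """
    return [int(round(t / time_scale)) for t in timestamps]


def group(items, size):
    """Split a flat list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Seven frames sampled at 5 FPS, packed three frames per group.
timestamps = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
temporal_ids = group(to_temporal_ids(timestamps), 3)
print(temporal_ids)  # [[0, 2, 4], [6, 8, 10], [12]]
```

Each inner list corresponds to one group of frames that is compressed jointly, and the flat list of frames keeps the same order as the ids.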
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord
from scipy.spatial import cKDTree
import numpy as np
import math

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES = 180  # Maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.
MAX_NUM_PACKING = 3   # Maximum packing number of video frames. Valid range: 1-6
TIME_SCALE = 0.1      # Resolution of the temporal id grid, in seconds

def map_to_nearest_scale(values, scale):
    # Snap each value to the nearest point on the scale grid.
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]


def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]

def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps

    # Decide how many frames to sample and how many frames to pack per group.
    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]
    frame_idx = np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)

    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')

    frames = vr.get_batch(frame_idx).asnumpy()

    # Convert frame indices to timestamps, then to integer ids on the TIME_SCALE grid.
    frame_idx_ts = frame_idx / fps
    scale = np.arange(0, video_duration, TIME_SCALE)

    frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
    frame_ts_id = frame_ts_id.astype(np.int32)

    assert len(frames) == len(frame_ts_id)

    frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
    frame_ts_id_group = group_array(frame_ts_id, packing_nums)

    return frames, frame_ts_id_group


video_path = "video_test.mp4"
fps = 5  # fps for video
force_packing = None  # You can set force_packing to ensure that 3D packing is forcibly enabled; otherwise, encode_video will dynamically set the packing quantity based on the duration.
frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)

question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=1,
    temporal_ids=frame_ts_id_group
)
print(answer)
```

To chat about multiple images, pass them together in a single user message:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```

For in-context few-shot learning, supply example image and answer pairs as earlier turns of the conversation:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

# Two worked examples followed by the actual query.
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```

👏 Welcome to explore the key techniques behind MiniCPM-V 4.5 and our team's other multimodal projects:
VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
```bibtex
@misc{yu2025minicpmv45cookingefficient,
      title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
      author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and Bokai Xu and Junbo Cui and Yingjing Xu and Liqing Ruan and Luoyuan Zhang and Hanyu Liu and Jingkun Tang and Hongyuan Liu and Qining Guo and Wenhao Hu and Bingxiang He and Jie Zhou and Jie Cai and Ji Qi and Zonghao Guo and Chi Chen and Guoyang Zeng and Yuxuan Li and Ganqu Cui and Ning Ding and Xu Han and Yuan Yao and Zhiyuan Liu and Maosong Sun},
      year={2025},
      eprint={2509.18154},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.18154},
}

@article{yao2024minicpm,
      title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
      author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
      journal={Nature Communications},
      volume={16},
      pages={5509},
      year={2025}
}
```