OpenBMB/MiniCPM-V-4_5-GPTQ

A GPT-4o-level multimodal large language model (MLLM) for single-image, multi-image, and high-FPS video understanding on your phone

GitHub | CookBook | Technical Report | Demo

MiniCPM-V 4.5

MiniCPM-V 4.5 is the latest and most capable model in the MiniCPM-V series. It is built on Qwen3-8B and SigLIP2-400M, with 8B parameters in total. It delivers a significant performance improvement over the previous MiniCPM-V and MiniCPM-o models and introduces new useful features. Notable features of MiniCPM-V 4.5 include:

  • 🔥 State-of-the-art vision-language capability. MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models such as Qwen2.5-VL 72B, in vision-language capability, making it the best-performing MLLM under 30B parameters.

  • 🎬 Efficient high-FPS and long video understanding. Powered by a new unified 3D-Resampler for images and videos, MiniCPM-V 4.5 achieves a 96x video token compression rate: six 448x448 video frames are jointly compressed into 64 video tokens, where most MLLMs would spend 1,536 tokens (a quick numeric check follows this list). This means the model can perceive many more video frames without increasing LLM inference cost, enabling state-of-the-art, efficient high-FPS (up to 10 FPS) and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, and more.

  • ⚙️ Controllable hybrid fast/deep thinking. MiniCPM-V 4.5 supports both a fast-thinking mode for efficient, frequent use with competitive performance, and a deep-thinking mode for solving more complex problems. The two modes can be switched in a highly controllable way (via the enable_thinking flag shown in the Usage section below) to balance efficiency and performance across user scenarios.

  • 💪 Strong OCR, document parsing, and other capabilities. Based on the LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344), using only 1/4 of the visual tokens most MLLMs require. It achieves leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5, and state-of-the-art PDF document parsing among general-purpose MLLMs on OmniDocBench. Built on the latest RLAIF-V and VisCPM techniques, it features trustworthy behavior, outperforming GPT-4o-latest on MMHal-Bench, and supports more than 30 languages.

  • 💫 Easy to use. MiniCPM-V 4.5 can be used in many ways: (1) llama.cpp and ollama for efficient CPU inference on local devices, (2) quantized models in int4, GGUF, and AWQ formats in 16 sizes, (3) SGLang and vLLM for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks with Transformers and LLaMA-Factory, (5) quick local WebUI demo setup, (6) an optimized local iOS app for iPhone and iPad, and (7) an online web demo on a server. See our Cookbook for full usage details!

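As a quick sanity check of the figures quoted in the list above (the 96x video token compression rate and the "up to 1.8 million pixels" limit), here is a small back-of-the-envelope calculation. The 14x14 ViT patch grid is our own assumption for illustration; it is simply the value consistent with the stated 96x figure.

# Back-of-the-envelope check of figures quoted above. The 14x14 patch size is an
# assumption made for illustration; the model card only states the 6-frame -> 64-token
# packing, the 96x compression rate, and the ~1,536-token baseline.
frames_per_group  = 6
frame_size        = 448
patch_size        = 14                                     # assumed ViT patch size
patches_per_frame = (frame_size // patch_size) ** 2        # 32 * 32 = 1024
raw_patch_tokens  = frames_per_group * patches_per_frame   # 6144 patch tokens for 6 frames

print(raw_patch_tokens / 64)   # 96.0 -> the stated 96x video token compression rate
print(1536 / 64)               # 24.0 -> 24x fewer tokens than the ~1,536 a typical MLLM spends
print(1344 * 1344)             # 1806336 -> the "up to 1.8 million pixels" figure
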
Key Techniques

  • Architecture: a unified 3D-Resampler for high-density video compression. MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping up to 6 consecutive video frames and jointly compressing them into just 64 tokens (the same token count the MiniCPM-V series uses for a single image), MiniCPM-V 4.5 achieves a 96x video token compression rate. This lets the model process more video frames without increasing LLM computation cost, enabling high-FPS and long video understanding. The architecture supports unified encoding of images, multi-image inputs, and videos, ensuring seamless transfer of capabilities and knowledge.

  • Pre-training: unified learning of document OCR and knowledge. Existing MLLMs typically learn OCR capability and document knowledge through isolated training. We observe that the essential difference between the two lies in the visibility of the text in the image. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to switch adaptively and appropriately between accurate text recognition (when text is visible) and knowledge reasoning grounded in multimodal context (when text is heavily obscured); an illustrative sketch follows this list. This removes the dependence on error-prone document parsers when learning knowledge from documents and prevents hallucinations caused by over-augmented OCR data, achieving top-tier OCR and multimodal knowledge performance with minimal engineering overhead.

  • Post-training: hybrid fast/deep thinking with multimodal reinforcement learning. MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: a fast-thinking mode for efficient everyday use and a deep-thinking mode for complex tasks. Using a new hybrid reinforcement learning method, the model optimizes both modes jointly, significantly improving fast-mode performance without compromising deep-mode capability. Combined with RLPR and RLAIF-V, the model generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.

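To make the pre-training idea above concrete, here is a minimal illustrative sketch (our own hypothetical code, not the actual MiniCPM-V training pipeline): each text region of a document image is blended with noise at a randomly sampled severity, so the reconstruction target ranges from ordinary OCR (text still readable) to context- and knowledge-based inference (text fully occluded).

# Illustrative sketch only; NOT the actual MiniCPM-V data pipeline.
# Corrupt document text regions with a variable noise level so the model must either
# read the text (low severity) or infer it from context/knowledge (high severity).
import numpy as np
from PIL import Image

def corrupt_text_regions(img: Image.Image, boxes, rng=None):
    """Blend Gaussian noise into each text box with a random severity in [0, 1]."""
    rng = rng or np.random.default_rng()
    arr = np.asarray(img).astype(np.float32)
    severities = []
    for (x0, y0, x1, y1) in boxes:               # hypothetical (left, top, right, bottom) pixel boxes
        severity = float(rng.uniform(0.0, 1.0))  # 0 = untouched text, 1 = fully masked by noise
        region = arr[y0:y1, x0:x1]
        noise = rng.normal(127.5, 60.0, region.shape)
        arr[y0:y1, x0:x1] = (1 - severity) * region + severity * noise
        severities.append(severity)
    return Image.fromarray(arr.clip(0, 255).astype(np.uint8)), severities

# Usage sketch: corrupted, levels = corrupt_text_regions(doc_image, ocr_boxes)
# The training pair is then (corrupted image, original text); the model learns to rely
# on OCR when levels are low and on multimodal context/knowledge when they are high.
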
Evaluation

Inference Efficiency

OpenCompass

| Model | Size | Avg Score ↑ | Total Inference Time ↓ |
|---|---|---|---|
| GLM-4.1V-9B-Thinking | 10.3B | 76.6 | 17.5 h |
| MiMo-VL-7B-RL | 8.3B | 76.4 | 11 h |
| MiniCPM-V 4.5 | 8.7B | 77.0 | 7.5 h |

Video-MME

| Model | Size | Avg Score ↑ | Total Inference Time ↓ | GPU Memory ↓ |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 8.3B | 71.6 | 3 h | 60 GB |
| GLM-4.1V-9B-Thinking | 10.3B | 73.6 | 2.63 h | 32 GB |
| MiniCPM-V 4.5 | 8.7B | 73.5 | 0.26 h | 28 GB |

Both the Video-MME and OpenCompass evaluations use 8×A100 GPUs for inference. The inference time reported for Video-MME covers the full model-side computation; for a fair comparison, it excludes the external cost of video frame extraction, which depends on the specific frame-extraction tool.

Examples

(Example images: en_case1, en_case2, en_case3)

We deploy MiniCPM-V 4.5 on an iPad M4 with our iOS demo. The demo video is the raw screen recording without edits.

Framework Support Matrix

| Category | Framework | Usage Guide | Upstream PR | Supported Since (branch) | Supported Since (release) |
|---|---|---|---|---|---|
| Edge (on-device) | Llama.cpp | Llama.cpp Doc | #15575 (2025-08-26) | master (2025-08-26) | b6282 |
| | Ollama | Ollama Doc | #12078 (2025-08-26) | Merging | Awaiting official release |
| Serving (cloud) | vLLM | vLLM Doc | #23586 (2025-08-26) | main (2025-08-27) | v0.10.2 |
| | SGLang | SGLang Doc | #9610 (2025-08-26) | Merging | Awaiting official release |
| Fine-tuning | LLaMA-Factory | LLaMA-Factory Doc | #9022 (2025-08-26) | main (2025-08-26) | Awaiting official release |
| Quantization | GGUF | GGUF Doc | — | — | — |
| | BNB | BNB Doc | — | — | — |
| | AWQ | AWQ Doc | — | — | — |
| Demo | Gradio Demo | Gradio Demo Doc | — | — | — |

Note: If you would like us to prioritize support for another open-source framework, please let us know via this short form.

Usage

To enable thinking mode, pass the argument enable_thinking=True to the chat function.

Chat with Image

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
stream=True # If `stream=True`, the answer is returned as a generator of text chunks; if `stream=False`, it is returned as a complete string

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking,
    stream=True
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [generated_text]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    stream=True
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

You will get the following output:

# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here. These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
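
To try the deep-thinking mode mentioned earlier, a minimal variation of the same call works; this sketch reuses the model, tokenizer, and image objects from the snippet above and turns streaming off so the reply comes back as a single string.

# Deep-thinking mode, non-streaming (sketch; reuses model/tokenizer/image from above).
answer = model.chat(
    msgs=[{'role': 'user', 'content': [image, "What is the landform in the picture?"]}],
    tokenizer=tokenizer,
    enable_thinking=True,  # switch from fast thinking to deep thinking
    stream=False           # return the complete answer as one string
)
print(answer)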

Chat with Video

# The 3D-Resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
# To use it, organize your video data into two corresponding sequences:
#   frames: List[Image]
#   temporal_ids: List[List[int]]

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord
from scipy.spatial import cKDTree
import numpy as np
import math

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES=180 # Maximum number of packed frame groups fed to the model. The actual maximum number of raw frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.
MAX_NUM_PACKING=3  # Maximum packing number of video frames. Valid range: 1-6.
TIME_SCALE = 0.1   # Each temporal id corresponds to one TIME_SCALE-second step on the time axis.

def map_to_nearest_scale(values, scale):
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]


def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]

def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps
        
    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
        
    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]      
    frame_idx =  np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)
    
    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')
    
    frames = vr.get_batch(frame_idx).asnumpy()

    frame_idx_ts = frame_idx / fps
    scale = np.arange(0, video_duration, TIME_SCALE)

    frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
    frame_ts_id = frame_ts_id.astype(np.int32)

    assert len(frames) == len(frame_ts_id)

    frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
    frame_ts_id_group = group_array(frame_ts_id, packing_nums)
    
    return frames, frame_ts_id_group


video_path="video_test.mp4"
fps = 5 # fps for video
force_packing = None # You can set force_packing to ensure that 3D packing is forcibly enabled; otherwise, encode_video will dynamically set the packing quantity based on the duration.
frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)

question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]


answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=1,
    temporal_ids=frame_ts_id_group
)
print(answer)
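
For intuition about the temporal_ids argument used above, here is a minimal, self-contained illustration with made-up numbers (9 frames from a 3-second clip, packing_nums = 3). The rounding to a 0.1-second grid is a simplification of the nearest-scale mapping done in encode_video.

# Minimal illustration of the temporal_ids layout (hypothetical numbers).
import numpy as np

TIME_SCALE = 0.1
frame_ts = np.array([0.0, 0.33, 0.67, 1.0, 1.33, 1.67, 2.0, 2.33, 2.67])  # timestamps in seconds
frame_ts_id = np.round(frame_ts / TIME_SCALE).astype(int).tolist()        # indices on a 0.1 s grid

def group_array(arr, size):
    return [arr[i:i + size] for i in range(0, len(arr), size)]

temporal_ids = group_array(frame_ts_id, 3)   # packing_nums = 3
print(temporal_ids)   # [[0, 3, 7], [10, 13, 17], [20, 23, 27]]
# Each inner list is one packing group: its frames are jointly compressed into 64 tokens,
# and the ids tell the 3D-Resampler where those frames sit on the time axis.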

Chat with Multiple Images

Click to view Python code for running MiniCPM-V 4.5 with multiple images as input.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-Context Few-Shot Learning

Click to view Python code for running MiniCPM-V 4.5 with few-shot input.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

License

Model License

  • The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.
  • To help us better understand and support our users, we would be grateful if you could consider filling out a short, optional registration questionnaire.

Statement

  • As a large multimodal model (LMM), MiniCPM-V 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-V 4.5 does not represent the views or positions of the model developers.
  • We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems caused by misdirection, misuse, dissemination, or improper use of the model.

Key Techniques and Other Multimodal Projects

👏 Welcome to explore the key techniques of MiniCPM-V 4.5 and other multimodal projects from our team:

VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

@misc{yu2025minicpmv45cookingefficient,
      title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe}, 
      author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and Bokai Xu and Junbo Cui and Yingjing Xu and Liqing Ruan and Luoyuan Zhang and Hanyu Liu and Jingkun Tang and Hongyuan Liu and Qining Guo and Wenhao Hu and Bingxiang He and Jie Zhou and Jie Cai and Ji Qi and Zonghao Guo and Chi Chen and Guoyang Zeng and Yuxuan Li and Ganqu Cui and Ning Ding and Xu Han and Yuan Yao and Zhiyuan Liu and Maosong Sun},
      year={2025},
      eprint={2509.18154},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.18154}, 
}

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nat Commun 16, 5509 (2025)},
  year={2025}
}