This repository hosts GGUF (llama.cpp) quantized builds of MiniCPM-V 4.6. For the original BF16 weights and the full model card, see openbmb/MiniCPM-V-4.6.
An ultra-efficient compact multimodal model for image and video understanding on mobile phones
MiniCPM-V 4.6 is our most edge-deployment-friendly model to date. It is built on SigLIP2-400M and the Qwen3.5-0.8B LLM. It inherits the strong single-image, multi-image, and video understanding of the MiniCPM-V series while significantly improving computational efficiency, and it introduces hybrid 4x/16x visual-token compression. Notable features of MiniCPM-V 4.6 include:
🔥 Leading base capability. MiniCPM-V 4.6 scores 13 on the Artificial Analysis Intelligence Index, outperforming Qwen3.5-0.8B (score 10) at 19x lower token cost and Qwen3.5-0.8B-Thinking (score 11) at 43x lower token cost. It also surpasses the larger Ministral 3 3B (score 11).
💪 Strong multimodal capability. MiniCPM-V 4.6 outperforms Qwen3.5-0.8B on most vision-language understanding tasks and reaches Qwen3.5 2B-level capability on many benchmarks, including OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench.
🚀 Ultra-efficient architecture. Building on the latest techniques from LLaVA-UHD v4, MiniCPM-V 4.6 cuts visual-encoding FLOPs by more than 50%. This makes it more efficient even than smaller models, with roughly 1.5x the token throughput of Qwen3.5-0.8B. It also supports hybrid 4x/16x visual-token compression ratios, allowing a flexible trade-off between accuracy and speed.
📱 Broad mobile-platform coverage. MiniCPM-V 4.6 can be deployed on all three major mobile platforms: iOS, Android, and HarmonyOS. All edge adaptation code is open source, so developers can reproduce the on-device experience in just a few steps.
🛠️ Developer friendly. MiniCPM-V 4.6 is adapted to inference frameworks such as vLLM, SGLang, llama.cpp, and Ollama, and to fine-tuning ecosystems such as SWIFT and LLaMA-Factory, so developers can quickly customize the model for new domains and tasks on consumer-grade GPUs. We provide quantized variants in GGUF, BNB, AWQ, and GPTQ formats; a loading sketch follows.
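As an example of using one of the quantized formats, here is a minimal sketch of loading a 4-bit BitsAndBytes (BNB) build with Transformers. It assumes on-the-fly BNB quantization works for this architecture and that a CUDA GPU plus the `bitsandbytes` package are available; for a prequantized BNB variant, check the openbmb organization page for the exact repo name.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization config (requires `pip install bitsandbytes` and a CUDA GPU).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model_id = "openbmb/MiniCPM-V-4.6"  # or a prequantized BNB variant, if published
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```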
*(Benchmark charts: overall performance on instruction following; high-concurrency throughput; and single-request time to first token, in milliseconds.)*
MiniCPM-V 4.6 can be deployed on the three major mobile platforms: iOS, Android, and HarmonyOS. Below are raw, unedited screen recordings on phone devices.
| iOS iPhone 17 Pro Max | Android Redmi K70 | HarmonyOS HUAWEI nova 14 |
|---|---|---|
| ![]() | ![]() | ![]() |
pip install "transformers[torch]>=5.7.0" torchvision torchcodec关于 CUDA 兼容性的说明:
`torchcodec` (used for video decoding) may have compatibility issues with certain CUDA versions. For example, `torch>=2.11` bundles CUDA 13.1 by default, so a CUDA 12.x environment may hit errors such as `RuntimeError: Could not load libtorchcodec`. Two workarounds (a quick backend check is sketched after this list):
- Replace `torchcodec` with `PyAV`, which supports both image and video inference and has no CUDA version constraint:

  ```bash
  pip install "transformers[torch]>=5.7.0" torchvision av
  ```

- Pin the CUDA version when installing torch to match your environment (e.g., CUDA 12.8):
pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128
Image inference:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
model_id = "openbmb/MiniCPM-V-4.6"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, torch_dtype="auto", device_map="auto"
)
# Flash Attention 2 is recommended for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# model = AutoModelForImageTextToText.from_pretrained(
# model_id,
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )

messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
{"type": "text", "text": "What causes this phenomenon?"},
],
}
]
downsample_mode = "16x" # Using `downsample_mode="4x"` for Finer Detail
inputs = processor.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt",
downsample_mode=downsample_mode,
max_slice_nums=36,
).to(model.device)
generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

Video inference follows the same pattern:

```python
messages = [
{
"role": "user",
"content": [
{"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
{"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
],
}
]
downsample_mode = "16x" # Using `downsample_mode="4x"` for Finer Detail
inputs = processor.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt",
downsample_mode=downsample_mode,
max_num_frames=128,
stack_frames=1,
max_slice_nums=1,
use_image_id=False,
).to(model.device)
generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

You can customize image/video processing by passing extra arguments to `apply_chat_template`:
| Parameter | Default | Applies to | Description |
|---|---|---|---|
| `downsample_mode` | `"16x"` | Images and videos | Visual-token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4x more tokens for finer detail. Must also be passed to `generate()`. |
| `max_slice_nums` | `9` | Images and videos | Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail in large images. Recommended: 36 for images, 1 for videos. |
| `max_num_frames` | `128` | Videos only | Maximum number of main frames sampled from a video. |
| `stack_frames` | `1` | Videos only | Total sampling points per second. 1 = main frames only (no stacking). N (N > 1) = 1 main frame + N-1 sub-frames per second; sub-frames are composited into a grid image and interleaved with the main frames. Recommended: 3 or 5. |
| `use_image_id` | `True` | Images and videos | Whether to prepend an `<image_id>N</image_id>` tag to each image/frame placeholder. Recommended: `True` for images, `False` for videos. |
Note: `downsample_mode` must be passed to both `apply_chat_template` (so the number of visual-token placeholders is correct) and `generate` (where the vision encoder uses it). All other parameters only need to be passed to `apply_chat_template`.
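Putting the table's recommended video values together, a call might look like the sketch below (`messages`, `processor`, and `model` as in the video example above; `stack_frames=3` per the table's recommendation):

```python
# Recommended video settings from the table above.
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode="16x",   # must also be passed to generate()
    max_num_frames=128,
    stack_frames=3,          # 1 main frame + 2 sub-frames per second
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)
generated_ids = model.generate(**inputs, downsample_mode="16x", max_new_tokens=2048)
```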
Serving with `transformers serve`: Hugging Face Transformers includes a lightweight, OpenAI-compatible server suitable for quick testing and moderate-load deployments.
pip install "transformers[serving]>=5.7.0"启动服务器:
```bash
transformers serve openbmb/MiniCPM-V-4.6 --port 8000 --host 0.0.0.0 --continuous-batching
```

Send a request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "openbmb/MiniCPM-V-4.6",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]
}]
}'
```

In some cases the model may output the escaped newline `\n` as a literal string rather than an actual line break. To render text correctly (especially in UI layers), you can use the utility function below. It carefully replaces literal `\n` with real newlines while protecting contexts where `\n` carries specific semantics.
Utility function:
````python
import re

_PATTERN = re.compile(
    r'(```[\s\S]*?```'            # fenced code blocks
    r'|`[^`]+`'                   # inline code
    r'|\$\$[\s\S]*?\$\$'          # display math: $$...$$
    r'|\$[^$]+\$'                 # inline math: $...$
    r'|\\\([\s\S]*?\\\)'          # LaTeX inline math: \(...\)
    r'|\\\[[\s\S]*?\\\]'          # LaTeX display math: \[...\]
    r')'
    r'|(?<!\\)(?:\\r\\n|\\[nr])'  # literal \r\n, \n, or \r not preceded by a backslash
)
def normalize_response_text(text: str) -> str:
    """
    Lightweight post-processing: converts literal '\\n' into actual newlines,
    while protecting code blocks, inline code, and LaTeX math and commands.
    """
    if not isinstance(text, str) or "\\" not in text:
        return text
    return _PATTERN.sub(lambda m: m.group(1) or '\n', text)
````

We have adapted MiniCPM-V 4.6 for deployment on iOS, Android, and HarmonyOS, and all edge-side adaptation code is fully open source; developers can reproduce the on-device experience in just a few steps. Visit our edge deployment repository for per-platform build guides, or head to the download page to try the prebuilt apps directly.
MiniCPM-V 4.6 supports a variety of inference and training frameworks. Quick-start commands for each are listed below; see our user guide for full details.
```bash
vllm serve openbmb/MiniCPM-V-4.6 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
    --default-chat-template-kwargs '{"enable_thinking": false}'
```

Note: `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder` enable tool/function-calling support. If you don't need tools, omit these flags and simply run `vllm serve openbmb/MiniCPM-V-4.6`.
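The same request can also be issued from Python via the OpenAI client; a minimal sketch (assumes `pip install openai`; vLLM accepts any placeholder API key):

```python
# Minimal sketch: query the vLLM server above via the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }],
)
print(response.choices[0].message.content)
```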
Or with curl:

```bash
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "openbmb/MiniCPM-V-4.6",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
```

Tool-calling example:

```bash
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "openbmb/MiniCPM-V-4.6",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "北京的天气"}
]}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
}'
```

With SGLang:

```bash
python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6 --port 30000
```

Send a request:

```bash
curl -s http://localhost:30000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "openbmb/MiniCPM-V-4.6",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
```

With llama.cpp, using the GGUF weights from this repo:

```bash
llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080
```

(If image input requires the model's separate multimodal projector file, pass it with `--mmproj`; check this repo's file list.)

Send a request:

```bash
curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "MiniCPM-V-4.6",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
```

With Ollama:

```bash
ollama run minicpm-v-4.6
```

In the interactive session, paste an image path or URL directly to chat with the model.
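Ollama also exposes a local OpenAI-compatible endpoint (by default at `http://localhost:11434/v1`), so requests can be scripted. A minimal sketch, assuming the model tag above is installed, your Ollama version supports this endpoint with base64 image input, and `photo.jpg` is a hypothetical local image:

```python
# Minimal sketch: query the local Ollama server via its OpenAI-compatible API.
import base64
import json
import urllib.request

with open("photo.jpg", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "minicpm-v-4.6",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```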
Fine-tuning with LLaMA-Factory:

```bash
llamafactory-cli train examples/train_lora/minicpmv4_6_lora_sft.yaml
```

Fine-tuning with SWIFT:

```bash
swift sft --model_type minicpm-v-4_6 --dataset <your-dataset>
```

👏 Feel free to explore the key techniques of MiniCPM-o/V and other multimodal projects from our team:
Technical reports: MiniCPM-o 4.5 | MiniCPM-V 4.5 | MiniCPM-o 2.6 | MiniCPM-Llama3-V 2.5 | MiniCPM-V 2.0
Other multimodal projects: VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V
If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
```bibtex
@misc{cui2026minicpmo45realtimefullduplex,
title={MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction},
author={Junbo Cui and Bokai Xu and Chongyi Wang and Tianyu Yu and Weiyue Sun and Yingjing Xu and Tianran Wang and Zhihui He and Wenshuo Ma and Tianchi Cai and others},
year={2026},
url={https://arxiv.org/abs/2604.27393},
}
@misc{yu2025minicpmv45cookingefficient,
title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and others},
year={2025},
url={https://arxiv.org/abs/2509.18154},
}
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
```