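MiniCPM-V-4.6-Thinking-GPTQ: a model for efficient image and video understanding on phones and other edge devices, with support for complex multimodal reasoning, math, and OCR tasks. It is the GPTQ-quantized version of MiniCPM-V 4.6 Thinking, generates explicit reasoning chains, and keeps the edge-friendly architecture. [This summary was AI-generated.]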

This repository hosts the GPTQ (W4A16, GPTQModel) quantized version of MiniCPM-V 4.6 Thinking. For the original BF16 weights and the full model card, see openbmb/MiniCPM-V-4.6-Thinking.

A pocket-sized MLLM for ultra-efficient image and video understanding on your phone

MiniCPM-V 4.6 Thinking

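MiniCPM-V 4.6 Thinking is the long chain-of-thought reasoning variant of MiniCPM-V 4.6. It produces an explicit reasoning trace before the final answer, which significantly improves performance on complex multimodal reasoning, math, and OCR-heavy tasks. It keeps the same edge-friendly architecture as MiniCPM-V 4.6 (SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM) and the same mixed 4x/16x visual token compression.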

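Evaluation

Overall performance (Thinking version)

MiniCPM-V 4.6 (Instruct version) performance

MiniCPM-V 4.6 inference efficiency results

High-concurrency throughput

Single-request time to first token (TTFT, ms)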

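Examples

Overview

MiniCPM-V 4.6 can be deployed on the three mainstream edge platforms: iOS, Android, and HarmonyOS. The clips below are raw screen recordings from phones, without any editing.

iPhone (iPhone 17 Pro Max)	Android (Redmi K70)	HarmonyOS (HUAWEI nova 14)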

Usage

Inference with Transformers

Installation

pip install "transformers[torch]>=5.7.0" torchvision torchcodec

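Note on CUDA compatibility: torchcodec (used for video decoding) may be incompatible with some CUDA versions. For example, torch>=2.11 bundles CUDA 13.1 by default, and environments running CUDA 12.x may hit errors such as RuntimeError: Could not load libtorchcodec. There are two workarounds:
Replace torchcodec with PyAV, which supports both image and video inference and has no CUDA version constraint: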
pip install "transformers[torch]>=5.7.0" torchvision av
Pin the CUDA version when installing torch so that it matches your environment (for example, CUDA 12.8):
pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128
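
Since the torchcodec failure typically surfaces when the library is loaded, it can help to probe the environment before running video inference. The sketch below is one possible check, assuming a backend counts as usable if importing it succeeds. `pick_video_backend` and `describe_environment` are hypothetical helpers, not part of Transformers.

```python
import importlib


def pick_video_backend(candidates=("torchcodec", "av")):
    """Return the first module name in `candidates` that imports cleanly, else None.

    Hypothetical helper: it only checks importability, which is where a
    'Could not load libtorchcodec' style error would be expected to show up.
    """
    for name in candidates:
        try:
            importlib.import_module(name)
        except (ImportError, RuntimeError, OSError):
            # ImportError: package missing; RuntimeError/OSError: native library failed to load.
            continue
        return name
    return None


def describe_environment():
    """Report the installed torch build and the CUDA version it was compiled against."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda": None}
    return {"torch": torch.__version__, "cuda": torch.version.cuda}
```

If `pick_video_backend()` returns `None`, neither decoder is importable, and one of the two install commands above is the likely fix.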

Load the model

from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "openbmb/MiniCPM-V-4.6-Thinking-GPTQ"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Flash Attention 2 is recommended for better acceleration and memory savings,
# especially in multi-image and video scenarios.
# import torch
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

Image inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }
]

downsample_mode = "16x"  # Use downsample_mode="4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_slice_nums=36,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
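
Because the Thinking variant emits a reasoning trace before its final answer, callers often want the two parts separately. The sketch below shows one way to split them. It assumes the trace is closed by a `</think>` tag (and optionally opened by `<think>`); this card does not state the delimiters, so treat both tags as assumptions and adjust them to what the decoded output actually contains. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumption: the reasoning trace ends with `close_tag`. The opening tag may be
    missing when the chat template already placed it in the prompt.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: return everything as the answer.
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()
```

A missing closing tag can also mean generation stopped inside the trace, for example when `max_new_tokens` is too small; in that case a larger budget may be worth trying before treating the text as a final answer.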

Video inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
            {"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
        ],
    }
]

downsample_mode = "16x"  # Use downsample_mode="4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

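Advanced parameters

You can customize image and video processing by passing extra arguments to apply_chat_template:

Parameter	Default	Applies to	Description
`downsample_mode`	`"16x"`	Images and video	Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4 times as many tokens for finer detail. This argument must also be passed to `generate()`.
`max_slice_nums`	`9`	Images and video	Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail in large images. Recommended: `36` for images, `1` for video.
`max_num_frames`	`128`	Video only	Maximum number of main frames sampled from a video.
`stack_frames`	`1`	Video only	Total number of sampling points per second. `1` = main frames only (no stacking). `N` (N>1) = 1 main frame + N−1 subframes per second; the subframes are composited into a grid image and interleaved with the main frames. Recommended: `3` or `5`.
`use_image_id`	`True`	Images and video	Whether to add an `<image_id>N</image_id>` tag before each image/frame placeholder. Recommended: `True` for images, `False` for video.

Note: downsample_mode must be passed to both apply_chat_template (so that the number of placeholders is correct) and generate (for the vision encoder). All other parameters only need to be passed to apply_chat_template.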
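
Since `downsample_mode` has to match in two places, it may be convenient to build both argument sets in one spot. The sketch below is a hypothetical helper (`build_processing_kwargs` is not part of Transformers). It simply encodes the recommendations from the parameter table above, with the video values taken from the video inference example.

```python
def build_processing_kwargs(media, downsample_mode="16x"):
    """Return (template_kwargs, generate_kwargs) for one request.

    `media` is "image" or "video". Per-media values follow the recommendations
    in the parameter table; this is an illustrative helper, not a library API.
    """
    if downsample_mode not in ("4x", "16x"):
        raise ValueError(f"unsupported downsample_mode: {downsample_mode!r}")
    if media == "image":
        template_kwargs = {"max_slice_nums": 36, "use_image_id": True}
    elif media == "video":
        template_kwargs = {
            "max_slice_nums": 1,
            "use_image_id": False,
            "max_num_frames": 128,
            "stack_frames": 1,
        }
    else:
        raise ValueError(f"unsupported media type: {media!r}")
    # The same value goes to both calls so placeholder count and encoder agree.
    template_kwargs["downsample_mode"] = downsample_mode
    generate_kwargs = {"downsample_mode": downsample_mode}
    return template_kwargs, generate_kwargs


# Usage with the objects from the inference examples (not executed here):
# template_kwargs, generate_kwargs = build_processing_kwargs("video")
# inputs = processor.apply_chat_template(
#     messages, tokenize=True, add_generation_prompt=True,
#     return_dict=True, return_tensors="pt", **template_kwargs,
# ).to(model.device)
# generated_ids = model.generate(**inputs, **generate_kwargs, max_new_tokens=2048)
```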

Deploy with `transformers serve`

Hugging Face Transformers includes a lightweight OpenAI-compatible server, suitable for quick testing and moderate-load deployments.

pip install "transformers[serving]>=5.7.0"

Start the server:

transformers serve openbmb/MiniCPM-V-4.6-Thinking-GPTQ --port 8000 --host 0.0.0.0 --continuous-batching

Send a request:

curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "openbmb/MiniCPM-V-4.6-Thinking-GPTQ",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
        {"type": "text", "text": "What causes this phenomenon?"}
      ]
    }]
  }'
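
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl body above. `build_chat_payload` and `post_chat` are hypothetical helper names, and the response access in the usage comment assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_payload(model, image_url, question):
    """Build an OpenAI-style chat completion body with one image and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def post_chat(base_url, payload, timeout=120):
    """POST the payload to {base_url}/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running):
# payload = build_chat_payload(
#     "openbmb/MiniCPM-V-4.6-Thinking-GPTQ",
#     "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png",
#     "What causes this phenomenon?",
# )
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```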

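Handling escaped newlines in model output

In some cases the model may output the escaped newline \n as a literal string rather than an actual line break. To render the text correctly (especially in a UI layer), you can use the utility function below. It conservatively replaces literal \n with real newlines while protecting contexts where \n carries specific meaning.

Utility function: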

import re

_PATTERN = re.compile(
    r'(```[\s\S]*?```'       # fenced code blocks
    r'|`[^`]+`'              # inline code
    r'|\$\$[\s\S]*?\$\$'     # display math
    r'|\$[^$]+\$'            # inline math
    r'|\\\([\s\S]*?\\\)'     # \(...\)
    r'|\\\[[\s\S]*?\\\]'     # \[...\]
    r')'
    r'|(?<!\\)(?:\\r\\n|\\[nr])'
)

def normalize_response_text(text: str) -> str:
    """
    Lightweight post-processing: Converts literal '\\n' to actual newlines, 
    while protecting code blocks, inline code, and LaTeX commands.
    """
    if not isinstance(text, str) or "\\" not in text:
        return text
    return _PATTERN.sub(lambda m: m.group(1) or '\n', text)

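Deploying MiniCPM-V 4.6 on iOS, Android, and HarmonyOS

We have adapted MiniCPM-V 4.6 for deployment on iOS, Android, and HarmonyOS, and all edge adaptation code is fully open source. Developers can reproduce the on-device experience in a few steps. Visit our edge deployment repository for per-platform build guides, or go to the download page to try the prebuilt apps directly.

Using MiniCPM-V 4.6 with other inference and training frameworks

MiniCPM-V 4.6 supports multiple inference and training frameworks. Quick-start commands for each framework follow. For full details, see our usage guide.

vLLM (full guide)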

vllm serve openbmb/MiniCPM-V-4.6-Thinking-GPTQ \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs '{"enable_thinking": true}'

Note: --enable-auto-tool-choice and --tool-call-parser qwen3_coder enable tool/function calling. If you do not need tools, omit both flags and run vllm serve openbmb/MiniCPM-V-4.6-Thinking-GPTQ.

curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-Thinking-GPTQ",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
    {"type": "text", "text": "What causes this phenomenon?"}
  ]}]
}'

Tool calling example:

curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-Thinking-GPTQ",
  "messages": [{"role": "user", "content": [
    {"type": "text", "text": "北京的天气"}
  ]}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
      }
    }
  }]
}'
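
When the model decides to call a tool, an OpenAI-compatible server returns the call in the assistant message rather than a text answer, and the client is expected to run the tool and send the result back. The sketch below illustrates that client-side step. It assumes the standard OpenAI-style `tool_calls` layout (an `id` plus a `function` object whose `arguments` field is a JSON string). `get_weather` here is a stub and `run_tool_calls` is a hypothetical helper.

```python
import json


def get_weather(location):
    """Stand-in tool: a real implementation would query a weather service."""
    return {"location": location, "forecast": "unknown (stub)"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message, tools=TOOLS):
    """Execute each tool call in an assistant message and return `tool` role messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        handler = tools.get(function["name"])
        if handler is None:
            content = {"error": f"unknown tool: {function['name']}"}
        else:
            # In the OpenAI-style format, arguments arrive as a JSON-encoded string.
            arguments = json.loads(function["arguments"] or "{}")
            content = handler(**arguments)
        results.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(content, ensure_ascii=False),
            }
        )
    return results
```

The returned messages would be appended to the conversation after the assistant message, and the request sent again so the model can produce its final answer.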

SGLang (full guide)

python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6-Thinking-GPTQ --port 30000

curl -s http://localhost:30000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-Thinking-GPTQ",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
    {"type": "text", "text": "What causes this phenomenon?"}
  ]}]
}'

llama.cpp (full guide)

llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080

curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "MiniCPM-V-4.6",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
    {"type": "text", "text": "What causes this phenomenon?"}
  ]}]
}'

Ollama (full guide)

ollama run minicpm-v-4.6-thinking

In the interactive session, paste an image path or URL to chat with the model.

LLaMA-Factory (fine-tuning, full guide)

llamafactory-cli train examples/train_lora/minicpmv4_6_lora_sft.yaml

ms-swift (fine-tuning, full guide)

swift sft --model_type minicpm-v-4_6 --dataset <your-dataset>

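License

Model license

The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.

Statement

As multimodal large language models, MiniCPM-o/V models generate content by learning from large amounts of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgments. Nothing generated by MiniCPM-o/V models represents the views or positions of the model developers.
We accept no liability for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, public opinion risks, or any risks and problems caused by the models being misled, abused, disseminated, or misused.

Technical reports and key technique papers

👏 Welcome to explore the key techniques behind MiniCPM-o/V and our team's other multimodal projects:

Technical reports: MiniCPM-o 4.5 | MiniCPM-V 4.5 | MiniCPM-o 2.6 | MiniCPM-Llama3-V 2.5 | MiniCPM-V 2.0

Other multimodal projects: VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our model, code, or papers helpful, please consider citing our papers 📝 and giving us a star ⭐️!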

@misc{cui2026minicpmo45realtimefullduplex,
      title={MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction}, 
      author={Junbo Cui and Bokai Xu and Chongyi Wang and Tianyu Yu and Weiyue Sun and Yingjing Xu and Tianran Wang and Zhihui He and Wenshuo Ma and Tianchi Cai and others},
      year={2026},
      url={https://arxiv.org/abs/2604.27393}, 
}

@proceedings{yu2025minicpmv45cookingefficient,
      title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe}, 
      author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and others},
      year={2025},
      url={https://arxiv.org/abs/2509.18154}, 
}

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}

GitHub | CookBook | Demo | Feishu (Lark)