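The MiniCPM4 series is a family of highly efficient large language models (LLMs) designed for end-side devices. It achieves this efficiency through systematic innovation along four key dimensions: model architecture, training data, training algorithms, and inference systems.

BitCPM4 is a ternary-quantized model obtained from the MiniCPM series through quantization-aware training (QAT), with significant gains in both training efficiency and parameter efficiency.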
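
To make "ternary quantization" concrete, here is a minimal sketch of one common scheme: scale the weights by their mean absolute value, then round and clamp to {-1, 0, +1}. This is an assumption for illustration only. The text above does not state which quantizer BitCPM4 uses, and the names `ternary_quantize` and `dequantize` are hypothetical.

```python
def ternary_quantize(weights):
    """Map floats to codes in {-1, 0, +1} plus one shared scale (absmean scheme, assumed)."""
    if not weights:
        raise ValueError("weights must be non-empty")
    # Scale = mean absolute value; fall back to 1.0 for an all-zero input.
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    # Round w / scale to the nearest integer, then clamp into [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


example = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]
codes, scale = ternary_quantize(example)
print(codes, scale)
print(dequantize(codes, scale))
```

In quantization-aware training, the forward pass would typically use the quantized weights while the optimizer updates a full-precision copy; that training loop is omitted here.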

Inference with Hugging Face Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "openbmb/BitCPM4-0.5B"
device = "cuda"

# Load the tokenizer and model; trust_remote_code lets the repo's custom code run.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
)

messages = [
    # Prompt (Chinese): "Recommend 5 tourist attractions in Beijing."
    {"role": "user", "content": "推荐5个北京的景点。"},
]
# Render the chat template and tokenize it into a tensor of input IDs.
model_inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(device)

model_outputs = model.generate(
    model_inputs,
    max_new_tokens=1024,
    do_sample=True,  # enable sampling explicitly so top_p/temperature take effect
    top_p=0.7,
    temperature=0.7,
)
# Drop the prompt tokens so that only newly generated tokens are decoded.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```

Inference with vLLM:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/BitCPM4-0.5B",  # or local path
    max_model_len=1024,
    gpu_memory_utilization=0.7,
    enforce_eager=True,
    trust_remote_code=True,
)
sampling_params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What is artificial intelligence?"], sampling_params)
print(outputs[0].outputs[0].text)
```

BitCPM4 performs comparably to full-precision models of the same size.

This section covers adaptation verification and performance evaluation of BitCPM4-0.5B inference on the Huawei Ascend 910B2 NPU using vLLM-Ascend.

| Item | Specification |
|---|---|
| Hardware | |
| NPU model | Ascend910_9362 |
| NPU count | 2 |
| NPU memory | 61.3 GiB per card |
| CPU | aarch64, 40 cores |
| System memory | 229.4 GiB |
| Software | |
| OS | Linux 5.10.0 (HCE 2.0) |
| Python | 3.11.14 |
| PyTorch | 2.9.0+cpu |
| torch_npu | 2.9.0.post1 |
| vLLM | 0.18.0 |
| vLLM-Ascend | 0.18.0rc1 |
| CANN | 8.5.1 |
| transformers | 4.57.6 |

| Parameter | Value | Notes |
|---|---|---|
| Architecture | MiniCPMForCausalLM | Natively supported by vLLM |
| Parameter count | 433.9M (0.5B) | |
| hidden_size | 1024 | |
| num_hidden_layers | 24 | |
| num_attention_heads | 16 | |
| num_key_value_heads | 2 | GQA (8:1 query-to-KV head ratio) |
| intermediate_size | 4096 | |
| max_position_embeddings | 32768 | LongRoPE scaling |
| vocab_size | 73448 | |
| torch_dtype | bfloat16 | |
| Weight precision | BF16 | |
| Weight file size | 867 MB | 1 safetensors shard |
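
As a sanity check on the table above, the parameter count can be reproduced from the listed dimensions. The sketch below assumes a Llama-style decoder block (bias-free projections, a gated MLP with three weight matrices, two norm weight vectors per layer), tied input/output embeddings, and `head_dim = hidden_size / num_attention_heads`. None of these details are stated in the table, and `decoder_param_count` is a hypothetical helper. Under these assumptions the total is 433,873,920, which is consistent with the reported 433.9M and, at 2 bytes per bfloat16 weight, with the 867 MB weight file.

```python
def decoder_param_count(hidden, layers, heads, kv_heads, intermediate, vocab):
    """Estimate parameters of a Llama-style decoder with tied embeddings (assumed layout)."""
    head_dim = hidden // heads                      # assumed: 1024 / 16 = 64
    q_proj = hidden * heads * head_dim              # query projection
    kv_proj = 2 * hidden * kv_heads * head_dim      # key + value projections (GQA)
    o_proj = heads * head_dim * hidden              # output projection
    mlp = 3 * hidden * intermediate                 # gate, up, and down projections
    norms = 2 * hidden                              # two norm weight vectors per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    embedding = vocab * hidden                      # shared with the output head (assumed tied)
    final_norm = hidden
    return embedding + layers * per_layer + final_norm


total = decoder_param_count(
    hidden=1024, layers=24, heads=16, kv_heads=2, intermediate=4096, vocab=73448
)
print(f"{total:,} parameters, {total * 2 / 1e6:.1f} MB in bfloat16")
print(f"GQA ratio: {16 // 2}:1")
```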
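
✅ Fully adapted with zero code changes.

BitCPM4-0.5B uses the `MiniCPMForCausalLM` architecture, which is natively supported by both vLLM and vLLM-Ascend. All MiniCPM-specific parameters (`scale_emb`, `scale_depth`, `dim_model_base`, `longrope` RoPE scaling, and so on) are parsed correctly.

Accuracy is measured against a CPU reference: inference with PyTorch transformers in float32 serves as the baseline, and the top-k next-token probabilities from NPU inference (vLLM-Ascend, bfloat16) are compared against it. Seven test prompts on different topics are used; for each one, the next-token distributions from CPU and NPU are collected and compared item by item.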

| Prompt | CPU Top-1 Token | CPU Prob | NPU Top-1 Token | NPU Prob | Top-1 Prob Relative Error | Top-10 Overlap |
|---|---|---|---|---|---|---|
| The capital of France is | a | 0.3537 | a | 0.3524 | 0.38% | 10/10 ✅ |
| Einstein is known for | his | 0.7679 | his | 0.7641 | 0.49% | 10/10 ✅ |
| Quantum computing is | a | 0.7091 | a | 0.7051 | 0.57% | 10/10 ✅ |
| The meaning of life is | a | 0.9718 | a | 0.9722 | 0.04% | 10/10 ✅ |
| Machine learning is a | subset | 0.4118 | subset | 0.3963 | 3.76% | 10/10 ✅ |
| Python is a programming | language | 0.9973 | language | 0.9970 | 0.03% | 10/10 ✅ |
| Natural language processing | ( | 0.6236 | ( | 0.6271 | 0.57% | 10/10 ✅ |

| Metric | Value |
|---|---|
| Top-1 token match rate | 100% (7/7) ✅ |
| Top-5 token inclusion rate | 100% (7/7) ✅ |
| Top-10 token overlap | 100% (average 10.0/10) ✅ |
| Mean relative error of top-1 probability | 0.84% ✅ (< 1%) |
| Max relative error of top-1 probability | 3.76% (Prompt: "Machine learning is a") |
| Top-10 probability mean absolute error (MAE) | 0.00114 |
| Top-10 probability max absolute error | 0.01550 |
| Top-10 probability root-mean-square error (RMSE) | 0.00283 |
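
Conclusion: The NPU (bfloat16) output matches the CPU (float32) reference on every top-1 prediction, with a mean top-1 probability relative error of 0.84%, below the 1% deviation threshold. One prompt reaches a top-1 relative error of 3.76%, but this is because the top-1 probability itself is relatively low (`subset`: CPU 0.41 vs NPU 0.40); the absolute difference is only 0.0155 and does not change the selected token. Overall, accuracy meets the requirements for production deployment.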
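
The per-prompt metrics above can be computed along the following lines. This is a minimal sketch, not the evaluation script behind the reported numbers: the exact metric definitions are not given here, `compare_topk` is a hypothetical helper, and the MAE is taken over tokens present in both top-k lists. The input values are the top-10 probabilities reported for the prompt "The capital of France is".

```python
import math


def compare_topk(ref, test):
    """Compare two {token: probability} top-k dicts (reference vs. test backend)."""
    ref_top1 = max(ref, key=ref.get)
    test_top1 = max(test, key=test.get)
    shared = set(ref) & set(test)
    abs_errs = [abs(ref[t] - test[t]) for t in shared]
    return {
        "top1_match": ref_top1 == test_top1,
        "overlap": len(shared),
        # Relative error of the reference top-1 token's probability.
        "top1_rel_err": abs(ref[ref_top1] - test.get(ref_top1, 0.0)) / ref[ref_top1],
        # Mean absolute error over tokens present in both lists (assumed definition).
        "mae": sum(abs_errs) / len(abs_errs) if abs_errs else math.nan,
    }


cpu = {
    "a": 0.353706, "Paris": 0.275060, "known": 0.052770, "the": 0.034749,
    "not": 0.021897, "named": 0.018752, "located": 0.018359,
    "widely": 0.014278, "an": 0.013214, "held": 0.012054,
}
npu = {
    "a": 0.352359, "Paris": 0.274417, "known": 0.054036, "the": 0.032774,
    "not": 0.022526, "located": 0.018674, "named": 0.018674,
    "widely": 0.014544, "an": 0.013662, "held": 0.012835,
}
result = compare_topk(cpu, npu)
print(result)
```

Comparing sets rather than ranks means that a swap between two near-tied tokens still counts as full top-10 overlap.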

Prompt: "The capital of France is"

| Rank | CPU Token | CPU Prob | NPU Token | NPU Prob | Match |
|---|---|---|---|---|---|
| 1 | `a` | 0.353706 | `a` | 0.352359 | ✅ |
| 2 | `Paris` | 0.275060 | `Paris` | 0.274417 | ✅ |
| 3 | `known` | 0.052770 | `known` | 0.054036 | ✅ |
| 4 | `the` | 0.034749 | `the` | 0.032774 | ✅ |
| 5 | `not` | 0.021897 | `not` | 0.022526 | ✅ |
| 6 | `named` | 0.018752 | `located` | 0.018674 | |
| 7 | `located` | 0.018359 | `named` | 0.018674 | |
| 8 | `widely` | 0.014278 | `widely` | 0.014544 | ✅ |
| 9 | `an` | 0.013214 | `an` | 0.013662 | ✅ |
| 10 | `held` | 0.012054 | `held` | 0.012835 | ✅ |

Ranks 6 and 7 are swapped between CPU and NPU. The NPU assigns `located` and `named` the same probability (0.018674), so the two are effectively tied and the top-10 set is unchanged.

| max_tokens | Total time | Input throughput | Output throughput |
|---|---|---|---|
| 32 | 1.07s | 59.6 tok/s | 298.2 tok/s |
| 64 | 1.95s | 32.8 tok/s | 316.3 tok/s |
| 128 | 3.81s | 16.8 tok/s | 336.4 tok/s |

| Metric | Value |
|---|---|
| Input tokens | 11 |
| Output tokens | 128 |
| Total time | 3.67s |
| Generation throughput | 34.8 tok/s |

| Metric | Value |
|---|---|
| Prefill throughput | ~4273 tok/s |
| Decode latency | ~59ms/token |

| Metric | Value |
|---|---|
| Model load time | 0.33s |
| Model weight memory | 0.82 GiB |
| KV cache capacity | 41.78 GiB |
| Max concurrency (1024 tokens/req) | 3565× |
| Engine initialization time | 3.67s |
| Total NPU memory usage | 43.29 GiB / 61.27 GiB |
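
The max-concurrency figure can be related to the KV cache capacity with a short calculation. The sketch below assumes a bfloat16 KV cache (2 bytes per element) and `head_dim = hidden_size / num_attention_heads = 64`; neither is stated explicitly above, and `kv_bytes_per_token` is a hypothetical helper. With the model dimensions listed earlier (24 layers, 2 KV heads), this gives 12,288 bytes per token and 3565 concurrent requests of 1024 tokens, in line with the reported 3565×.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


head_dim = 1024 // 16  # assumed: hidden_size / num_attention_heads
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=2, head_dim=head_dim)

# Tokens that fit into the reported 41.78 GiB of KV cache.
cache_tokens = int(41.78 * GIB // per_token)

# Number of 1024-token requests that fit at once.
max_concurrency = cache_tokens // 1024
print(per_token, cache_tokens, max_concurrency)
```

With only 2 KV heads instead of 16, grouped-query attention makes the per-token cache 8 times smaller than it would be with full multi-head attention, which is what allows this many concurrent sequences.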

Input: "What is artificial intelligence?"

Output:

Artificial intelligence (AI) is the simulation of human intelligence processes by computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

Input: "请用中文简单介绍一下你自己。" ("Please briefly introduce yourself in Chinese.")

Output:

我是 MiniCPM4,一个高效的语言模型,可以在各种设备上... ("I am MiniCPM4, an efficient language model that can, on all kinds of devices...")

```bibtex
@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}
```