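BitCPM4-0.5B: a model for efficient text generation on end-side devices or Huawei Ascend NPUs. The project uses quantization-aware training to compress the model parameters into ternary values while keeping performance comparable to full-precision models of the same size. It supports inference with vLLM-Ascend and has been validated on the Ascend 910B2 NPU, where it runs with high accuracy and low error. (This summary was AI-generated.)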

TCFY7/BitCPM4-0.5B

GitHub Repo | Technical Report

👋 Join us on Discord and WeChat

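MiniCPM4 Series

The MiniCPM4 series consists of highly efficient large language models (LLMs) designed specifically for end-side devices. The efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.

MiniCPM4-8B: The flagship model of MiniCPM4, with 8B parameters, trained on 8T tokens.
MiniCPM4-0.5B: The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
MiniCPM4-8B-Eagle-FRSpec: Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu: Eagle head for FRSpec trained with QAT, efficiently combining speculation and quantization to achieve maximum acceleration for MiniCPM4-8B.
MiniCPM4-8B-Eagle-vLLM: Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
MiniCPM4-8B-marlin-Eagle-vLLM: Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
BitCPM4-0.5B: Applies extreme ternary quantization to MiniCPM4-0.5B, compressing the model parameters into ternary values. (⬅ you are here)
BitCPM4-1B: Applies extreme ternary quantization to MiniCPM3-1B.
MiniCPM4-Survey: Based on MiniCPM4-8B; autonomously generates trustworthy, long-form survey papers.
MiniCPM4-MCP: Based on MiniCPM4-8B; autonomously calls relevant MCP tools.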

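Introduction

BitCPM4 models are ternary-quantized models derived from the MiniCPM series through quantization-aware training (QAT), with significant improvements in both training efficiency and model parameter efficiency.

Improvements to the training method
- Hyperparameters are searched with wind tunnel experiments on a small model.
- A two-stage training method is used: training in high precision first, followed by QAT.
High parameter efficiency
- With a bit width of only 1.58 bits, performance is comparable to full-precision models with a similar number of parameters.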
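
The 1.58-bit figure follows from the fact that a ternary weight takes one of three values and therefore carries log2(3) ≈ 1.58 bits. The sketch below illustrates this together with a simple ternary quantizer. The source does not describe BitCPM4's actual quantizer, so the absmean scaling rule (borrowed from BitNet b1.58) and the function names `ternary_quantize` and `dequantize` are illustrative assumptions only.

```python
import math

# Information content of one ternary weight: log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternary_quantize(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} with a single per-tensor scale.

    Assumption: absmean scaling in the style of BitNet b1.58.
    BitCPM4's actual quantizer may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from ternary values."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
q, s = ternary_quantize(weights)
print(f"bits per weight: {BITS_PER_TERNARY_WEIGHT:.2f}")
print("ternary values:", q)
print("dequantized:", [round(w, 3) for w in dequantize(q, s)])
```

In QAT, a quantizer of this kind would typically be applied in the forward pass while gradients update the underlying high-precision weights; that training loop is outside the scope of this sketch.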

Usage

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "openbmb/BitCPM4-0.5B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

messages = [
    {"role": "user", "content": "推荐5个北京的景点。"},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

model_outputs = model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)

output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)

Inference with vLLM-Ascend (Huawei Ascend NPU)

from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/BitCPM4-0.5B",  # or local path
    max_model_len=1024,
    gpu_memory_utilization=0.7,
    enforce_eager=True,
    trust_remote_code=True,
)

sampling_params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What is artificial intelligence?"], sampling_params)
print(outputs[0].outputs[0].text)

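Evaluation Results (Original Data)

BitCPM4's performance is comparable to that of other full-precision models of the same size.

[Figure: BitCPM benchmark]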

Ascend NPU Adaptation and Evaluation Report

This section covers the adaptation validation and performance evaluation of BitCPM4-0.5B inference on a Huawei Ascend 910B2 NPU via vLLM-Ascend.

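Environment

| Item | Specification |
|------|---------------|
| Hardware | |
| NPU model | Ascend910_9362 |
| NPU count | 2 |
| NPU memory | 61.3 GiB per card |
| CPU | aarch64, 40 cores |
| System memory | 229.4 GiB |
| Software | |
| OS | Linux 5.10.0 (HCE 2.0) |
| Python | 3.11.14 |
| PyTorch | 2.9.0+cpu |
| torch_npu | 2.9.0.post1 |
| vLLM | 0.18.0 |
| vLLM-Ascend | 0.18.0rc1 |
| CANN | 8.5.1 |
| transformers | 4.57.6 |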

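Model Configuration

| Parameter | Value | Notes |
|-----------|-------|-------|
| Architecture | `MiniCPMForCausalLM` | Natively supported by vLLM |
| Parameter count | 433.9M (0.5B) | |
| hidden_size | 1024 | |
| num_hidden_layers | 24 | |
| num_attention_heads | 16 | |
| num_key_value_heads | 2 | GQA (8:1 compression ratio) |
| intermediate_size | 4096 | |
| max_position_embeddings | 32768 | LongRoPE scaling |
| vocab_size | 73448 | |
| torch_dtype | bfloat16 | |
| Weight precision | BF16 | |
| Weight file size | 867 MB | 1 safetensors shard |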
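
Several entries in the configuration table can be cross-checked with simple arithmetic. The sketch below derives the GQA ratio and estimates the BF16 weight size from the parameter count. The per-head dimension is not listed in the table; computing it as `hidden_size / num_attention_heads` is an assumption.

```python
# Values copied from the model configuration table.
num_params = 433.9e6
hidden_size = 1024
num_attention_heads = 16
num_key_value_heads = 2

# Assumption: head_dim = hidden_size / num_attention_heads (not listed in the table).
head_dim = hidden_size // num_attention_heads

# GQA ratio: how many query heads share one key/value head.
gqa_ratio = num_attention_heads // num_key_value_heads

# BF16 stores each parameter in 2 bytes.
BYTES_PER_BF16 = 2
weight_mb = num_params * BYTES_PER_BF16 / 1e6

print(f"head_dim = {head_dim}")
print(f"GQA ratio = {gqa_ratio}:1")
print(f"estimated BF16 weight size = {weight_mb:.0f} MB")
```

The estimate lands within about 1 MB of the reported 867 MB file, which is consistent with the checkpoint storing its weights as BF16 values (2 bytes each) rather than in a packed 1.58-bit format.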

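Adaptation Conclusion

✅ Fully adapted with zero code changes

BitCPM4-0.5B uses the MiniCPMForCausalLM architecture, which is natively supported by both vLLM and vLLM-Ascend. All MiniCPM-specific parameters (scale_emb, scale_depth, dim_model_base, longrope RoPE scaling, etc.) are parsed correctly.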

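Accuracy Evaluation

Evaluation Method

CPU inference (PyTorch transformers, float32) serves as the reference baseline, against which the top-k token prediction probabilities of NPU inference (vLLM-Ascend, bfloat16) are compared. Seven test prompts on different topics are used; for each prompt, the next-token prediction distributions from the CPU and the NPU are collected and compared item by item.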
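
The per-prompt comparison can be sketched as follows. This is a minimal illustration, not the evaluation script used for this report: the function name `compare_topk` is hypothetical, and the probabilities are the ones this report lists in its top-10 table for the prompt "The capital of France is".

```python
def compare_topk(cpu_topk, npu_topk):
    """Compare two top-k lists of (token, probability), each sorted by probability.

    Returns (top-1 match, relative error of the top-1 probability, overlap count).
    """
    cpu_top1_token, cpu_top1_prob = cpu_topk[0]
    npu_probs = dict(npu_topk)
    top1_match = cpu_top1_token == npu_topk[0][0]
    # Relative error of the NPU probability for the CPU's top-1 token.
    npu_prob = npu_probs.get(cpu_top1_token, 0.0)
    rel_err = abs(cpu_top1_prob - npu_prob) / cpu_top1_prob
    # Overlap ignores rank order: it counts tokens present in both lists.
    overlap = len({token for token, _ in cpu_topk} & set(npu_probs))
    return top1_match, rel_err, overlap


cpu = [("a", 0.353706), ("Paris", 0.275060), ("known", 0.052770),
       ("the", 0.034749), ("not", 0.021897), ("named", 0.018752),
       ("located", 0.018359), ("widely", 0.014278), ("an", 0.013214),
       ("held", 0.012054)]
npu = [("a", 0.352359), ("Paris", 0.274417), ("known", 0.054036),
       ("the", 0.032774), ("not", 0.022526), ("located", 0.018674),
       ("named", 0.018674), ("widely", 0.014544), ("an", 0.013662),
       ("held", 0.012835)]

match, rel_err, overlap = compare_topk(cpu, npu)
print(f"top-1 match: {match}, relative error: {rel_err:.2%}, overlap: {overlap}/10")
```

Because overlap is computed on token sets, two tokens that swap ranks (as `named` and `located` do here) still count toward a 10/10 overlap.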

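Accuracy Comparison Table

| Prompt | CPU Top-1 Token | CPU Prob | NPU Top-1 Token | NPU Prob | Top-1 Prob Relative Error | Top-10 Overlap |
|--------|-----------------|----------|-----------------|----------|---------------------------|----------------|
| The capital of France is | `a` | 0.3537 | `a` | 0.3524 | 0.38% | 10/10 ✅ |
| Einstein is known for | `his` | 0.7679 | `his` | 0.7641 | 0.49% | 10/10 ✅ |
| Quantum computing is | `a` | 0.7091 | `a` | 0.7051 | 0.57% | 10/10 ✅ |
| The meaning of life is | `a` | 0.9718 | `a` | 0.9722 | 0.04% | 10/10 ✅ |
| Machine learning is a | `subset` | 0.4118 | `subset` | 0.3963 | 3.76% | 10/10 ✅ |
| Python is a programming | `language` | 0.9973 | `language` | 0.9970 | 0.03% | 10/10 ✅ |
| Natural language processing | `(` | 0.6236 | `(` | 0.6271 | 0.57% | 10/10 ✅ |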

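Accuracy Evaluation Summary

| Metric | Value |
|--------|-------|
| Top-1 token match rate | 100% (7/7) ✅ |
| Top-5 token inclusion rate | 100% (7/7) ✅ |
| Top-10 token overlap | 100% (average 10.0/10) ✅ |
| Top-1 probability mean relative error | 0.84% ✅ (< 1%) |
| Top-1 probability max relative error | 3.76% (Prompt: "Machine learning is a") |
| Top-10 probability mean absolute error (MAE) | 0.00114 |
| Top-10 probability max absolute error | 0.01550 |
| Top-10 probability root mean square error (RMSE) | 0.00283 |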

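Conclusion: The NPU (bfloat16) output matches the CPU (float32) reference on every Top-1 prediction, with a mean relative probability error of 0.84%, below the 1% deviation threshold. One prompt reaches a Top-1 probability error of 3.76%, but this is because the Top-1 probability itself is relatively low (`subset`: 0.41 on CPU vs 0.40 on NPU); the absolute difference is only 0.0155 and does not change the selected token. Overall, the accuracy meets the requirements for production deployment.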
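
The aggregate figures can be recomputed from the per-prompt Top-1 probabilities in the comparison table. The sketch below is an independent check, not the original evaluation code. Because it starts from the four-decimal values printed in the table, the mean comes out at roughly 0.83%, slightly under the reported 0.84%, which was presumably computed from unrounded probabilities.

```python
# (CPU prob, NPU prob) of the top-1 token for each prompt, from the comparison table.
top1_probs = {
    "The capital of France is": (0.3537, 0.3524),
    "Einstein is known for": (0.7679, 0.7641),
    "Quantum computing is": (0.7091, 0.7051),
    "The meaning of life is": (0.9718, 0.9722),
    "Machine learning is a": (0.4118, 0.3963),
    "Python is a programming": (0.9973, 0.9970),
    "Natural language processing": (0.6236, 0.6271),
}

# Relative error is measured against the CPU (float32) reference.
rel_errors = {
    prompt: abs(cpu - npu) / cpu for prompt, (cpu, npu) in top1_probs.items()
}
mean_rel_err = sum(rel_errors.values()) / len(rel_errors)
worst_prompt = max(rel_errors, key=rel_errors.get)

print(f"mean relative error: {mean_rel_err:.2%}")
print(f"max relative error: {rel_errors[worst_prompt]:.2%} ({worst_prompt})")
```

The single outlier dominates the mean: the other six prompts all stay below 0.6%.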

Top-10 Probability Distribution Comparison (CPU vs NPU)

Prompt: "The capital of France is"
Rank | CPU Token   | CPU Prob | NPU Token   | NPU Prob | Match
-----|-------------|----------|-------------|----------|------
1    | 'a'         | 0.353706 | 'a'         | 0.352359 | ✅
2    | 'Paris'     | 0.275060 | 'Paris'     | 0.274417 | ✅
3    | 'known'     | 0.052770 | 'known'     | 0.054036 | ✅
4    | 'the'       | 0.034749 | 'the'       | 0.032774 | ✅
5    | 'not'       | 0.021897 | 'not'       | 0.022526 | ✅
6    | 'named'     | 0.018752 | 'located'   | 0.018674 | 
7    | 'located'   | 0.018359 | 'named'     | 0.018674 | 
8    | 'widely'    | 0.014278 | 'widely'    | 0.014544 | ✅
9    | 'an'        | 0.013214 | 'an'        | 0.013662 | ✅
10   | 'held'      | 0.012054 | 'held'      | 0.012835 | ✅

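Performance Evaluation

Throughput Test (Batch=10)

| max_tokens | Total time | Input throughput | Output throughput |
|------------|------------|------------------|-------------------|
| 32 | 1.07s | 59.6 tok/s | 298.2 tok/s |
| 64 | 1.95s | 32.8 tok/s | 316.3 tok/s |
| 128 | 3.81s | 16.8 tok/s | 336.4 tok/s |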
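
As a rough sanity check, output throughput can be estimated as batch size × max_tokens ÷ total time. The sketch below assumes that every request generates exactly `max_tokens` tokens, which the source does not state; requests that stop early would lower the real figure.

```python
BATCH_SIZE = 10

# (max_tokens, total seconds, reported output tok/s) from the throughput table.
runs = [(32, 1.07, 298.2), (64, 1.95, 316.3), (128, 3.81, 336.4)]

estimates = []
for max_tokens, seconds, reported in runs:
    # Assumption: all requests in the batch run to max_tokens.
    estimate = BATCH_SIZE * max_tokens / seconds
    estimates.append(estimate)
    deviation = abs(estimate - reported) / reported
    print(f"max_tokens={max_tokens}: estimated {estimate:.1f} tok/s, "
          f"reported {reported} tok/s ({deviation:.1%} apart)")
```

Under this assumption the estimates stay within a few percent of the reported values; the largest gap is in the `max_tokens=64` row.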

Single-Request Latency (max_tokens=128)

| Metric | Value |
|--------|-------|
| Input tokens | 11 |
| Output tokens | 128 |
| Total time | 3.67s |
| Generation throughput | 34.8 tok/s |

Long-Context Test (252 input tokens)

| Metric | Value |
|--------|-------|
| Prefill throughput | ~4273 tok/s |
| Decode latency | ~59ms/token |

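Memory Usage

| Metric | Value |
|--------|-------|
| Model load time | 0.33s |
| Model weight memory | 0.82 GiB |
| KV cache capacity | 41.78 GiB |
| Max concurrency (1024 tokens/req) | 3565× |
| Engine initialization time | 3.67s |
| Total NPU memory usage | 43.29 GiB / 61.27 GiB |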
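
The maximum-concurrency figure can be reproduced from the KV cache capacity and the model configuration. The sketch below rests on two assumptions not stated in the source: the per-head dimension is `hidden_size / num_attention_heads` = 64, and the KV cache stores BF16 values (2 bytes each).

```python
GIB = 1024 ** 3

# From the model configuration and memory usage tables.
num_hidden_layers = 24
num_key_value_heads = 2
kv_cache_gib = 41.78
tokens_per_request = 1024

head_dim = 1024 // 16   # assumption: hidden_size / num_attention_heads
bytes_per_value = 2     # assumption: BF16 KV cache

# Each token stores one key and one value vector per KV head in every layer.
bytes_per_token = (
    2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_value
)
total_tokens = kv_cache_gib * GIB / bytes_per_token
max_concurrency = int(total_tokens // tokens_per_request)

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"max concurrent 1024-token requests: {max_concurrency}")
```

With only 2 KV heads, each token needs 12 KiB of cache, which is why a 0.5B model can hold thousands of 1024-token requests in roughly 42 GiB.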

Typical Generation Examples

Example 1

Input: "What is artificial intelligence?" Output:

Artificial intelligence (AI) is the simulation of human intelligence processes by computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

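Example 2

Input: "请用中文简单介绍一下你自己。" ("Please briefly introduce yourself in Chinese.") Output:

我是 MiniCPM4，一个高效的语言模型，可以在各种设备上... ("I am MiniCPM4, an efficient language model that can run on a variety of devices...")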

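Statement

As a language model, MiniCPM generates content by learning from a vast amount of text.
However, it does not possess the ability to comprehend, nor can it express personal opinions or value judgments.
Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
Therefore, when using content generated by MiniCPM, users bear full responsibility for evaluating and verifying it themselves.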

License

This repository and the MiniCPM models are released under the Apache-2.0 License.

Citation

If you find our work valuable, please cite our paper.

@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}
