Granite-3.0-2B-Base

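Model Overview

Granite-3.0-2B-Base is a large language model in the IBM Granite family with about 2.06B parameters, aimed at general-purpose text generation. It uses Grouped Query Attention (GQA) and supports a context length of up to 4096 tokens.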

Property	Value
Model name	ibm-granite/granite-3.0-2b-base
Parameters	2.06B
Architecture	GraniteForCausalLM
Context length	4096 tokens
Precision	bfloat16
Activation function	SiLU / SwiGLU
Vocabulary size	49152
Language	English
License	Apache 2.0

Benchmark Results

Benchmark	Category	Score
MMLU	human-exams	55.0
MMLU-Pro	human-exams	23.79
AGI-Eval	human-exams	22.56
WinoGrande	commonsense	74.9
OBQA	commonsense	43.0
SIQA	commonsense	59.84
PIQA	commonsense	79.27
Hellaswag	commonsense	77.65
TruthfulQA	commonsense	39.9
BoolQ	reading-comprehension	81.35
SQuAD 2.0	reading-comprehension	25.22
ARC-C	reasoning	54.27
GPQA	reasoning	30.58
BBH	reasoning	40.69
MUSR	reasoning	34.34
HumanEval	code	38.41
MBPP	code	35.4
GSM8K	math	47.23
MATH	math	19.46

Model Architecture

Parameter	Value
Hidden size	2048
Transformer layers	40
Attention heads	32
KV heads (GQA)	8
FFN intermediate size	8192
Positional encoding	RoPE
Attention dropout	0.1
RMS Norm eps	1e-5
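
To make the effect of GQA concrete, the sketch below estimates the KV-cache footprint per token from the values in the table above and compares it with a hypothetical multi-head variant that keeps all 32 heads as KV heads. It assumes the per-head dimension equals hidden size divided by attention heads (64) and that the cache is stored in bfloat16 (2 bytes per value). Neither is stated in the table, and the helper name is illustrative.

```python
# Values taken from the architecture table above.
HIDDEN_SIZE = 2048
NUM_LAYERS = 40
NUM_HEADS = 32
NUM_KV_HEADS = 8
BYTES_PER_VALUE = 2  # assumption: KV cache kept in bfloat16


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache needed for one token: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumption: head_dim = hidden_size / num_heads.
head_dim = HIDDEN_SIZE // NUM_HEADS

gqa_bytes = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, head_dim, BYTES_PER_VALUE)
mha_bytes = kv_bytes_per_token(NUM_LAYERS, NUM_HEADS, head_dim, BYTES_PER_VALUE)

print(gqa_bytes)               # 81920 bytes (80 KiB) per token with 8 KV heads
print(mha_bytes)               # 327680 bytes (320 KiB) per token with 32 KV heads
print(mha_bytes // gqa_bytes)  # 4
```

Under these assumptions, sharing each KV head across four query heads cuts the cache to a quarter of the multi-head size.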

Inference Deployment

Using vLLM-Ascend (Huawei Ascend NPU)

This model has been adapted and validated on vLLM-Ascend and can be deployed on Huawei Ascend910B devices without issues.

# Install dependencies
pip install vllm vllm-ascend

# Start the OpenAI-compatible API server
python3 -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-3.0-2b-base \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager
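
The server takes some time to load the model before it accepts requests, so a client script may want to wait for it. The sketch below polls the `/v1/models` route of the OpenAI-compatible server until it answers. The helper names, timeout, and polling interval are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(base_url, probe=http_probe, timeout_s=120.0,
                     interval_s=2.0, sleep=time.sleep):
    """Poll `<base_url>/v1/models` until it responds or `timeout_s` elapses."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8000")
    print("server ready" if ready else "server did not come up in time")
```

The `probe` and `sleep` parameters are injectable only so the loop can be exercised without a running server.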

Inference Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# Completions
response = client.completions.create(
    model="ibm-granite/granite-3.0-2b-base",
    prompt="The capital of France is",
    max_tokens=50,
)
print(response.choices[0].text)

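# Chat Completions
# Note: a base model may ship without a chat template. If the server rejects
# this request, start it with a template via vLLM's --chat-template option.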
response = client.chat.completions.create(
    model="ibm-granite/granite-3.0-2b-base",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.0-2b-base",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-base")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

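Validation Details

Item	Details
Ascend adaptation	✅ Validated
Inference engine	vLLM-Ascend 0.18.0rc1
Hardware platform	Atlas 800 A2 (Ascend910B)
Service startup time	31 s
KV cache capacity	658,560 tokens
Throughput	30 req/s (10 concurrent requests)
Peak generation throughput	80.0 tokens/s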
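
Two of these figures are easier to read after a little arithmetic. The sketch below converts the KV cache capacity into the number of sequences that fit at the full 4096-token context, and uses Little's law to turn the throughput into a mean request latency. The latency estimate assumes a steady state in which all 10 concurrent slots always hold a request, which the table does not state. The function names are illustrative.

```python
def full_context_sequences(kv_cache_tokens, max_model_len):
    """How many sequences of maximum length fit in the KV cache at once."""
    return kv_cache_tokens // max_model_len


def mean_latency_s(concurrency, throughput_rps):
    """Little's law: L = lambda * W, so W = L / lambda."""
    return concurrency / throughput_rps


print(full_context_sequences(658_560, 4096))  # 160
print(round(mean_latency_s(10, 30.0), 3))     # 0.333 (seconds per request)
```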

Inference Output Evidence

The following are actual inference outputs from a service deployed with vLLM-Ascend on a Huawei Ascend910B.

Knowledge Q&A

Input: The capital of France is
Output: Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London.

Input: The largest planet in the solar system is
Output: Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun.

Math Reasoning

Input: If a train travels 120 miles in 2 hours, what is its average speed?
Output: A. 60 mph ... The average speed is the total distance traveled divided by the total time taken. Total distance traveled = 120 miles ... Average speed = 60 mph

Code Generation

Input: Write a Python function to compute fibonacci numbers:
Output (summarized): a complete fibonacci(n) implementation with input validation and an iterative loop.

Text Continuation

Input: Once upon a time, in a land far away,
Output (summarized): a coherent narrative featuring a character named Rosa and a storyline about starting a bakery.

All tests used temperature=0.1 and top_p=0.9. The outputs illustrate the model's behavior on knowledge Q&A, math reasoning, code generation, and creative writing.

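NPU vs CPU Accuracy Comparison

Output consistency was compared between the Huawei Ascend NPU (vLLM-Ascend, bfloat16) and a CPU baseline (Transformers, float32). Both runs used greedy decoding (temperature=0.0, top_p=1.0), so any difference should come from the inference backend or floating-point precision rather than from sampling randomness.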
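
For a comparison like this, it helps to pin the decoding settings explicitly on both sides instead of relying on each library's defaults. The sketch below builds matched greedy settings for the two backends. The `max_tokens` value of 50 and the explicit `repetition_penalty=1.0` are assumptions, since the settings used for the comparison beyond temperature and top_p are not given. The helper names are illustrative.

```python
def vllm_greedy_request(model, prompt, max_tokens=50):
    """Keyword arguments for `OpenAI().completions.create` against a vLLM server."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding
        "top_p": 1.0,
        # vLLM-specific sampling options travel in the request body.
        "extra_body": {"repetition_penalty": 1.0},
    }


def hf_greedy_generate_kwargs(max_new_tokens=50):
    """Keyword arguments for `model.generate` in Transformers."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,  # greedy decoding
        "num_beams": 1,
        "repetition_penalty": 1.0,
    }


request = vllm_greedy_request("ibm-granite/granite-3.0-2b-base", "Water boils at")
print(request["temperature"], hf_greedy_generate_kwargs()["do_sample"])
```

The first dictionary would be passed as `client.completions.create(**request)` and the second as `model.generate(**inputs, **hf_greedy_generate_kwargs())`.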

Comparison Results

Prompt	NPU output (bf16)	CPU output (fp32)
`The capital of France is`	Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London.	Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London.
`Water boils at`	100°C at sea level. The boiling point of water is the temperature at which the vapor pressure of the liquid equals the pressure surrounding the liquid	100°C at sea level. The boiling point of water is the temperature at which the vapor pressure of the liquid equals the pressure surrounding the liquid
`The largest planet in the solar system is`	Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun.	Jupiter. It is 11 times larger than Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun.
`The chemical symbol for gold is`	Au. Gold is a chemical element with the symbol Au (from Latin: aurum) and atomic number 79. It is a precious metal.	Au. Gold is a transition metal and a group 11 element. It is one of the least reactive chemical elements and is solid under standard conditions.

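Numerical Accuracy Metrics

Metric	Prompt 1 🟢	Prompt 2 🟢	Prompt 3 🟡	Prompt 4 🟡	Average
Token exact-match rate	100.00%	100.00%	22.73%	15.38%	59.53%
Character similarity	100.00%	100.00%	91.74%	37.84%	82.40%
First-10-token match rate	100.00%	100.00%	50.00%	40.00%	72.50%
ROUGE-1 F1	100.00%	100.00%	91.89%	38.10%	82.50%
First token matches	✅ Yes	✅ Yes	✅ Yes	✅ Yes	100%
Token count difference	0	0	1	4	n/a

Note: 🟢 = identical output; 🟡 = semantically equivalent (the first token matches, and the outputs then diverge along different decoding paths)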
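
How these metrics were computed is not stated here. The sketch below is one plausible reading: tokens are whitespace-separated words, token match is positional and divided by the longer length, character similarity is `difflib.SequenceMatcher.ratio()`, and ROUGE-1 F1 is taken over sets of unique words (a simplification of standard ROUGE-1, which counts repeated words). With these assumed definitions, the Prompt 3 values in the table are reproduced. The function names are illustrative.

```python
from difflib import SequenceMatcher


def token_match_rate(a, b, limit=None):
    """Share of positions where whitespace tokens agree, over the longer length."""
    ta, tb = a.split(), b.split()
    if limit is not None:
        ta, tb = ta[:limit], tb[:limit]
    longest = max(len(ta), len(tb))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(ta, tb)) / longest


def char_similarity(a, b):
    """Character-level similarity ratio from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def rouge1_f1_unique(a, b):
    """Unigram F1 over sets of unique whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


# Prompt 3 outputs from the comparison table.
npu = ("Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant "
       "with a mass one-thousandth that of the Sun.")
cpu = ("Jupiter. It is 11 times larger than Earth. Jupiter is a gas giant "
       "with a mass one-thousandth that of the Sun.")

print(f"{token_match_rate(npu, cpu):.2%}")            # 22.73%
print(f"{char_similarity(npu, cpu):.2%}")             # 91.74%
print(f"{token_match_rate(npu, cpu, limit=10):.2%}")  # 50.00%
print(f"{rouge1_f1_unique(npu, cpu):.2%}")            # 91.89%
```

Positional matching explains why the token exact-match rate drops so sharply for Prompt 3: one substituted phrase shifts every later word by one position, even though the remaining text is identical.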

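Conclusions

Exact-match rate: 2 of the 4 prompts (50%) reach a 100% exact match ✅
First-token match rate: 4/4 (100%). The first generated token is identical for every prompt, indicating that the model's preferred answer is the same on NPU and CPU ✅
Mean character similarity is 82.40%: Prompts 1 and 2 are identical, and Prompt 3 reaches 91.74%, with the difference coming from a paraphrase ("the size of" vs "larger than")
Prompt 4 differs more (37.84% character similarity): after "Au.", the NPU output continues with details about the element's symbol, while the CPU output describes its properties as a transition metal. Both are factually correct, reasonable continuations, so this is a fork in the decoding path rather than a precision error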
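No evidence of systematic precision loss: the divergence is attributed to differences between the vLLM and Transformers decoding implementations (such as repetition_penalty defaults and the order of logits processing). Under greedy decoding, a near-tie between two candidate tokens can also be flipped by bf16 rounding, after which the outputs follow different paths; this comparison does not separate the two causes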
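No factual errors: all outputs are semantically and factually correct, and bfloat16 inference on the NPU introduced no hallucinations or errors

Overall: on this four-prompt sample, bfloat16 inference on the Ascend NPU shows negligible accuracy loss relative to the CPU float32 baseline. Half of the prompts match exactly, and the other half produce semantically equivalent, reasonable outputs with no factual errors. On this basis, the deployment on Ascend910B is considered suitable for production use.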

Citation

@misc{granite-3.0-2b-base,
  author = {IBM},
  title = {Granite-3.0-2B-Base},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ibm-granite/granite-3.0-2b-base}
}

License

This project is licensed under Apache 2.0. See the LICENSE file for details.