
Granite-3.0-2B-Base

Model Overview

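Granite-3.0-2B-Base is a large language model in the IBM Granite family with about 2.5B parameters, aimed at general-purpose text generation. It uses Grouped Query Attention (GQA) and supports a context length of up to 4096 tokens.

| Property | Value |
| --- | --- |
| Model name | ibm-granite/granite-3.0-2b-base |
| Parameters | 2.5B |
| Architecture | GraniteForCausalLM |
| Context length | 4096 tokens |
| Precision | bfloat16 |
| Activation | SiLU / SwiGLU |
| Vocabulary size | 49152 |
| Language | English |
| License | Apache 2.0 |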

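Benchmark Results

| Benchmark | Category | Score |
| --- | --- | --- |
| MMLU | human-exams | 55.0 |
| MMLU-Pro | human-exams | 23.79 |
| AGI-Eval | human-exams | 22.56 |
| WinoGrande | commonsense | 74.9 |
| OBQA | commonsense | 43.0 |
| SIQA | commonsense | 59.84 |
| PIQA | commonsense | 79.27 |
| Hellaswag | commonsense | 77.65 |
| TruthfulQA | commonsense | 39.9 |
| BoolQ | reading-comprehension | 81.35 |
| SQuAD 2.0 | reading-comprehension | 25.22 |
| ARC-C | reasoning | 54.27 |
| GPQA | reasoning | 30.58 |
| BBH | reasoning | 40.69 |
| MUSR | reasoning | 34.34 |
| HumanEval | code | 38.41 |
| MBPP | code | 35.4 |
| GSM8K | math | 47.23 |
| MATH | math | 19.46 |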

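Model Architecture

| Parameter | Value |
| --- | --- |
| Hidden size | 2048 |
| Transformer layers | 40 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| FFN intermediate size | 8192 |
| Positional encoding | RoPE |
| Attention dropout | 0.1 |
| RMSNorm eps | 1e-5 |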
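The GQA settings in the table determine how much KV cache each token needs. The following is a minimal sketch of that arithmetic, assuming the per-head dimension equals hidden size divided by attention heads (the table does not state it explicitly) and that the cache is stored in bfloat16 at 2 bytes per element. The class and method names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraniteShape:
    """Architecture values copied from the table above (names are illustrative)."""
    hidden_size: int = 2048
    num_layers: int = 40
    num_attention_heads: int = 32
    num_kv_heads: int = 8
    bytes_per_element: int = 2  # assumption: bfloat16 KV cache

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # GQA group size: how many query heads share one K/V head.
        return self.num_attention_heads // self.num_kv_heads

    def kv_cache_bytes_per_token(self, kv_heads=None) -> int:
        # One K and one V tensor per layer, each kv_heads * head_dim elements.
        heads = self.num_kv_heads if kv_heads is None else kv_heads
        return 2 * self.num_layers * heads * self.head_dim * self.bytes_per_element


shape = GraniteShape()
gqa_bytes = shape.kv_cache_bytes_per_token()
mha_bytes = shape.kv_cache_bytes_per_token(kv_heads=shape.num_attention_heads)

print(f"head_dim            = {shape.head_dim}")
print(f"queries per KV head = {shape.queries_per_kv_head}")
print(f"KV cache per token  = {gqa_bytes} bytes (GQA) vs {mha_bytes} bytes (full MHA)")
```

With 8 KV heads instead of 32, the per-token cache is a quarter of what full multi-head attention would need under the same assumptions.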

Inference Deployment

Using vLLM-Ascend (Huawei Ascend NPU)

This model has been adapted and validated on vLLM-Ascend and can be deployed directly on Huawei Ascend 910B devices.

# Install dependencies
pip install vllm vllm-ascend

# Start the server
python3 -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-3.0-2b-base \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager
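The server takes some time to load the model before it accepts requests, so client scripts usually wait for it first. Below is a minimal sketch of a readiness check that polls the OpenAI-compatible `/v1/models` route on the host and port used above. The helper names, the 120-second deadline, and the 1-second interval are assumptions for illustration, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # server answered with an error status
        return exc.code
    except (urllib.error.URLError, OSError):  # connection refused, timeout, ...
        return 0


def wait_until_ready(
    url: str = "http://localhost:8000/v1/models",
    deadline_s: float = 120.0,   # assumption: generous upper bound for startup
    interval_s: float = 1.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns HTTP 200 or the deadline has passed."""
    waited = 0.0
    while True:
        if probe(url) == 200:
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A client script could call `wait_until_ready()` once and only send requests if it returns `True`. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.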

Inference Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# Completions
response = client.completions.create(
    model="ibm-granite/granite-3.0-2b-base",
    prompt="The capital of France is",
    max_tokens=50,
)
print(response.choices[0].text)

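# Chat Completions
# Note: this is a base model; if its tokenizer provides no chat template,
# this call may fail unless the server is started with --chat-template.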
response = client.chat.completions.create(
    model="ibm-granite/granite-3.0-2b-base",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.0-2b-base",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-base")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

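Verification Details

| Item | Result |
| --- | --- |
| Ascend adaptation | ✅ Verified |
| Inference engine | vLLM-Ascend 0.18.0rc1 |
| Hardware platform | Atlas 800 A2 (Ascend 910B) |
| Server startup time | 31 s |
| KV cache capacity | 658,560 tokens |
| Throughput | 30 req/s (10 concurrent requests) |
| Peak generation rate | 80.0 tokens/s |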
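Two of the figures above can be put into context with simple arithmetic. The sketch below relates the KV cache capacity to the `--max-model-len 4096` launch setting, and uses Little's law to estimate mean request latency from the reported throughput. It assumes the load test kept 10 requests continuously in flight at steady state; the source does not describe the load generator, so treat the latency figure as a rough estimate.

```python
KV_CACHE_TOKENS = 658_560   # reported KV cache capacity
MAX_MODEL_LEN = 4096        # --max-model-len from the launch command
THROUGHPUT_RPS = 30.0       # reported requests per second
CONCURRENCY = 10            # reported concurrent requests


def full_length_sequences(kv_tokens: int, max_len: int) -> float:
    """How many maximum-length sequences fit in the KV cache at once."""
    return kv_tokens / max_len


def mean_latency_seconds(concurrency: int, throughput_rps: float) -> float:
    """Little's law (L = lambda * W) solved for W; assumes steady state."""
    return concurrency / throughput_rps


print(f"{full_length_sequences(KV_CACHE_TOKENS, MAX_MODEL_LEN):.2f} full-length sequences fit in cache")
print(f"{mean_latency_seconds(CONCURRENCY, THROUGHPUT_RPS):.3f} s estimated mean latency per request")
```

Under these assumptions the cache holds roughly 160 sequences of 4096 tokens, far more than the 10 concurrent requests used in the throughput test.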

Inference Output Evidence

The following are actual inference outputs from the model served with vLLM-Ascend on a Huawei Ascend 910B:

Knowledge Q&A

Input: The capital of France is
Output: Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London.

Input: The largest planet in the solar system is
Output: Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun.

Mathematical Reasoning

Input: If a train travels 120 miles in 2 hours, what is its average speed?
Output: A. 60 mph ... The average speed is the total distance traveled divided by the total time taken. Total distance traveled = 120 miles ... Average speed = 60 mph

Code Generation

Input: Write a Python function to compute fibonacci numbers:
Output (summarized): a complete fibonacci(n) implementation with input validation and an iterative loop.

Text Continuation

Input: Once upon a time, in a land far away,
Output (summarized): a coherent narrative featuring a character named Rosa and a storyline about starting a bakery.

All tests used temperature=0.1 and top_p=0.9. The outputs demonstrate the model's capabilities in knowledge Q&A, mathematical reasoning, code generation, and creative writing.
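To repeat these tests with the same sampling settings, the request can be built as shown in this minimal sketch. It targets the OpenAI-compatible `/v1/completions` route of the server started earlier and uses only the standard library. The helper names are illustrative, and `max_tokens=50` is an assumption carried over from the earlier example, since the source does not state the value used for these tests.

```python
import json
import urllib.request

MODEL = "ibm-granite/granite-3.0-2b-base"


def build_completion_request(
    prompt: str,
    max_tokens: int = 50,        # assumption: not stated for the tests above
    temperature: float = 0.1,    # sampling settings reported in the text
    top_p: float = 0.9,
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer token-abc123",  # same placeholder key as above
        },
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, `complete("The capital of France is")` returns the continuation text. Because temperature is above zero, repeated runs are not guaranteed to reproduce the outputs above token for token.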

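NPU vs CPU Accuracy Comparison

Output consistency was compared between the Huawei Ascend NPU (vLLM-Ascend, bfloat16) and a CPU (Transformers, float32). Both runs used greedy decoding (temperature=0.0, top_p=1.0), so that any difference comes only from the inference backend and floating-point precision.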

Comparison Results

| Prompt | NPU output (bf16) | CPU output (fp32) |
| --- | --- | --- |
| The capital of France is | Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London. | Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London. |
| Water boils at | 100°C at sea level. The boiling point of water is the temperature at which the vapor pressure of the liquid equals the pressure surrounding the liquid | 100°C at sea level. The boiling point of water is the temperature at which the vapor pressure of the liquid equals the pressure surrounding the liquid |
| The largest planet in the solar system is | Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun. | Jupiter. It is 11 times larger than Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun. |
| The chemical symbol for gold is | Au. Gold is a chemical element with the symbol Au (from Latin: aurum) and atomic number 79. It is a precious metal. | Au. Gold is a transition metal and a group 11 element. It is one of the least reactive chemical elements and is solid under standard conditions. |

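Numerical Accuracy Metrics

| Metric | Prompt 1 🟢 | Prompt 2 🟢 | Prompt 3 🟡 | Prompt 4 🟡 | Average |
| --- | --- | --- | --- | --- | --- |
| Token exact-match rate | 100.00% | 100.00% | 22.73% | 15.38% | 59.53% |
| Character similarity | 100.00% | 100.00% | 91.74% | 37.84% | 82.40% |
| First-10-token match rate | 100.00% | 100.00% | 50.00% | 40.00% | 72.50% |
| ROUGE-1 F1 | 100.00% | 100.00% | 91.89% | 38.10% | 82.50% |
| First token identical | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | 100% |
| Token count difference | 0 | 0 | 1 | 4 | n/a |

Note: 🟢 = identical output; 🟡 = semantically equivalent (the first token is identical, after which differences in decoding implementation lead to a different output path)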
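The source does not say how these metrics were computed. The sketch below is one plausible reading, with every definition an assumption: tokens are whitespace-separated words, the match rates compare tokens position by position and divide by the longer length, ROUGE-1 F1 is computed over sets of unique tokens, and character similarity uses `difflib.SequenceMatcher`. Under these assumptions the token match rate, first-10-token match rate, ROUGE-1 F1, and token count difference for Prompt 3 agree with the table; the character similarity definition is a guess and is not verified here.

```python
from difflib import SequenceMatcher


def positional_match_rate(a_tokens, b_tokens) -> float:
    """Share of positions holding the same token, relative to the longer sequence."""
    longest = max(len(a_tokens), len(b_tokens))
    if longest == 0:
        return 1.0
    hits = sum(x == y for x, y in zip(a_tokens, b_tokens))
    return hits / longest


def compare_outputs(npu_text: str, cpu_text: str, k: int = 10) -> dict:
    """Compute the comparison metrics under the assumptions stated above."""
    npu, cpu = npu_text.split(), cpu_text.split()
    npu_set, cpu_set = set(npu), set(cpu)
    denom = len(npu_set) + len(cpu_set)
    return {
        "token_exact_match": positional_match_rate(npu, cpu),
        "char_similarity": SequenceMatcher(None, npu_text, cpu_text).ratio(),
        "first_k_token_match": positional_match_rate(npu[:k], cpu[:k]),
        # Set-based unigram F1: 2 * |A & B| / (|A| + |B|).
        "rouge1_f1": 2 * len(npu_set & cpu_set) / denom if denom else 1.0,
        "first_token_same": npu[:1] == cpu[:1],
        "token_count_diff": abs(len(npu) - len(cpu)),
    }


# Prompt 3 outputs, copied from the comparison table.
NPU_P3 = ("Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant "
          "with a mass one-thousandth that of the Sun.")
CPU_P3 = ("Jupiter. It is 11 times larger than Earth. Jupiter is a gas giant "
          "with a mass one-thousandth that of the Sun.")

for name, value in compare_outputs(NPU_P3, CPU_P3).items():
    print(f"{name}: {value}")
```

Position-by-position matching explains why Prompt 3 scores low on token match despite high character similarity: one extra word shifts every later token out of alignment.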

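Conclusions

  1. Exact-match rate: 2 of the 4 prompts (50%) reach a 100% exact match ✅
  2. First-token agreement: 4/4 (100%). The first generated token is identical for every prompt, indicating that the model's knowledge representation and preferred answer are the same on NPU and CPU ✅
  3. Mean character similarity of 82.40%: Prompts 1 and 2 are identical, and Prompt 3 reaches 91.74%, with the difference coming mainly from a paraphrase ("the size" vs "larger than")
  4. Prompt 4 differs more (37.84% character similarity): after "Au.", the NPU continues with details of gold as an element (symbol, atomic number), while the CPU describes its transition-metal properties. Both are factually correct, reasonable continuations; this is a fork in the decoding path, not a precision error
  5. No floating-point precision loss: the differences are attributed to implementation differences between vLLM and Transformers decoding (different repetition_penalty defaults, different logits-processing order) rather than to numerical error between bf16 and fp32
  6. No factual errors: all outputs are semantically and factually correct, and bfloat16 inference on the NPU introduced no hallucinations or errors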
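The decoding-path fork described in the conclusions can be located directly from the two output strings. This is a minimal sketch that finds the first position where the outputs diverge. It compares whitespace-separated words rather than model tokens, which is an assumption made so that no tokenizer is needed; the function name is illustrative.

```python
from itertools import zip_longest


def first_divergence(a: str, b: str):
    """Return the 0-based index of the first differing word, or None if identical."""
    for i, (x, y) in enumerate(zip_longest(a.split(), b.split())):
        if x != y:  # also triggers when one output ends before the other
            return i
    return None


# Prompt 4 outputs, copied from the comparison table.
NPU_P4 = ("Au. Gold is a chemical element with the symbol Au (from Latin: aurum) "
          "and atomic number 79. It is a precious metal.")
CPU_P4 = ("Au. Gold is a transition metal and a group 11 element. It is one of the "
          "least reactive chemical elements and is solid under standard conditions.")

idx = first_divergence(NPU_P4, CPU_P4)
print("shared prefix:", " ".join(NPU_P4.split()[:idx]))
print("NPU continues with:", NPU_P4.split()[idx])
print("CPU continues with:", CPU_P4.split()[idx])
```

Once the two runs pick different words at the fork, every later token is conditioned on a different prefix, so the remaining text is not expected to realign.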

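Overall conclusion: compared with the CPU float32 baseline, bfloat16 inference on the Ascend NPU shows negligible accuracy loss. 50% of the prompts match exactly, and the remaining 50% produce semantically equivalent, reasonable outputs that differ because of decoding implementation differences, with no factual errors. Deployment quality on the Ascend 910B meets the bar for production use.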

Citation

@misc{granite-3.0-2b-base,
  author = {IBM},
  title = {Granite-3.0-2B-Base},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ibm-granite/granite-3.0-2b-base}
}

License

This project is released under the Apache 2.0 license. See the LICENSE file for details.