Granite-3.0-2B-Base 是 IBM Granite 系列的大语言模型,参数量约 2.06B,专注于通用的文本生成任务。采用 Grouped Query Attention (GQA) 架构,支持最高 4096 tokens 的上下文长度。
| 属性 | 值 |
|---|---|
| 模型名称 | ibm-granite/granite-3.0-2b-base |
| 参数量 | 2.06B |
| 架构 | GraniteForCausalLM |
| 上下文长度 | 4096 tokens |
| 精度 | bfloat16 |
| 激活函数 | SiLU / SwiGLU |
| 词表大小 | 49152 |
| 语言 | 英文 |
| 许可证 | Apache 2.0 |
| 评测任务 | 数据集 | 得分 |
|---|---|---|
| MMLU | human-exams | 55.0 |
| MMLU-Pro | human-exams | 23.79 |
| AGI-Eval | human-exams | 22.56 |
| WinoGrande | commonsense | 74.9 |
| OBQA | commonsense | 43.0 |
| SIQA | commonsense | 59.84 |
| PIQA | commonsense | 79.27 |
| Hellaswag | commonsense | 77.65 |
| TruthfulQA | commonsense | 39.9 |
| BoolQ | reading-comprehension | 81.35 |
| SQuAD 2.0 | reading-comprehension | 25.22 |
| ARC-C | reasoning | 54.27 |
| GPQA | reasoning | 30.58 |
| BBH | reasoning | 40.69 |
| MUSR | reasoning | 34.34 |
| HumanEval | code | 38.41 |
| MBPP | code | 35.4 |
| GSM8K | math | 47.23 |
| MATH | math | 19.46 |
| 参数 | 值 |
|---|---|
| 隐藏层维度 | 2048 |
| Transformer 层数 | 40 |
| 注意力头数 | 32 |
| KV 头数 (GQA) | 8 |
| FFN 中间层 | 8192 |
| 位置编码 | RoPE |
| 注意力 dropout | 0.1 |
| RMS Norm eps | 1e-5 |
本模型已在 vLLM-Ascend 上完成适配验证,可在华为昇腾 Ascend910B 设备上无障碍部署。
# 安装依赖
pip install vllm vllm-ascend
# 启动服务
python3 -m vllm.entrypoints.openai.api_server \
--model ibm-granite/granite-3.0-2b-base \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--dtype bfloat16 \
--trust-remote-code \
--enforce-eagerfrom openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
# Completions
response = client.completions.create(
model="ibm-granite/granite-3.0-2b-base",
prompt="The capital of France is",
max_tokens=50,
)
print(response.choices[0].text)
# Chat Completions
response = client.chat.completions.create(
model="ibm-granite/granite-3.0-2b-base",
messages=[{"role": "user", "content": "What is the capital of France?"}],
max_tokens=50,
)
print(response.choices[0].message.content)from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"ibm-granite/granite-3.0-2b-base",
torch_dtype="bfloat16",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-base")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))| 项目 | 内容 |
|---|---|
| 昇腾适配 | ✅ 通过验证 |
| 推理引擎 | vLLM-Ascend 0.18.0rc1 |
| 硬件平台 | Atlas 800 A2 (Ascend910B) |
| 服务启动耗时 | 31 秒 |
| KV Cache 容量 | 658,560 tokens |
| 吞吐性能 | 30 req/s (10 并发) |
| Generation 峰值 | 80.0 tokens/s |
以下为在华为昇腾 Ascend910B 上通过 vLLM-Ascend 部署服务的实际推理输出:
输入: The capital of France is
输出: Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London.
输入: The largest planet in the solar system is
输出: Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun.
输入: If a train travels 120 miles in 2 hours, what is its average speed?
输出: A. 60 mph ... The average speed is the total distance traveled divided by the total time taken. Total distance traveled = 120 miles ... Average speed = 60 mph
输入: Write a Python function to compute fibonacci numbers:
输出: 生成了包含输入校验和循环计算的完整 fibonacci(n) 函数实现。
输入: Once upon a time, in a land far away,
输出: 生成了连贯的叙事文本,包含人物 Rosa 和面包店创业的故事线。
所有测试均使用
temperature=0.1, top_p=0.9参数,输出展示了模型在知识问答、数学推理、代码生成和创造性写作方面的能力。
在华为昇腾 NPU(vLLM-Ascend, bfloat16)与 CPU(Transformers, float32)之间进行了输出一致性对比。两组推理均使用 greedy 解码(temperature=0.0, top_p=1.0),确保差异仅源于推理后端和浮点精度。
| Prompt | NPU 输出 (bf16) | CPU 输出 (fp32) |
|---|---|---|
The capital of France is | Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London. | Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of the United Kingdom is London. |
Water boils at | 100°C at sea level. The boiling point of water is the temperature at which the vapor pressure of the liquid equals the pressure surrounding the liquid | 100°C at sea level. The boiling point of water is the temperature at which the vapor pressure of the liquid equals the pressure surrounding the liquid |
The largest planet in the solar system is | Jupiter. It is 11 times the size of Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun. | Jupiter. It is 11 times larger than Earth. Jupiter is a gas giant with a mass one-thousandth that of the Sun. |
The chemical symbol for gold is | Au. Gold is a chemical element with the symbol Au (from Latin: aurum) and atomic number 79. It is a precious metal. | Au. Gold is a transition metal and a group 11 element. It is one of the least reactive chemical elements and is solid under standard conditions. |
| 指标 | Prompt 1 🟢 | Prompt 2 🟢 | Prompt 3 🟡 | Prompt 4 🟡 | 平均 |
|---|---|---|---|---|---|
| Token 精确匹配率 | 100.00% | 100.00% | 22.73% | 15.38% | 59.53% |
| 字符相似度 | 100.00% | 100.00% | 91.74% | 37.84% | 82.40% |
| 前10 Token 匹配率 | 100.00% | 100.00% | 50.00% | 40.00% | 72.50% |
| ROUGE-1 F1 | 100.00% | 100.00% | 91.89% | 38.10% | 82.50% |
| 首个 Token 一致 | ✅ 是 | ✅ 是 | ✅ 是 | ✅ 是 | 100% |
| Token 数量差 | 0 个 | 0 个 | 1 个 | 4 个 | — |
注:🟢 = 完全一致;🟡 = 语义等价(首个 token 一致,后续因解码策略差异产生不同输出路径)
repetition_penalty 默认值差异、logits 处理顺序差异),而非 bf16→fp32 的数值精度误差总体结论:在昇腾 NPU 上使用 bfloat16 推理与 CPU float32 baseline 相比,精度损失可忽略不计。50% 的 prompt 实现完全精确匹配,剩余 50% 因解码实现差异产生语义等价的合理输出,无事实性错误。模型在 Ascend910B 上的部署质量达到生产可用标准。
@misc{granite-3.0-2b-base,
author = {IBM},
title = {Granite-3.0-2B-Base},
year = {2024},
publisher = {HuggingFace},
url = {https://huggingface.co/ibm-granite/granite-3.0-2b-base}
}本项目使用 Apache 2.0 许可证。详情见 LICENSE 文件。