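This document records the quick deployment and verification results for the bigcode/starcoder2-3b model in a vLLM-Ascend environment.
Starcoder2-3b is a 3-billion-parameter code generation model developed by BigCode. It is based on the Starcoder2 architecture and supports code completion and generation across many programming languages.
Related download links:
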
| Component | Version |
|---|---|
| vllm-ascend | 0.18.0rc1 |
| vllm | 0.18.0 |
| transformers | 4.57.6 |
| torch-npu | 2.9.0 |
| CANN | 8.5.1 |
| SOC | ascend910_9391 |

Hardware: 1 logical card (Ascend 910B2, 64 GB HBM)

Model path: `/home/openmind/.cache/modelscope/hub/models/bigcode/starcoder2-3b`

| Parameter | Value |
|---|---|
| Architecture | Starcoder2ForCausalLM |
| Parameters | 3B |
| Layers | 30 |
| Hidden size | 3072 |
| Attention heads | 24 |
| KV heads | 2 (GQA) |
| Vocabulary size | 49152 |
| Precision | Float16 (vLLM inference) |

    # Download from ModelScope
    modelscope download --model bigcode/starcoder2-3b

Load the model with vLLM-Ascend and run inference:

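    from vllm import LLM, SamplingParams

    # Model identifier (or a local directory containing the weights)
    MODEL_PATH = "bigcode/starcoder2-3b"

    llm = LLM(
        model=MODEL_PATH,
        trust_remote_code=True,
        dtype="float16",             # half-precision inference
        tensor_parallel_size=1,      # single card
        max_model_len=2048,
        gpu_memory_utilization=0.9,
    )

    # Greedy decoding (temperature=0), at most 64 new tokens
    sampling = SamplingParams(max_tokens=64, temperature=0)
    outputs = llm.generate(["The capital of France is"], sampling)
    print(outputs[0].outputs[0].text)
    # Paris.

Run the four smoke tests to verify that the model works:
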
    python3 smoke_test.py

Output:

    === starcoder2-3b vLLM-Ascend Smoke Test ===
    [1/4] Testing basic generation...
    PASS: Paris.
    [2/4] Testing factual knowledge...
    PASS: 4
    [3/4] Testing batch generation...
    [1] Jupiter, which is 318 times the mass of Earth...
    [2] print("Hello, World!")...
    [3] The sky is blue because...
    PASS: 3 prompts processed
    [4/4] Testing code generation...
    PASS: if n == 0:
    === All 4/4 tests PASSED (78.1s) ===

What each check verifies:

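| Test | Input | Expected | Result |
|---|---|---|---|
| Basic English generation | `The capital of France is` | Contains "Paris" | PASS |
| Factual knowledge | `What is 2 + 2?` | Contains "4" | PASS |
| Batch generation | 3 prompts | 3 non-empty outputs | PASS |
| Code generation | `def fibonacci(n):` | Non-empty code continuation | PASS |
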
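
The source does not include `smoke_test.py`, so the following is only a minimal sketch of how the four checks in the table could be expressed. The function `run_smoke_checks`, the `generate` callable, the stand-in `fake_generate`, and the three batch prompts are all hypothetical names introduced for illustration; they are not taken from the actual script.

```python
from typing import Callable, List


def run_smoke_checks(generate: Callable[[List[str]], List[str]]) -> List[bool]:
    """Run four checks mirroring the table and return one pass/fail flag each."""
    results = []

    # 1. Basic generation: the continuation should mention "Paris".
    out = generate(["The capital of France is"])[0]
    results.append("Paris" in out)

    # 2. Factual knowledge: the answer should contain "4".
    out = generate(["What is 2 + 2?"])[0]
    results.append("4" in out)

    # 3. Batch generation: three prompts (placeholders here) must all
    #    produce non-empty text.
    batch = generate(["prompt a", "prompt b", "prompt c"])
    results.append(len(batch) == 3 and all(o.strip() for o in batch))

    # 4. Code generation: any non-empty continuation counts as a pass.
    out = generate(["def fibonacci(n):"])[0]
    results.append(bool(out.strip()))

    return results


def fake_generate(prompts: List[str]) -> List[str]:
    """Stand-in for the model so the sketch runs without an NPU."""
    canned = {
        "The capital of France is": " Paris.",
        "What is 2 + 2?": " 4",
        "def fibonacci(n):": "\n    if n == 0:",
    }
    return [canned.get(p, "some text") for p in prompts]


checks = run_smoke_checks(fake_generate)
print(f"{sum(checks)}/{len(checks)} checks passed")
```

With a real model, `generate` could wrap `llm.generate(prompts, sampling)` and return `o.outputs[0].text` for each result, matching the inference snippet earlier in this document.
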
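Dataset: modelscope/ai2_arc

    python3 eval_arc_vllm.py --output logs/arc_challenge_vllm.json

| Metric | Baseline | NPU result | Difference |
|---|---|---|---|
| ARC-Challenge 25-shot | 39.10% | 38.57% (452/1172) | -0.53 pp |

The NPU accuracy on ARC-Challenge is 38.57%, which is 0.53 percentage points below the 39.10% baseline. The gap is under 1 percentage point, so verification passes.
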
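
The pass criterion above is simple arithmetic, and the short sketch below reproduces it from the reported counts. The variable names are illustrative, and the 1-percentage-point tolerance is taken from the "< 1%" threshold stated in this report.

```python
# Figures reported in this document
correct, total = 452, 1172
baseline_pct = 39.10
tolerance_pp = 1.0  # pass if the gap is under 1 percentage point

npu_pct = 100.0 * correct / total        # accuracy on the NPU, in percent
diff_pp = npu_pct - baseline_pct         # signed gap in percentage points
passed = abs(diff_pp) < tolerance_pp

print(f"NPU accuracy: {npu_pct:.2f}%")
print(f"Difference:   {diff_pp:+.2f} pp")
print(f"Within tolerance: {passed}")
```
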
    === starcoder2-3b ARC-Challenge 25-shot vLLM-Ascend ===
    Total test samples: 1172
    Loading model with vLLM-Ascend...
    Model loaded.
    Generating (1172 samples)...
    ============================================================
    starcoder2-3b ARC-Challenge 25-shot Results (vLLM-Ascend)
    ============================================================
    Accuracy: 38.57% (452/1172)
    Total time: 8s
    Throughput: 139.07 samples/s
    ============================================================
    Results saved to logs/arc_challenge_vllm.json

Run the performance benchmark:

    python3 benchmark.py

| Scenario | Output tokens | Throughput | TPOT |
|---|---|---|---|
| Single request | 128 | 22.27 tok/s | 44.9 ms/tok |
| Batch (8 concurrent) | 512 | 170.95 tok/s | N/A |

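
The averages in the table can be reproduced from the per-run figures in the benchmark log. The sketch below does so with the logged values. The variable names are illustrative, and the batch-over-single speedup is a derived figure that the original report does not state.

```python
from statistics import mean

# Per-run figures copied from the benchmark log in this document
single_tok_s = [21.86, 21.75, 22.43, 22.59, 22.71]   # single prompt, 128 tokens
single_tpot_ms = [45.7, 46.0, 44.6, 44.3, 44.0]      # time per output token
batch_tok_s = [166.59, 174.66, 171.60]               # 8 prompts x 64 tokens

avg_single = mean(single_tok_s)
avg_tpot = mean(single_tpot_ms)
avg_batch = mean(batch_tok_s)

# Derived: aggregate throughput gain from batching 8 requests
speedup = avg_batch / avg_single

print(f"Single: {avg_single:.2f} tok/s, TPOT {avg_tpot:.1f} ms/tok")
print(f"Batch:  {avg_batch:.2f} tok/s ({speedup:.2f}x single-request throughput)")
```

The logged TPOT values appear to be elapsed time divided by generated tokens (for example, 5.85 s / 128 tokens ≈ 45.7 ms), which is why the TPOT is roughly 1000 divided by the tok/s figure.
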
    === starcoder2-3b vLLM-Ascend Benchmark ===
    --- Single prompt (128 tokens) ---
    Run 1: 128 tokens, 5.85s, 21.86 tok/s, TPOT 45.7 ms/tok
    Run 2: 128 tokens, 5.88s, 21.75 tok/s, TPOT 46.0 ms/tok
    Run 3: 128 tokens, 5.71s, 22.43 tok/s, TPOT 44.6 ms/tok
    Run 4: 128 tokens, 5.67s, 22.59 tok/s, TPOT 44.3 ms/tok
    Run 5: 128 tokens, 5.64s, 22.71 tok/s, TPOT 44.0 ms/tok
    Avg: 22.27 tok/s, TPOT 44.9 ms/tok
    --- Batch (8 prompts, 64 tokens each) ---
    Run 1: 512 tokens, 3.07s, 166.59 tok/s
    Run 2: 512 tokens, 2.93s, 174.66 tok/s
    Run 3: 512 tokens, 2.98s, 171.60 tok/s
    Avg: 170.95 tok/s
    ==================================================
    Single prompt: 22.27 tok/s, TPOT 44.9 ms/tok
    Batch (8): 170.95 tok/s
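    ==================================================

Code model: this is a code generation model intended mainly for programming tasks; its natural-language ability is relatively limited.

Memory footprint: with float16 inference the model occupies about 6 GB of HBM, consistent with 3B parameters at 2 bytes each, so a single card is sufficient.

GQA architecture: the model uses Grouped Query Attention, with 2 KV heads shared by 24 query heads, which shrinks the KV cache and improves inference efficiency.

Inference speed: 22.27 tok/s for a single request and 170.95 tok/s for a batch of 8 concurrent requests.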
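
The memory and GQA notes above can be checked with a back-of-the-envelope estimate based on the model parameter table. This is a rough sketch, not a measurement. It assumes the head dimension equals hidden size divided by attention heads, treats "3B" as exactly 3e9 parameters, and ignores activations and runtime overhead.

```python
# Values from the model parameter table in this document
params = 3e9            # "3B", treated as exactly 3e9 (assumption)
bytes_per_value = 2     # float16
layers, hidden, heads, kv_heads = 30, 3072, 24, 2
head_dim = hidden // heads  # assumption: 3072 / 24 = 128

# Weight memory in GB (decimal)
weights_gb = params * bytes_per_value / 1e9

# KV cache per token: keys and values (factor 2) for every layer and KV head
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value

# Same quantity if every attention head kept its own K/V (no GQA)
mha_bytes_per_token = 2 * layers * heads * head_dim * bytes_per_value

# KV cache for one full 2048-token sequence, in MiB
kv_2048_mib = kv_bytes_per_token * 2048 / 2**20

print(f"Weights: ~{weights_gb:.1f} GB")
print(f"KV cache per token: {kv_bytes_per_token} bytes "
      f"(vs {mha_bytes_per_token} without GQA)")
print(f"KV cache for 2048 tokens: {kv_2048_mib:.0f} MiB")
```

Under these assumptions, the KV cache is 12 times smaller than it would be with 24 independent KV heads, which is the efficiency gain the GQA note refers to.
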