Deploying starcoder2-3b on vLLM-Ascend + NPU

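1. Introduction

This document records a quick deployment of the bigcode/starcoder2-3b model in a vLLM-Ascend environment, together with the validation results.

Starcoder2-3b is a 3-billion-parameter code generation model developed by BigCode. It is based on the Starcoder2 architecture and supports code completion and generation across multiple programming languages.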

2. Validation environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0`
`CANN`	`8.5.1`
`SOC`	`ascend910_9391`

NPU: 1 logical card (Ascend 910B2, 64 GB HBM)
Model path: /home/openmind/.cache/modelscope/hub/models/bigcode/starcoder2-3b

3. Model configuration

Parameter	Value
Architecture	Starcoder2ForCausalLM
Parameter count	3B
Layers	30
Hidden size	3072
Attention heads	24
KV heads	2 (GQA)
Vocabulary size	49152
Precision	float16 (vLLM inference)
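
To make the effect of the 2 KV heads concrete, the following sketch estimates the KV cache size per token from the values in the table. It assumes the hidden size is split evenly across the attention heads (so the head dimension is 3072 / 24) and that keys and values are cached in float16; the function name is illustrative and not part of vLLM.

```python
# Values taken from the model configuration table.
NUM_LAYERS = 30
HIDDEN_SIZE = 3072
NUM_ATTENTION_HEADS = 24
NUM_KV_HEADS = 2
BYTES_PER_VALUE = 2  # float16


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache one token's keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V


# Assumption: the hidden size is divided evenly among the attention heads.
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS

gqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, head_dim, BYTES_PER_VALUE)
# Hypothetical comparison: the same model with one KV head per attention head.
mha_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_ATTENTION_HEADS, head_dim, BYTES_PER_VALUE)

print(f"head_dim = {head_dim}")
print(f"KV cache per token with GQA: {gqa_bytes / 1024:.0f} KiB")
print(f"KV cache per token with one KV head per attention head: {mha_bytes / 1024:.0f} KiB")
print(f"Reduction factor: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache is 12 times smaller than it would be with one KV head per attention head, which leaves more device memory for longer contexts or larger batches.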

4. Model download

# Download from ModelScope
modelscope download --model bigcode/starcoder2-3b

5. Basic inference validation

Load the model with vLLM-Ascend and run inference:

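from vllm import LLM, SamplingParams

# Model ID; point this at the local model path from section 2 to load the downloaded weights.
MODEL_PATH = "bigcode/starcoder2-3b"

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    dtype="float16",             # precision used throughout this validation
    tensor_parallel_size=1,      # single NPU card
    max_model_len=2048,          # maximum context length (prompt + output tokens)
    gpu_memory_utilization=0.9,  # fraction of device memory vLLM may use
)

# temperature=0 selects greedy decoding
sampling = SamplingParams(max_tokens=64, temperature=0)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
# Paris.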

6. Smoke validation

6.1 Smoke test script

Run the 4 smoke tests to verify that the model works:

python3 smoke_test.py

6.2 Results

=== starcoder2-3b vLLM-Ascend Smoke Test ===
[1/4] Testing basic generation...
  PASS: Paris.
[2/4] Testing factual knowledge...
  PASS: 4
[3/4] Testing batch generation...
  [1] Jupiter, which is 318 times the mass of Earth...
  [2] print("Hello, World!")...
  [3] The sky is blue because...
  PASS: 3 prompts processed
[4/4] Testing code generation...
  PASS: if n == 0:
=== All 4/4 tests PASSED (78.1s) ===

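What each test checks:

Test	Input	Expected	Result
Basic English generation	The capital of France is	Contains Paris	PASS
Factual knowledge	What is 2 + 2?	Contains 4	PASS
Batch generation	3 prompts	3 non-empty outputs	PASS
Code generation	def fibonacci(n):	Non-empty code continuation	PASS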
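
smoke_test.py itself is not reproduced in this document. As a rough illustration of the four checks in the table, the sketch below expresses them against any function that maps a list of prompts to a list of generated texts. The names `run_checks`, `Generate`, and `BATCH_PROMPTS` are hypothetical, and the three batch prompts are placeholders because the actual prompts are not listed here.

```python
from typing import Callable, Dict, List

# Any callable that turns a list of prompts into a list of generated texts.
Generate = Callable[[List[str]], List[str]]

# Placeholder batch prompts (assumption: the real ones are not given in this document).
BATCH_PROMPTS = [
    "The largest planet in the solar system is",
    "# Print a greeting in Python\n",
    "Why is the sky blue?",
]


def run_checks(generate: Generate) -> Dict[str, bool]:
    """Apply the four pass criteria from the table and report each result."""
    results = {}

    basic = generate(["The capital of France is"])[0]
    results["basic generation"] = "Paris" in basic

    fact = generate(["What is 2 + 2?"])[0]
    results["factual knowledge"] = "4" in fact

    batch = generate(BATCH_PROMPTS)
    results["batch generation"] = (
        len(batch) == len(BATCH_PROMPTS) and all(text.strip() for text in batch)
    )

    code = generate(["def fibonacci(n):"])[0]
    results["code generation"] = bool(code.strip())

    return results


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without an NPU; replace with a real model call.
    def echo(prompts: List[str]) -> List[str]:
        return [f"{p} Paris 4" for p in prompts]

    for name, ok in run_checks(echo).items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
```

With the `llm` and `sampling` objects from section 5, a suitable callable would be `lambda prompts: [o.outputs[0].text for o in llm.generate(prompts, sampling)]`.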

7. Accuracy evaluation

7.1 Evaluation configuration

Dataset: ARC-Challenge (25-shot)
Data source: ModelScope modelscope/ai2_arc
Evaluation framework: vLLM-Ascend offline inference
Device: NPU (npu:0)

python3 eval_arc_vllm.py --output logs/arc_challenge_vllm.json
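
eval_arc_vllm.py is not reproduced in this document, so the prompt format it uses is unknown. As a minimal sketch of what a k-shot multiple-choice evaluation generally involves, the code below assembles a few-shot prompt and reads the predicted choice from the generated text. The field names (`question`, `labels`, `choices`, `answer`), the prompt layout, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def format_question(item: Dict, with_answer: bool) -> str:
    """Render one multiple-choice item; solved examples include the answer label."""
    lines = [f"Question: {item['question']}"]
    for label, text in zip(item["labels"], item["choices"]):
        lines.append(f"{label}. {text}")
    lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(shots: List[Dict], target: Dict, k: int = 25) -> str:
    """Concatenate up to k solved examples followed by the unsolved target question."""
    blocks = [format_question(shot, with_answer=True) for shot in shots[:k]]
    blocks.append(format_question(target, with_answer=False))
    return "\n\n".join(blocks)


def extract_choice(generated: str, labels: List[str]) -> Optional[str]:
    """Take the first non-whitespace character of the output if it is a valid label."""
    stripped = generated.strip()
    if stripped and stripped[0] in labels:
        return stripped[0]
    return None


if __name__ == "__main__":
    example = {
        "question": "Which gas do plants absorb from the air?",
        "labels": ["A", "B", "C", "D"],
        "choices": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
        "answer": "B",
    }
    prompt = build_few_shot_prompt([example], example, k=1)
    print(prompt)
    print(extract_choice(" B. Carbon dioxide", example["labels"]))
```

Accuracy is then the share of test questions whose extracted label matches the reference answer.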

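7.2 Evaluation results

Metric	Baseline	NPU result	Difference
ARC-Challenge 25-shot	39.10%	38.57% (452/1172)	-0.53 pp

On the NPU, ARC-Challenge accuracy is 38.57%, which is 0.53 percentage points below the 39.10% baseline. The gap is under 1 percentage point, so the validation passes.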
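
The pass criterion can be reproduced from the raw counts with a few lines of arithmetic. This is a sketch only; the variable names are illustrative, and the 1 percentage point tolerance is the threshold stated in this section.

```python
CORRECT = 452         # correct answers on the NPU run
TOTAL = 1172          # ARC-Challenge test samples
BASELINE_PCT = 39.10  # baseline accuracy in percent
TOLERANCE_PP = 1.0    # allowed gap in percentage points

npu_pct = 100 * CORRECT / TOTAL
diff_pp = npu_pct - BASELINE_PCT
passed = abs(diff_pp) < TOLERANCE_PP

print(f"NPU accuracy: {npu_pct:.2f}%")
print(f"Difference: {diff_pp:+.2f} percentage points")
print("PASS" if passed else "FAIL")
```

The difference is expressed in percentage points (an absolute gap between two percentages), not as a relative change.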

7.3 Evaluation log

=== starcoder2-3b ARC-Challenge 25-shot vLLM-Ascend ===
Total test samples: 1172
Loading model with vLLM-Ascend...
Model loaded.
Generating (1172 samples)...
============================================================
starcoder2-3b ARC-Challenge 25-shot Results (vLLM-Ascend)
============================================================
  Accuracy: 38.57% (452/1172)
  Total time: 8s
  Throughput: 139.07 samples/s
============================================================
Results saved to logs/arc_challenge_vllm.json

8. Performance test

8.1 Test script

python3 benchmark.py

8.2 Test results

Scenario	Output tokens	Throughput	TPOT
Single request	128	22.27 tok/s	44.9 ms/tok
Batch (8 concurrent)	512	170.95 tok/s	n/a

8.3 Test log

=== starcoder2-3b vLLM-Ascend Benchmark ===
--- Single prompt (128 tokens) ---
  Run 1: 128 tokens, 5.85s, 21.86 tok/s, TPOT 45.7 ms/tok
  Run 2: 128 tokens, 5.88s, 21.75 tok/s, TPOT 46.0 ms/tok
  Run 3: 128 tokens, 5.71s, 22.43 tok/s, TPOT 44.6 ms/tok
  Run 4: 128 tokens, 5.67s, 22.59 tok/s, TPOT 44.3 ms/tok
  Run 5: 128 tokens, 5.64s, 22.71 tok/s, TPOT 44.0 ms/tok
  Avg: 22.27 tok/s, TPOT 44.9 ms/tok
--- Batch (8 prompts, 64 tokens each) ---
  Run 1: 512 tokens, 3.07s, 166.59 tok/s
  Run 2: 512 tokens, 2.93s, 174.66 tok/s
  Run 3: 512 tokens, 2.98s, 171.60 tok/s
  Avg: 170.95 tok/s
==================================================
  Single prompt: 22.27 tok/s, TPOT 44.9 ms/tok
  Batch (8):     170.95 tok/s
==================================================
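
The averages reported in section 8.2 can be checked against the per-run figures in the log. The sketch below recomputes them and also derives the batch speedup, which the log does not print; variable names are illustrative. TPOT (time per output token) is treated here as the reciprocal of single-request throughput, which matches the logged values to one decimal place.

```python
from statistics import mean

# Per-run figures copied from the benchmark log.
single_tok_s = [21.86, 21.75, 22.43, 22.59, 22.71]
single_tpot_ms = [45.7, 46.0, 44.6, 44.3, 44.0]
batch_tok_s = [166.59, 174.66, 171.60]

avg_single = mean(single_tok_s)
avg_tpot = mean(single_tpot_ms)
avg_batch = mean(batch_tok_s)

# For a single request, time per output token is 1000 ms divided by tokens per second.
tpot_from_throughput = 1000 / avg_single

# Aggregate throughput gain from batching 8 prompts, relative to one request.
speedup = avg_batch / avg_single

print(f"Single request: {avg_single:.2f} tok/s, TPOT {avg_tpot:.1f} ms/tok")
print(f"TPOT derived from throughput: {tpot_from_throughput:.1f} ms/tok")
print(f"Batch (8): {avg_batch:.2f} tok/s")
print(f"Batch speedup over single request: {speedup:.2f}x")
```

The speedup comes out a little under 8x, so batching 8 prompts scales aggregate throughput close to linearly in this run. Note that the two scenarios generate different output lengths (128 versus 64 tokens per prompt), so the comparison is indicative rather than exact.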

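9. Notes

Code model: this is a code generation model intended mainly for programming tasks; its natural-language ability is relatively limited.
Memory footprint: in float16 the model weights occupy about 6 GB of HBM (roughly 3B parameters x 2 bytes each), so a single card is sufficient.
GQA architecture: the model uses Grouped Query Attention with 2 KV heads, which shrinks the KV cache and helps inference efficiency.
Inference speed: 22.27 tok/s for a single request, and 170.95 tok/s for a batch of 8 concurrent requests.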
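
The memory figure above follows from simple arithmetic, sketched below. It uses the nominal 3B parameter count (the exact count is not given in this document) and counts weights only; the KV cache, activations, and runtime overhead come on top, and vLLM additionally reserves memory up to the configured gpu_memory_utilization.

```python
NOMINAL_PARAMS = 3e9       # assumption: nominal "3B" parameter count
BYTES_PER_PARAM = 2        # float16
HBM_GB = 64                # HBM capacity of the card used in this validation
MEMORY_UTILIZATION = 0.9   # gpu_memory_utilization from section 5

weights_bytes = NOMINAL_PARAMS * BYTES_PER_PARAM
weights_gb = weights_bytes / 1e9
weights_gib = weights_bytes / 2**30
fraction_of_hbm = weights_gb / HBM_GB

# Memory vLLM may use in total, and what is left after the weights are loaded.
budget_gb = HBM_GB * MEMORY_UTILIZATION
remaining_gb = budget_gb - weights_gb

print(f"Weights: {weights_gb:.1f} GB ({weights_gib:.2f} GiB)")
print(f"Share of HBM used by weights: {fraction_of_hbm:.1%}")
print(f"Budget at 0.9 utilization: {budget_gb:.1f} GB, remaining after weights: {remaining_gb:.1f} GB")
```

The weights take up less than a tenth of the card, which is why a single NPU is enough and most of the memory budget remains available for the KV cache.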
