Gemma-3-270M on vLLM-Ascend 0.18.0rc1

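1. Introduction

This document records the quick deployment and verification results for google/gemma-3-270m-it on vLLM-Ascend 0.18.0rc1.

Gemma 3 is Google's new generation of lightweight open models; the 270M-parameter variant suits edge deployment and quick validation. On Ascend NPUs, vLLM 0.18.0 natively supports the Gemma3ForCausalLM architecture, so the model can be deployed directly without any extra model adaptation code.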

2. Verification Environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1+gitee7ba04`

NPU: 1 logical device (Atlas 800 A2, NPU 910B4)
Model path: /opt/atomgit/weights/gemma-3-270m-it
Service port: 8000

3. Starting the Service

Before starting, check whether the port is already in use:

ss -lntp | grep ':8000 ' || true
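
The same check can be scripted from Python when the launch is automated. This is a minimal sketch, not part of the original scripts: the helper name `port_in_use` is hypothetical, and unlike `ss`, which lists listening sockets, it tests the port by attempting a TCP connection.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

Calling `port_in_use(8000)` before `vllm serve` and aborting when it returns True avoids a bind failure at startup.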

Verified launch command:

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
export ASCEND_LOG_PATH=/opt/atomgit/ascend/log

vllm serve /opt/atomgit/weights/gemma-3-270m-it \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 1 \
  --tensor-parallel-size 1 \
  --seed 1024 \
  --served-model-name gemma-3-270m-it \
  --max-num-seqs 32 \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"enable_cpu_binding":true}'

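Parameter notes:

Parameter	Value	Description
`--tensor-parallel-size`	1	Tensor parallel size; a single device is enough for 270M
`--data-parallel-size`	1	Data parallel size
`--max-model-len`	32768	Maximum context length (32k)
`--max-num-seqs`	32	Maximum number of concurrent requests per DP group
`--max-num-batched-tokens`	4096	Maximum number of tokens processed per step
`--gpu-memory-utilization`	0.90	HBM utilization
`--compilation-config`	FULL_DECODE_ONLY	Graph compilation mode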

4. Smoke Test

Basic checks:

curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-270m-it",
    "messages": [
      {"role": "user", "content": "What is the capital of France? Answer in one word."}
    ],
    "temperature": 0,
    "max_tokens": 16
  }'

Results:

/v1/models returns 200
/v1/chat/completions returns 200
The model generates text normally, and the inference path works end to end
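
The two curl checks can also be run from Python, which is convenient in CI. The following is a minimal sketch using only the standard library; the function names `build_chat_request` and `smoke_test` are hypothetical, while the URL, model name, and payload are copied from the curl example. It assumes the server returns the usual OpenAI-compatible response shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # host and port from the serve command


def build_chat_request(prompt: str, model: str = "gemma-3-270m-it",
                       max_tokens: int = 16) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt: str = "What is the capital of France? Answer in one word.") -> str:
    """Call both endpoints; urlopen raises HTTPError on non-2xx statuses."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/models", timeout=30) as resp:
        assert resp.status == 200
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        assert resp.status == 200
        body = json.load(resp)
    # Assumption: OpenAI-compatible layout, text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the service running, `print(smoke_test())` prints the model's reply; any failure surfaces as an exception rather than a silent non-zero exit code.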

5. Performance Reference

5.1 Offline throughput (vllm bench throughput)

Test conditions: 16 prompts, input=200, output=200

Metric	Value
`Throughput`	`0.99 requests/s`
`Total tokens/s`	`1135.51 tok/s`
`Output tokens/s`	`126.17 tok/s`

5.2 Online throughput (custom load test)

Test conditions: 100 prompts, concurrency=8, input≈200, output≈200

Metric	Value
`duration`	`293.38 s`
`request_throughput`	`0.34 req/s`
`output_token_throughput`	`68.17 tok/s`
`total_token_throughput`	`82.15 tok/s`
`mean_latency_ms`	`22553.97 ms`
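
For readers reproducing these numbers, the throughput fields are plain ratios over the wall-clock duration of the run. The sketch below shows one way to derive them; it is an assumption about how such metrics are typically computed, not the code in eval/perf_eval.py, and the `RunSummary` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Totals collected over one load-test run (hypothetical container)."""
    num_requests: int
    duration_s: float
    total_input_tokens: int
    total_output_tokens: int


def throughput_metrics(run: RunSummary) -> dict:
    """Derive per-second rates from the totals of a single run."""
    return {
        "request_throughput": run.num_requests / run.duration_s,
        "output_token_throughput": run.total_output_tokens / run.duration_s,
        "total_token_throughput":
            (run.total_input_tokens + run.total_output_tokens) / run.duration_s,
    }
```

For the run above, 100 requests over 293.38 s gives 100 / 293.38 ≈ 0.34 req/s, which matches the reported `request_throughput`.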

5.3 Latency (custom single-request test)

Test conditions: batch=1, input≈200, output≈200, 10 iterations

Metric	Value
`mean_total_ms`	`22624.23 ms`
`median_total_ms`	`22661.48 ms`
`p99_total_ms`	`23253.73 ms`
`mean_ttft_ms`	`6787.27 ms`
`mean_tpot_ms`	`79.18 ms`
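
The latency fields can be aggregated from per-request timings as sketched below. This is an illustrative sketch under stated assumptions, not the original test code: the percentile uses the nearest-rank definition, and TPOT is taken as the time after the first token divided by the remaining output tokens. All function names are hypothetical.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tpot_ms(total_ms, ttft_ms, output_tokens):
    """Time per output token, excluding the first token (covered by TTFT)."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_ms - ttft_ms) / (output_tokens - 1)


def summarize(total_ms_samples):
    """Aggregate end-to-end latencies (in ms) from repeated single requests."""
    return {
        "mean_total_ms": statistics.mean(total_ms_samples),
        "median_total_ms": statistics.median(total_ms_samples),
        "p99_total_ms": percentile(total_ms_samples, 99),
    }
```

With only 10 iterations, the nearest-rank p99 is simply the slowest sample, so it is best read as a worst-case figure for this run.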

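Note: the absolute performance of the lightweight 270M model on the NPU is lower than that of larger models; the figures above mainly serve to verify deployment feasibility.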

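6. Accuracy Evaluation

6.1 Lightweight consistency check

Because the 270M model is small, it was not evaluated on complex benchmarks such as GSM8K; a lightweight consistency check was run instead:

Test item	Result
Consistency test (temperature=0, 3 rounds)	Outputs are identical
Basic reasoning (arithmetic/facts/logic)	The model responds; content quality is limited by parameter count
Chinese test	The model responds; content quality is limited by parameter count

Conclusion: inference on the Ascend NPU is consistent across runs, and the deployment path is complete. Output quality is limited by the parameter count, which is expected.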
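
The consistency test amounts to sending the same prompt several times at temperature=0 and comparing the outputs. A minimal sketch follows; it is not the code in eval/accuracy_eval.py, and `check_determinism` is a hypothetical helper. The `generate` argument is any callable that maps a prompt to the generated text, for example a thin wrapper around the chat completions endpoint.

```python
from typing import Callable, Dict, List


def check_determinism(generate: Callable[[str], str],
                      prompts: List[str],
                      rounds: int = 3) -> Dict[str, bool]:
    """Call generate() `rounds` times per prompt and report whether all outputs match."""
    report = {}
    for prompt in prompts:
        outputs = [generate(prompt) for _ in range(rounds)]
        # A single distinct output means every round produced the same text.
        report[prompt] = len(set(outputs)) == 1
    return report
```

Passing the generator in as a callable keeps the check independent of the transport, so the same function works for API mode and offline mode.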

6.2 NPU/CPU numerical consistency

NPU inference output is compared against a CPU baseline (float32) at the logits and hidden-state level:

Metric	Logits	Hidden States
max_abs_error	0.000109	0.000710
mean_abs_error	0.000014	0.000006
relative_error	0.0009%	0.0039%
cosine_similarity	1.000000	1.000000
threshold	1.0%	1.0%
Result	PASS	PASS

Conclusion: the NPU output closely matches the CPU baseline, with cosine_similarity = 1.0 and relative_error < 0.004%; the check passes.
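
The metrics in the table can be computed from two flattened tensors as sketched below. The source does not define `relative_error`, so this sketch assumes the L2 norm of the difference divided by the L2 norm of the CPU baseline; the function name `compare` and the pure-Python list representation are also illustrative choices, not the original evaluation code.

```python
import math


def compare(npu, cpu, threshold=0.01):
    """Compare two equal-length flat sequences of floats (NPU output vs CPU baseline)."""
    if len(npu) != len(cpu) or not npu:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(npu, cpu)]
    norm_npu = math.sqrt(sum(a * a for a in npu))
    norm_cpu = math.sqrt(sum(b * b for b in cpu))
    if norm_npu == 0 or norm_cpu == 0:
        raise ValueError("zero-norm input: relative error and cosine are undefined")
    # Assumption: relative error = ||npu - cpu||_2 / ||cpu||_2.
    relative_error = math.sqrt(sum(d * d for d in diffs)) / norm_cpu
    dot = sum(a * b for a, b in zip(npu, cpu))
    return {
        "max_abs_error": max(diffs),
        "mean_abs_error": sum(diffs) / len(diffs),
        "relative_error": relative_error,
        "cosine_similarity": dot / (norm_npu * norm_cpu),
        "passed": relative_error < threshold,  # 0.01 corresponds to the 1.0% threshold
    }
```

In practice the same arithmetic would usually run on tensors directly; the list form here only keeps the example free of third-party dependencies.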

7. Inference Script

inference.py is provided and supports both API mode and offline mode:

# API mode (requires vllm serve to be running)
python inference.py --mode api --prompt "Hello, how are you?"

# Offline mode (loads the model directly)
python inference.py --mode offline \
  --model /opt/atomgit/weights/gemma-3-270m-it \
  --prompt "Hello, how are you?"

# Interactive chat
python inference.py --mode api --interactive
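
The command-line surface used above could be parsed as in the following sketch. It only mirrors the four flags shown in the examples (`--mode`, `--model`, `--prompt`, `--interactive`); the validation rules and function names are assumptions for illustration and may differ from the actual inference.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Gemma-3-270M inference helper (sketch)")
    parser.add_argument("--mode", choices=["api", "offline"], required=True,
                        help="api: call a running vllm serve; offline: load the model directly")
    parser.add_argument("--model", help="local weights path, used in offline mode")
    parser.add_argument("--prompt", help="single prompt to send")
    parser.add_argument("--interactive", action="store_true",
                        help="start an interactive chat loop")
    return parser


def parse_args(argv=None):
    """Parse argv and apply cross-flag checks (assumed rules, not from the source)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "offline" and not args.model:
        parser.error("--model is required in offline mode")
    if not args.interactive and not args.prompt:
        parser.error("pass --prompt or --interactive")
    return args
```

Keeping the cross-flag checks in `parse_args` means a wrong invocation fails immediately with a usage message instead of after the model has loaded.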

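8. Notes

Device ID: in this environment the NPU's physical ID is 5 (as shown by npu-smi info), but there is no need to set ASCEND_RT_VISIBLE_DEVICES explicitly at deployment; vLLM-Ascend discovers available devices automatically.
Log directory: if the warning "can not create directory, directory: /home/atomgit/ascend/log" appears, set export ASCEND_LOG_PATH=/opt/atomgit/ascend/log (or any writable path).
Model capability: the 270M model is mainly intended for verifying the deployment path; for complex reasoning tasks, a larger Gemma 3 variant (such as 4B/12B/27B) is recommended.
Graph compilation: the first launch requires ACL Graph warmup, which takes about 30 seconds; request latency is stable afterwards.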
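
Because of the warmup at startup, scripts that launch the service and then send requests need to wait until the server answers. A minimal polling sketch follows; the helper names are hypothetical, and the 120-second default timeout is an arbitrary margin over the roughly 30-second warmup, not a measured value.

```python
import time
import urllib.error
import urllib.request


def server_ready(url: str = "http://127.0.0.1:8000/v1/models") -> bool:
    """Return True once the models endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timeout: the server is still starting.
        return False


def wait_until_ready(probe=server_ready, timeout_s: float = 120.0,
                     interval_s: float = 2.0) -> bool:
    """Poll probe() until it succeeds or timeout_s elapses; always probes at least once."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A launcher can call `wait_until_ready()` right after starting `vllm serve` and run the smoke test only when it returns True.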

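9. Evaluation Materials

eval/accuracy_eval.py: source code for the lightweight accuracy evaluation
eval/perf_eval.py: source code for the performance evaluation
perf_results/: run logs and result reports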