Llama-3.2-1B on vLLM-Ascend 0.18.0rc1 #+NPU

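1. Introduction

This document records the quick deployment and verification results for LLM-Research/Llama-3.2-1B (pretrained base model, 1.23B parameters) on vLLM-Ascend 0.18.0rc1.

Llama 3.2 1B is an ultra-lightweight multilingual large language model released by Meta. It uses an optimized Transformer architecture with GQA (Grouped-Query Attention), a 128K-token context length, a 128K-entry vocabulary, and BF16 precision. The model runs natively on Ascend NPUs through vLLM-Ascend: its architecture, LlamaForCausalLM, is fully supported, so no additional adaptation is required.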

2. Verification Environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`CANN`	`8.5.1`
`SOC`	`ascend910_9391`

NPU: 1 logical card (Ascend910, 64 GB HBM)
Model path: /opt/atomgit/Llama-3.2-1B
Service port: 8000

3. Model Configuration

{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2048,
  "intermediate_size": 8192,
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "head_dim": 64,
  "max_position_embeddings": 131072,
  "rope_theta": 500000.0,
  "vocab_size": 128256,
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
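As a rough cross-check of the "1.23B parameters" figure, the parameter count can be estimated from the configuration values alone. The sketch below assumes a standard Llama-style decoder layer (q/k/v/o projections, a gated MLP with three matrices, two RMSNorm weights per layer, no bias terms); the function name `count_parameters` is illustrative and not part of any library.

```python
# Values copied from the configuration above.
CONFIG = {
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_attention_heads": 32,
    "num_hidden_layers": 16,
    "num_key_value_heads": 8,
    "head_dim": 64,
    "vocab_size": 128256,
    "tie_word_embeddings": True,
}


def count_parameters(cfg):
    """Estimate the parameter count of a Llama-style decoder (assumes no biases)."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]  # gate, up, down projections
    norms = 2 * h  # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embedding = cfg["vocab_size"] * h
    # With tied embeddings the output head reuses the input embedding matrix.
    lm_head = 0 if cfg["tie_word_embeddings"] else embedding
    final_norm = h
    return embedding + cfg["num_hidden_layers"] * per_layer + final_norm + lm_head


params = count_parameters(CONFIG)
weight_gib = params * 2 / 2**30  # BF16 stores 2 bytes per parameter
print(f"parameters: {params:,}")
print(f"BF16 weights: {weight_gib:.2f} GiB")
```

Under these assumptions the estimate lands at roughly 1.24 billion parameters and about 2.3 GiB of BF16 weights, which is in line with the figures quoted in this document.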

4. Starting the Service

Before starting, check whether the port is already in use:

ss -lntp | grep ':8000 ' || true
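For scripted pre-flight checks, a similar test can be sketched in Python. Note that this is a different technique from `ss`: it attempts a TCP connection instead of listing listening sockets, so it only detects listeners reachable on the given host. The helper name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if port_in_use(8000):
    print("port 8000 is already in use")
else:
    print("port 8000 is free")
```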

Launch command that passed verification:

export ASCEND_VISIBLE_DEVICES=0,1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1

vllm serve /opt/atomgit/Llama-3.2-1B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --seed 1024 \
  --served-model-name llama-3.2-1b \
  --max-model-len 4096 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --dtype bfloat16

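Notes:

The 1B model runs on a single card (TP=1); the model weights occupy about 2.3 GB of HBM.
max-model-len can be adjusted as needed; the model supports a context of up to 128K tokens.
ASCEND_VISIBLE_DEVICES must be set to logical device numbers (starting from 0), not physical chip IDs.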
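When raising max-model-len, the KV cache grows linearly with the context length. The following back-of-the-envelope sketch uses the layer count, KV head count, and head dimension from the model configuration, and assumes a BF16 KV cache (2 bytes per element) with no paging overhead; actual allocation by vLLM may differ.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of key and value cache stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = key + value


per_token = kv_cache_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)
per_sequence = per_token * 4096  # --max-model-len 4096
worst_case = per_sequence * 16   # --max-num-seqs 16

print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"per sequence: {per_sequence / 2**20:.0f} MiB")
print(f"16 sequences: {worst_case / 2**30:.0f} GiB")
```

With the launch settings above, this estimate stays far below the 64 GB of HBM on one card, which is why a single device is sufficient even with 16 concurrent full-length sequences.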

5. Smoke Test

Basic check:

curl -sf http://127.0.0.1:8000/v1/models

Inference test:

curl -sf http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b",
    "prompt": "The capital of France is",
    "max_tokens": 32,
    "temperature": 0
  }'

Results:

/v1/models returns 200, and the model llama-3.2-1b is loaded.
/v1/completions returns 200 with the output: Paris. It is the most visited city in the world.
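The inference request can also be issued from Python using only the standard library, which may be convenient in automated checks. This is a minimal sketch; the helper names `build_completion_request` and `complete` are hypothetical, and the response is assumed to follow the OpenAI-compatible completions schema (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding, as in the curl example
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, timeout=30):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage, with the server from section 4 running:
# print(complete("http://127.0.0.1:8000", "llama-3.2-1b", "The capital of France is"))
```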

6. Accuracy Evaluation

The model was evaluated on MMLU (5-shot) and compared against the official baseline.

Evaluation command

python scripts/eval_accuracy.py --api http://127.0.0.1:8000 --model_name llama-3.2-1b --baseline 32.2 --output scripts/accuracy_results.json

Evaluation configuration

Parameter	Value
Dataset	MMLU
Evaluation method	5-shot
Total samples	14,042
Subjects	57
Temperature	0

Evaluation results

Metric	NPU result	Official baseline (GPU)	Difference (percentage points)
Macro Avg Accuracy	33.00%	32.2%	+0.80
Micro Avg Accuracy	32.57%	-	-

Selected subject details

Subject	Correct/Total	Accuracy
abstract_algebra	28/100	28.0%
anatomy	54/135	40.0%
business_ethics	36/100	36.0%
computer_security	54/100	54.0%
high_school_geography	81/198	40.9%
international_law	56/121	46.3%
marketing	108/234	46.2%
miscellaneous	346/783	44.2%
medical_genetics	43/100	43.0%
us_foreign_policy	49/100	49.0%
world_religions	67/171	39.2%
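The difference between the macro and micro averages reported above can be illustrated with the rows of this table. The sketch below is not the logic of scripts/eval_accuracy.py, which is not shown in this document. It covers only the 11 listed subjects out of 57, so its output will not reproduce the headline 33.00% and 32.57% figures.

```python
# (correct, total) pairs copied from the per-subject table above.
SUBJECTS = {
    "abstract_algebra": (28, 100),
    "anatomy": (54, 135),
    "business_ethics": (36, 100),
    "computer_security": (54, 100),
    "high_school_geography": (81, 198),
    "international_law": (56, 121),
    "marketing": (108, 234),
    "miscellaneous": (346, 783),
    "medical_genetics": (43, 100),
    "us_foreign_policy": (49, 100),
    "world_religions": (67, 171),
}


def micro_average(results):
    """Pool all questions: total correct divided by total questions."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


def macro_average(results):
    """Average the per-subject accuracies, weighting every subject equally."""
    accuracies = [c / t for c, t in results.values()]
    return sum(accuracies) / len(accuracies)


micro = micro_average(SUBJECTS)
macro = macro_average(SUBJECTS)
print(f"micro average over listed subjects: {micro:.2%}")
print(f"macro average over listed subjects: {macro:.2%}")
```

Large subjects such as miscellaneous (783 questions) dominate the micro average, while the macro average gives each subject the same weight regardless of size.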

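Conclusion: the MMLU 5-shot macro-average accuracy on the NPU is 33.00%, which is 0.80 percentage points above the official GPU baseline of 32.2% and within the 1-percentage-point accuracy tolerance. Verification passed.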
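The pass criterion can be written down as a small check. This is a sketch under the assumption that the tolerance applies to the absolute gap in percentage points; the function name `within_tolerance` is hypothetical and the actual logic behind the --baseline option of the evaluation script is not shown here.

```python
def within_tolerance(measured, baseline, max_gap_points=1.0):
    """Return the signed gap in percentage points and whether it is within the limit."""
    gap = measured - baseline
    return gap, abs(gap) < max_gap_points


gap, passed = within_tolerance(measured=33.00, baseline=32.2)
print(f"gap: {gap:+.2f} points, passed: {passed}")
```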

7. Performance Reference

Test conditions: 128 input tokens / 128 output tokens / concurrency=4, 20 requests.

Metric	Value
`duration`	`6.10 s`
`request_throughput`	`3.277 req/s`
`output_throughput`	`419.5 tok/s`
`total_token_throughput`	`698.07 tok/s`
`mean_latency_ms`	`1219.28 ms`
`median_latency_ms`	`1217.45 ms`
`p99_latency_ms`	`1269.51 ms`
`mean_tpot_ms`	`9.53 ms/token`
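Some of these figures can be cross-checked from the test conditions. The sketch below assumes every request produced exactly 128 output tokens and approximates per-token latency as mean latency divided by output tokens; benchmark tools often define TPOT differently (for example, excluding time to first token), so small deviations from the table are expected. The function name `derived_metrics` is hypothetical.

```python
def derived_metrics(num_requests, output_tokens_per_request, duration_s, mean_latency_ms):
    """Recompute throughput and per-token latency from the test conditions."""
    return {
        "request_throughput": num_requests / duration_s,
        "output_throughput": num_requests * output_tokens_per_request / duration_s,
        "latency_per_output_token_ms": mean_latency_ms / output_tokens_per_request,
    }


metrics = derived_metrics(
    num_requests=20,
    output_tokens_per_request=128,
    duration_s=6.10,
    mean_latency_ms=1219.28,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```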

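8. Notes

Device numbering: ASCEND_VISIBLE_DEVICES must use logical device numbers starting from 0. For example, if the system reports physical chip IDs 4 and 5, set it to 0,1. A wrong device number causes an aclInit error (error code 107001).
Architecture compatibility: LlamaForCausalLM is natively supported by vLLM-Ascend 0.18.0rc1; no additional adaptation code or operator replacement is needed.
Memory usage: the 1B BF16 model weights occupy about 2.3 GB of HBM, leaving ample headroom on a single Ascend910 card (64 GB). The preallocation ratio can be controlled with --gpu-memory-utilization.
Base model: this verification uses the pretrained base model (not the Instruct variant), so the output is plain text continuation. For chat scenarios, meta-llama/Llama-3.2-1B-Instruct is recommended.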