Llama-3.2-3B Ascend NPU Adaptation and Validation Report

Model: LLM-Research/Llama-3.2-3B | Hardware: Ascend910 | Backend: vLLM-Ascend 0.18.0rc1
MMLU 5-shot: 58.01% (NPU) vs 58.0% (GPU), a difference of +0.01 percentage points, PASS

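1. Introduction

This document records the adaptation and validation results for LLM-Research/Llama-3.2-3B (the pretrained base model, 3.21B parameters) on Huawei Ascend NPUs.

Llama 3.2 3B is a lightweight multilingual large language model released by Meta. It uses an optimized Transformer architecture with GQA (Grouped-Query Attention), a 128K-token context length, a 128K-entry vocabulary, and BF16 precision.

Adaptation conclusion: the LlamaForCausalLM architecture is natively supported by vLLM-Ascend 0.18.0, so the model runs out of the box with no extra adaptation code or operator replacement.

Links:

Model weights (ModelScope): https://modelscope.cn/models/LLM-Research/Llama-3.2-3B
Docker image: quay.io/ascend/vllm-ascend:v0.18.0rc1
Reference documentation: https://docs.vllm.ai/projects/ascend/zh-cn/v0.18.0/tutorials/models/Qwen3.5-27B.html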

2. Validation Environment

Component	Version
`vllm-ascend`	`0.18.0rc1`
`vllm`	`0.18.0+empty`
`transformers`	`4.57.6`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`torch`	`2.9.0+cpu`
`CANN`	`8.5.1`
`SOC`	`ascend910_9391`
`Python`	`3.11.14`

NPU: 1 logical device (Ascend910, 64GB HBM)
Model path: /opt/atomgit/Llama-3.2-3B
Service port: 8000

3. Model Configuration

{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 3072,
  "intermediate_size": 8192,
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "head_dim": 128,
  "max_position_embeddings": 131072,
  "rope_theta": 500000.0,
  "vocab_size": 128256,
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
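
As a rough illustration of what these fields imply, the sketch below derives the GQA group size and the per-token KV cache footprint from the values above. The helper names are hypothetical, and the KV cache figure assumes an unquantized BF16 cache with no padding or block overhead, which may not match what vLLM-Ascend actually allocates.

```python
# Values copied from the model configuration above.
config = {
    "hidden_size": 3072,
    "num_attention_heads": 24,
    "num_hidden_layers": 28,
    "num_key_value_heads": 8,
    "head_dim": 128,
}

BYTES_PER_BF16 = 2  # bfloat16 stores each value in 2 bytes


def gqa_group_size(cfg):
    """Number of query heads that share one key/value head."""
    return cfg["num_attention_heads"] // cfg["num_key_value_heads"]


def kv_cache_bytes_per_token(cfg, bytes_per_value=BYTES_PER_BF16):
    """Bytes of K and V stored per token, summed over all layers.

    Assumes an unquantized cache with no padding (an idealized estimate).
    """
    # Factor 2 covers the separate K and V tensors in each layer.
    per_layer = 2 * cfg["num_key_value_heads"] * cfg["head_dim"] * bytes_per_value
    return per_layer * cfg["num_hidden_layers"]


if __name__ == "__main__":
    print("query heads per KV head:", gqa_group_size(config))
    print("KV cache per token (KiB):", kv_cache_bytes_per_token(config) / 1024)
```

With 8 KV heads instead of 24, the cache under these assumptions is one third of what full multi-head attention would need at the same head dimension.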

4. Deployment Guide

4.1 Download the Model Weights

modelscope download --model LLM-Research/Llama-3.2-3B --local_dir /opt/atomgit/Llama-3.2-3B

4.2 Environment Variables

export ASCEND_VISIBLE_DEVICES=0,1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1

4.3 Start the Inference Service

vllm serve /opt/atomgit/Llama-3.2-3B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --seed 1024 \
  --served-model-name llama-3.2-3b \
  --max-model-len 4096 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --dtype bfloat16

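Notes:

A single device (TP=1) is enough for the 3B model, which occupies about 6.5GB of HBM.
ASCEND_VISIBLE_DEVICES must be set to logical device indices (starting from 0), not physical chip IDs.
max-model-len can be adjusted as needed; the model supports up to 128K tokens of context.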
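
The 6.5GB figure is close to what the parameter count alone suggests. The sketch below is a back-of-the-envelope estimate, not a measurement: it multiplies the stated 3.21B parameters by 2 bytes for BF16, and computes a worst-case KV cache size for the --max-model-len and --max-num-seqs values used above. The function names are hypothetical, and the KV estimate assumes an unquantized BF16 cache with every sequence at full length.

```python
PARAMS = 3.21e9      # parameter count stated in the introduction
BYTES_PER_BF16 = 2   # bfloat16 stores each value in 2 bytes


def weight_bytes(num_params, bytes_per_param=BYTES_PER_BF16):
    """Memory needed for the weights alone, ignoring runtime overhead."""
    return num_params * bytes_per_param


def worst_case_kv_bytes(max_model_len, max_num_seqs, layers=28,
                        kv_heads=8, head_dim=128,
                        bytes_per_value=BYTES_PER_BF16):
    """KV cache size if every concurrent sequence reaches max_model_len.

    Layer and head counts default to the model configuration in section 3.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * max_model_len * max_num_seqs


if __name__ == "__main__":
    print("weights (GB):", weight_bytes(PARAMS) / 1e9)
    print("worst-case KV cache (GiB):", worst_case_kv_bytes(4096, 16) / 2**30)
```

Under these assumptions the weights come to about 6.42 GB and the worst-case KV cache for 16 sequences of 4096 tokens to 7 GiB, both far below the 64GB of HBM on one device.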

5. Using the Inference Script

inference.py supports three modes:

5.1 Inference via the API

python inference.py --api http://127.0.0.1:8000 --prompt "The capital of France is"

5.2 Batch Inference

echo -e "What is AI?\nExplain quantum computing." > prompts.txt
python inference.py --api http://127.0.0.1:8000 --prompts prompts.txt --output results.json

5.3 Starting the Service

python inference.py --serve --model_path /opt/atomgit/Llama-3.2-3B --port 8000

6. Smoke Test

# Check that the model is loaded
curl -sf http://127.0.0.1:8000/v1/models

# Inference test
curl -sf http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-3b","prompt":"The capital of France is","max_tokens":32,"temperature":0}'

Results:

/v1/models returns 200; model llama-3.2-3b is loaded.
/v1/completions returns 200 with the output: Paris. It is the largest city in France and the capital of the Île-de-France region.
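
The same inference check can be scripted. The following is a minimal sketch using only the Python standard library; it is not the project's inference.py, and the function names are hypothetical. It sends the same JSON body as the curl command above and reads the continuation from choices[0].text, the field used by the OpenAI-compatible completions response.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),  # a request body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def complete(base_url, model, prompt, timeout=30, **kwargs):
    """Send the request and return the generated continuation text."""
    req = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the service from section 4.3 running, complete("http://127.0.0.1:8000", "llama-3.2-3b", "The capital of France is") should return a continuation similar to the one shown above.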

7. Accuracy Evaluation

Evaluation command

python scripts/eval_accuracy.py --api http://127.0.0.1:8000 --output scripts/accuracy_results.json

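Evaluation configuration

Parameter	Value
Dataset	MMLU
Evaluation mode	5-shot
Total samples	14,042
Subjects	57
Temperature	0

Results

Metric	NPU result	Official baseline (GPU)	Difference
Macro Avg Accuracy	58.01%	58.0%	+0.01 pp
Micro Avg Accuracy	56.47%	-	-

Per-subject details (excerpt)

Subject	Correct/Total	Accuracy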
abstract_algebra	39/100	39.0%
anatomy	80/135	59.3%
astronomy	91/152	59.9%
business_ethics	58/100	58.0%
clinical_knowledge	169/265	63.8%
college_biology	97/144	67.4%
computer_security	71/100	71.0%
high_school_geography	144/198	72.7%
high_school_government_and_politics	149/193	77.2%
high_school_psychology	419/545	76.9%
marketing	188/234	80.3%
miscellaneous	591/783	75.5%
sociology	151/201	75.1%
us_foreign_policy	82/100	82.0%
world_religions	128/171	74.9%

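Conclusion: on the NPU, the MMLU 5-shot macro-average accuracy is 58.01%, only +0.01 percentage points from the official GPU baseline of 58.0% and well within the 1-percentage-point accuracy threshold, so validation passes.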
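
The two averages in the results table weight subjects differently, which is why they disagree. The sketch below shows the usual definitions and the pass check on three rows taken from the per-subject table. It is an illustration under those standard definitions, not the code in scripts/eval_accuracy.py, and the function names are hypothetical.

```python
def macro_average(per_subject):
    """Mean of per-subject accuracies; every subject counts equally."""
    return sum(c / t for c, t in per_subject.values()) / len(per_subject)


def micro_average(per_subject):
    """Total correct over total questions; larger subjects weigh more."""
    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    return correct / total


def within_threshold(npu_pct, baseline_pct, threshold_pp=1.0):
    """Pass if the absolute gap is below the threshold, in percentage points."""
    return abs(npu_pct - baseline_pct) < threshold_pp


# Three rows from the per-subject table above: (correct, total).
subset = {
    "abstract_algebra": (39, 100),
    "us_foreign_policy": (82, 100),
    "miscellaneous": (591, 783),
}

if __name__ == "__main__":
    print("macro:", macro_average(subset))
    print("micro:", micro_average(subset))
    print("pass:", within_threshold(58.01, 58.0))
```

On this subset the large miscellaneous subject pulls the micro average toward its own accuracy, while the macro average treats all three subjects equally.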

8. Performance Evaluation

Evaluation command

python scripts/eval_performance.py --api http://127.0.0.1:8000 \
  --input_len 128 --output_len 128 --concurrency 4 --num_requests 20 \
  --output scripts/perf_results.json

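Evaluation configuration

Parameter	Value
Input length	~128 tokens
Output length	128 tokens
Concurrency	4
Requests	20

Results

Metric	Value
Request throughput	2.132 req/s
Output throughput	272.9 tok/s
Total token throughput	454.12 tok/s
Mean latency	1875.06 ms
Median latency	1916.95 ms
P99 latency	1967.61 ms
Mean TPOT	14.65 ms/token
Success rate	20/20 (100%)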
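
Several of these figures can be cross-checked against each other. The sketch below recomputes output throughput, per-token latency, and effective concurrency from the reported numbers. The formulas are common conventions and an assumption here, since scripts/eval_performance.py is not shown; in particular, TPOT is approximated as mean latency divided by output tokens, which ignores time to first token.

```python
# Figures copied from the configuration and results tables above.
OUTPUT_LEN = 128              # output tokens per request
REQUEST_THROUGHPUT = 2.132    # requests per second
MEAN_LATENCY_MS = 1875.06     # mean end-to-end latency per request


def output_token_throughput(req_per_s, tokens_per_request):
    """Output tokens generated per second across all requests."""
    return req_per_s * tokens_per_request


def per_token_latency_ms(latency_ms, output_tokens):
    """Approximate time per output token (assumes negligible first-token delay)."""
    return latency_ms / output_tokens


def effective_concurrency(req_per_s, latency_ms):
    """Little's law: requests in flight = throughput x time in system."""
    return req_per_s * latency_ms / 1000.0


if __name__ == "__main__":
    print(output_token_throughput(REQUEST_THROUGHPUT, OUTPUT_LEN))
    print(per_token_latency_ms(MEAN_LATENCY_MS, OUTPUT_LEN))
    print(effective_concurrency(REQUEST_THROUGHPUT, MEAN_LATENCY_MS))
```

Under these assumptions the recomputed values round to the reported 272.9 tok/s and 14.65 ms/token, and the effective concurrency comes out close to the configured value of 4.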

9. File List

Llama-3.2-3B-NPU/
├── README.md                    # This deployment document
├── inference.py                 # Inference script (API/local/serve modes)
├── scripts/
│   ├── eval_accuracy.py         # MMLU accuracy evaluation source
│   ├── eval_performance.py      # Performance evaluation source
│   ├── accuracy_results.json    # Accuracy evaluation results
│   └── perf_results.json        # Performance evaluation results
└── logs/
    ├── smoke_test.log           # Smoke test log
    ├── accuracy_eval.log        # Accuracy evaluation run log
    └── performance_eval.log     # Performance evaluation run log

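10. Notes

Device numbering: ASCEND_VISIBLE_DEVICES must use logical device indices starting from 0. For example, if the system reports physical Chip IDs 4/5, set it to 0,1. A wrong device index causes an aclInit error (error code 107001).
Architecture compatibility: LlamaForCausalLM is natively supported by vLLM-Ascend 0.18.0; no extra adaptation code or operator replacement is needed.
Memory footprint: the 3B BF16 model occupies about 6.5GB of HBM, leaving ample headroom on a single Ascend910 device (64GB). Use --gpu-memory-utilization to control the preallocation fraction.
Base model: this validation uses the pretrained base model (not the Instruct variant), so it produces text continuations. For chat scenarios, meta-llama/Llama-3.2-3B-Instruct is recommended.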
