Llama-3.2-1B-Instruct (Ascend NPU)

A complete adaptation and deployment guide for the Meta Llama 3.2 1B Instruct model on Huawei Ascend NPUs.

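Model Overview

Model name: meta-llama/Llama-3.2-1B-Instruct
Architecture: LlamaForCausalLM (GQA, RoPE llama3)
Parameters: 1B (hidden_size=2048, 16 layers, 32 attention heads, 8 KV heads)
Precision: bfloat16
Maximum sequence length: 131072 tokens
Vocabulary size: 128256
Tasks: dialogue generation, code generation, knowledge Q&A, translation, reasoning, and more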

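Adaptation Summary

Item	Details
NPU platform	Ascend910 (Atlas 800 A2)
CANN version	8.5.1
Inference framework	vLLM 0.18.0 + vLLM-Ascend 0.18.0rc1
torch_npu	2.9.0.post1
Adaptation method	Natively supported; no code changes required
Operator compatibility	All operators are standard PyTorch operators; no CUDA-only dependencies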

Environment Setup

# Base environment (CANN 8.5.1 and torch_npu 2.9.0 must be installed beforehand)
pip install vllm vllm-ascend

# Verify that the NPU is available
npu-smi info
python -c "import torch; import torch_npu; print(torch.npu.device_count())"
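
The same check can be wrapped in a small helper that returns 0 instead of failing on a machine without the Ascend stack. This is a minimal sketch: `npu_device_count` is a hypothetical helper name, and it assumes that importing `torch_npu` registers the `torch.npu` namespace, which is what the one-liner above relies on.

```python
def npu_device_count() -> int:
    """Return the number of visible Ascend NPUs, or 0 if the stack is unavailable."""
    try:
        import torch
        import torch_npu  # noqa: F401  (assumed to register torch.npu on import)
    except Exception:
        # torch, torch_npu, or the underlying CANN libraries are missing or broken.
        return 0
    try:
        return int(torch.npu.device_count())
    except Exception:
        # The namespace exists but the driver could not be queried.
        return 0


if npu_device_count() == 0:
    print("No Ascend NPU detected; check the CANN and torch_npu installation.")
```

A result of 0 here only says that Python cannot see a device; `npu-smi info` remains the authoritative check at the driver level.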

Quick Start

1. Download the model

# Download from HuggingFace
pip install huggingface_hub
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir ./LLM-Research/Llama-3.2-1B-Instruct

# Or download from ModelScope
pip install modelscope
python -c "from modelscope import snapshot_download; \
  snapshot_download('LLM-Research/Llama-3.2-1B-Instruct', cache_dir='.')"

2. Start the inference server

vllm serve ./LLM-Research/Llama-3.2-1B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --port 8000
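
The server needs some time to load weights before it accepts requests, so scripts that launch it usually wait for readiness first. The following is a minimal sketch of such a wait loop. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the sketch assumes the server exposes the `/health` route of vLLM's OpenAI-compatible server on the port chosen above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    delay_s: float = 2.0,
) -> bool:
    """Call `probe` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A typical call would be `wait_until_ready(lambda: http_ok("http://127.0.0.1:8000/health"))`. Passing the probe as a callable keeps the retry logic independent of the network, which makes it easy to exercise without a running server.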

3. Send an inference request

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "./LLM-Research/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 0,
    "max_tokens": 128
  }'

Example response:

{
  "choices": [{
    "message": {
      "content": "Hello! I'm just a computer program, so I don't have feelings, but I'm functioning properly and ready to help you with any questions or tasks you have. How can I assist you today?"
    }
  }],
  "usage": {"prompt_tokens": 37, "completion_tokens": 32, "total_tokens": 69}
}
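
The same request can be issued from Python with only the standard library. This is a minimal sketch, not the contents of `inference.py`: the function names are hypothetical, and the URL, model path, and payload fields simply mirror the curl call and the response shape shown above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"
MODEL = "./LLM-Research/Llama-3.2-1B-Instruct"


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> tuple[str, int]:
    """Pull the assistant text and total token count out of a response body."""
    content = response["choices"][0]["message"]["content"]
    total_tokens = response["usage"]["total_tokens"]
    return content, total_tokens


def chat(prompt: str) -> tuple[str, int]:
    """Send one prompt to the running server and return (text, total_tokens)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server from step 2 running, `chat("Hello, how are you?")` returns the reply text together with the `total_tokens` value from the `usage` field.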

4. Python client

python inference.py --prompt "What is the capital of France?"
# Output: The capital of France is Paris.

python inference.py --interactive  # interactive chat mode
python inference.py --health        # health check

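Accuracy Verification

NPU outputs were compared against a CPU baseline (transformers, bfloat16) using deterministic decoding at temperature=0:

Category	Prompt	Accuracy	Result
Knowledge Q&A	What is the capital of France?	100%	✅ Exact match
Creative writing	Write a haiku about AI	100%	✅ Exact match
Scientific explanation	Explain what DNA is	100%	✅ Exact match
Translation	Translate 'good morning' to Chinese	100%	✅ Exact match
Math	What is 15 + 27?	100%	✅ Exact match
Code generation	Write a Python function named greet	100%	✅ Exact match
Logical reasoning	If all dogs are animals...	100%	✅ Exact match
Summarization	Summarize photosynthesis in one word	100%	✅ Exact match

Token match rate: 100.00% | Error < 1% ✅ PASS

Note: The accuracy evaluation uses deterministic greedy decoding (temperature=0, do_sample=False). See accuracy_report.json for the full report.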
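
For readers who want to reproduce the comparison, a token match rate can be computed position by position over the two token-ID sequences. This is a minimal sketch under stated assumptions: `evaluate_final.py` is not reproduced here, so the function below is one plausible definition of the metric, not necessarily the one the script uses, and `token_match_rate` is a hypothetical name.

```python
from typing import Sequence


def token_match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where two token-ID sequences agree.

    Positions present in only one sequence count as mismatches, so a
    truncated or over-long output lowers the score.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0  # two empty outputs are trivially identical
    matches = sum(1 for ref, cand in zip(reference, candidate) if ref == cand)
    return matches / longest


cpu_ids = [791, 6864, 315, 9822, 374, 12366, 13]  # illustrative IDs only
npu_ids = [791, 6864, 315, 9822, 374, 12366, 13]
print(f"{token_match_rate(cpu_ids, npu_ids):.2%}")  # 100.00%
```

Under greedy decoding a single early mismatch usually changes every later token, so a positional metric like this one tends to be either 100% or far below it.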

Performance Benchmark

Metric	Value
Device	Ascend910 x1
Model load time	~1.0s
Weight memory	2.32 GB
Available KV cache	52.66 GiB (~421 concurrent requests at 4096 tokens each)
Single-request latency (32 tokens)	< 1s
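
The concurrency figure follows from the architecture values listed near the top of this document. The sketch below reproduces the arithmetic. It assumes a head dimension of hidden_size / attention heads, bfloat16 storage for the cache, and it ignores block-level allocation granularity, so the result is an estimate rather than the scheduler's exact number.

```python
# Architecture values from this document.
HIDDEN_SIZE = 2048
NUM_LAYERS = 16
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
BYTES_PER_VALUE = 2  # bfloat16

KV_CACHE_GIB = 52.66  # available KV cache reported in the table


def kv_bytes_per_token() -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # assumed: 2048 / 32 = 64
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * head_dim * BYTES_PER_VALUE


def max_concurrent_requests(tokens_per_request: int) -> int:
    """How many requests of the given length fit in the KV cache pool."""
    total_tokens = int(KV_CACHE_GIB * 2**30) // kv_bytes_per_token()
    return total_tokens // tokens_per_request


print(kv_bytes_per_token())             # 32768 bytes (32 KiB) per token
print(max_concurrent_requests(4096))    # 421
print(max_concurrent_requests(131072))  # 13
```

The last line shows the other end of the range: at the full 131072-token context, the same pool holds roughly 13 requests.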

Project Files

File	Description
`inference.py`	NPU inference client script
`evaluate_final.py`	Accuracy evaluation script (NPU vs CPU)
`accuracy_report.json`	Detailed accuracy evaluation report
`README.md`	This document

Feature Support

Feature	Status
Text chat (Chat Completions)	✅
Streaming output	✅
Chunked Prefill	✅
Prefix Caching	✅
Multimodal (Vision)	N/A (text-only model)
MoE	N/A
Quantization	⚠️ Not yet verified
Tensor Parallel (multi-card)	✅ Supported
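
Streaming output is requested by adding `"stream": true` to the chat completions payload, after which the server replies with server-sent events. The following is a minimal sketch of a parser for that stream. `iter_stream_text` is a hypothetical helper, and the sketch assumes the OpenAI-compatible chunk layout, where each `data:` line carries a JSON object with `choices[0].delta.content` and the stream ends with `data: [DONE]`.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from the raw lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk
        piece = choices[0].get("delta", {}).get("content")
        if piece:
            yield piece


# Canned lines standing in for a real HTTP response body.
sample_lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample_lines)))  # Hello
```

Against a live server, the response object returned by `urllib.request.urlopen` can be passed in directly, since it iterates over the body line by line as bytes.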

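FAQ

Q: An "owner does not match" warning appears at startup. What does it mean?
A: CANN 8.5.1 was installed under the root user, and the current user is a different one. The warning is non-critical and does not affect inference.

Q: How do I adjust KV cache usage?
A: Use the --max-model-len parameter. A smaller value lowers the worst-case KV cache needed per request, so more requests can run concurrently. The total memory that vLLM reserves for the cache pool is controlled separately by --gpu-memory-utilization.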

License

This model is released under the Meta Llama 3.2 Community License. See LICENSE.txt for details.

#+NPU | Adapted for Huawei Ascend NPU | vLLM-Ascend 0.18.0