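Llama-3.1-8B-Instruct is Meta's open-weight, instruction-tuned large language model with 8 billion parameters. It uses a Transformer decoder architecture (LlamaForCausalLM), was trained in bfloat16, and supports a context length of 131,072 tokens (128K).

This project adapts and validates Llama-3.1-8B-Instruct on Huawei Ascend NPUs, using the vLLM-Ascend inference framework.

Original weights: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct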

| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | 8B |
| Hidden size | 4096 |
| Layers | 32 |
| Attention heads | 32 (Query) / 8 (KV) |
| Attention mechanism | GQA (Grouped Query Attention) |
| Intermediate (MLP) size | 14336 |
| Maximum sequence length | 131072 |
| Vocabulary size | 128256 |
| Precision | bfloat16 |
| RoPE theta | 500000.0 |
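
The "8B" figure can be cross-checked against the dimensions in the table. The sketch below is an illustrative calculation, not part of the project code. It assumes three things the table does not state: the per-head dimension equals hidden size divided by the number of query heads, the linear layers have no bias terms, and the input embedding and output head are separate matrices.

```python
# Values taken from the configuration table above.
HIDDEN = 4096
LAYERS = 32
Q_HEADS = 32
KV_HEADS = 8
INTERMEDIATE = 14336
VOCAB = 128256

# Assumption (not in the table): head_dim = hidden size / query heads.
HEAD_DIM = HIDDEN // Q_HEADS


def attention_params() -> int:
    """Weights of the Q, K, V and output projections (assumed bias-free)."""
    q_proj = HIDDEN * (Q_HEADS * HEAD_DIM)
    k_proj = HIDDEN * (KV_HEADS * HEAD_DIM)  # GQA: fewer KV heads than query heads
    v_proj = HIDDEN * (KV_HEADS * HEAD_DIM)
    o_proj = (Q_HEADS * HEAD_DIM) * HIDDEN
    return q_proj + k_proj + v_proj + o_proj


def mlp_params() -> int:
    """Gate, up and down projections of the gated MLP."""
    return 3 * HIDDEN * INTERMEDIATE


def layer_params() -> int:
    """One decoder layer: attention, MLP and two RMSNorm weight vectors."""
    return attention_params() + mlp_params() + 2 * HIDDEN


def total_params() -> int:
    """All layers plus embeddings (assumed untied) and the final norm."""
    embeddings = 2 * VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * layer_params() + embeddings + final_norm


print(f"{total_params():,}")
```

Under these assumptions the total comes to roughly 8.03 billion, consistent with the "8B" entry in the table.
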

| Item | Specification |
|---|---|
| NPU | Ascend910 (A2) |
| NPU count | 1 (single card) |
| HBM | 64 GB |
| CANN | 8.5.1 |
| torch_npu | 2.9.0.post1 |
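
As a rough sanity check that the model fits on a single 64 GB card, the memory needed for weights and KV cache can be estimated from the model dimensions. This is a back-of-the-envelope sketch, not a measurement. The head dimension of 128 is an assumption (4096 / 32), the nominal 8 billion parameter count is used instead of the exact one, and the example arguments mirror the `--max-model-len 4096` and `--max-num-seqs 8` flags used in this README's launch command.

```python
GIB = 1024 ** 3

# Layers and KV heads come from the model table; HEAD_DIM is an assumption.
LAYERS = 32
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bfloat16
NOMINAL_PARAMS = 8_000_000_000  # nominal "8B"; the exact count is slightly higher


def weight_bytes(params: int = NOMINAL_PARAMS) -> int:
    """Memory for the model weights stored in bfloat16."""
    return params * BYTES_PER_VALUE


def kv_bytes_per_token() -> int:
    """One K and one V vector per KV head, per layer, for a single token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_bytes(max_model_len: int, max_num_seqs: int) -> int:
    """KV cache needed if every sequence reaches the maximum length."""
    return kv_bytes_per_token() * max_model_len * max_num_seqs


print(f"weights:  {weight_bytes() / GIB:.1f} GiB")
print(f"KV cache: {kv_bytes(4096, 8) / GIB:.1f} GiB")
```

This is a lower bound only. vLLM normally pre-allocates a KV cache pool from a fraction of device memory, so the usage reported by `npu-smi` while the server is running is expected to be far higher than this estimate.
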

| Component | Version |
|---|---|
| PyTorch | 2.9.0 |
| vLLM | 0.18.0 |
| vLLM-Ascend | 0.18.0rc1 |
| transformers | 4.57.6 |
| Python | 3.11 |

Make sure the components listed above are installed.

# Download from ModelScope (no authentication required)
pip install modelscope
python3 -c "
from modelscope import snapshot_download
snapshot_download(
'LLM-Research/Meta-Llama-3.1-8B-Instruct',
cache_dir='./Llama-3.1-8B-Instruct'
)
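"

Or download from HuggingFace (authentication required):

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./Llama-3.1-8B-Instruct

Start the OpenAI-compatible API server. The path below matches the ModelScope download layout; after a HuggingFace download with the command above, the model directory is ./Llama-3.1-8B-Instruct instead.

python3 -m vllm.entrypoints.openai.api_server \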
--model ./Llama-3.1-8B-Instruct/LLM-Research/Meta-Llama-3___1-8B-Instruct \
--dtype bfloat16 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--max-num-seqs 8 \
--port 8000 \
--trust-remote-code

# List the loaded models
curl http://127.0.0.1:8000/v1/models
# Chat inference (recommended for the Instruct model)
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "./Llama-3.1-8B-Instruct/LLM-Research/Meta-Llama-3___1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0,
"max_tokens": 64
}'
# Text completion inference
curl -s http://127.0.0.1:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "./Llama-3.1-8B-Instruct/LLM-Research/Meta-Llama-3___1-8B-Instruct",
"prompt": "The capital of France is",
"temperature": 0,
"max_tokens": 64
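}'

Alternatively, run the bundled inference script:

python3 inference_npu.py

The following is actual inference output on an Ascend910 NPU, using the real model weights:
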
+------------------------------------------------------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip Phy-ID | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 Ascend910 | OK | 166.3 45 0 / 0 |
| 0 0 | 0000:0A:00.0 | 0 0 / 0 59678/ 65536 |
+===========================+===============+====================================================+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| 0 0 | 32596 | VLLMEngineCor | 56624 |
+===========================+===============+====================================================+

Request (chat/completions endpoint):

{
"model": "Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0,
"max_tokens": 64
}

Response:

{
"id": "chatcmpl-58f59c7c4e55ecb4",
"choices": [{
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 48, "total_tokens": 56, "completion_tokens": 8}
}

The model correctly answers that the capital of France is Paris, concisely and accurately.

Request (chat/completions endpoint):

{
"model": "Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello! Please introduce yourself briefly."}
],
"temperature": 0.7,
"max_tokens": 128
}

Response:

{
"id": "chatcmpl-0a012de5f5f4cd3e",
"choices": [{
"message": {
"role": "assistant",
"content": "Hello. I'm an AI assistant, here to provide information, answer questions, and help with tasks to the best of my abilities. I'm constantly learning and updating my knowledge to ensure I can assist you effectively. How can I help you today?"
},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 48, "total_tokens": 99, "completion_tokens": 51}
}

The model introduces itself in natural, fluent English and proactively asks how it can help.

Request (chat/completions endpoint):

{
"model": "Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what Python is in 2-3 sentences."}
],
"temperature": 0,
"max_tokens": 128
}

Response:

{
"id": "chatcmpl-a1a3f70aeaedc9be",
"choices": [{
"message": {
"role": "assistant",
"content": "Python is a high-level, interpreted programming language that is widely used for various purposes such as web development, data analysis, artificial intelligence, and more. It is known for its simplicity, readability, and ease of use, making it a popular choice among beginners and experienced developers alike. Python's syntax is concise and intuitive, allowing developers to focus on solving problems rather than getting bogged down in complex code."
},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 53, "total_tokens": 135, "completion_tokens": 82}
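}

The model accurately describes Python's characteristics: a high-level interpreted language, widely used, concise and readable.

Request (completions endpoint):
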
{
"model": "Llama-3.1-8B-Instruct",
"prompt": "The capital of France is",
"temperature": 0,
"max_tokens": 64
}

Response:

{
"id": "cmpl-ac5e2c6ee15bcc2c",
"choices": [{
"text": " a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\nThe city is divided into 20",
"finish_reason": "length"
}],
"usage": {"prompt_tokens": 6, "total_tokens": 70, "completion_tokens": 64}
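}

The model continues the prompt with an accurate, detailed description of Paris, including its reputation for romance, art, fashion, and cuisine, the Eiffel Tower, and the Louvre.
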
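
The chat requests above can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names (`build_chat_payload`, `post_json`, `usage_is_consistent`) are hypothetical and are not taken from `inference_npu.py`. The base URL assumes the host and port from the launch command.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # assumed: host and port from the launch command


def build_chat_payload(model, user_message,
                       system_message="You are a helpful assistant.",
                       temperature=0, max_tokens=64):
    """Build a request body shaped like the chat/completions examples above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def post_json(path, payload, base_url=BASE_URL, timeout=60):
    """POST a JSON payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def usage_is_consistent(usage):
    """Check that total_tokens equals prompt_tokens plus completion_tokens."""
    return usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```

With the server running, `post_json("/chat/completions", build_chat_payload(model_name, "What is the capital of France?"))` returns a dictionary shaped like the responses above. The `model` value has to match a name reported by `/v1/models`, so it is worth querying that endpoint first instead of hard-coding the name.
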
To validate inference accuracy on the Ascend NPU, text completion was run on CPU (transformers) and on NPU (vLLM) with the same model weights and temperature=0 (greedy decoding), and the outputs were compared for consistency.

| Dimension | CPU setup | NPU setup |
|---|---|---|
| Inference framework | transformers 4.57.6 | vLLM 0.18.0 + vllm-ascend |
| Precision | bfloat16 | bfloat16 |
| Hardware | CPU (ARM) | Ascend910 NPU |
| Decoding strategy | greedy (temperature=0) | greedy (temperature=0) |
| max_tokens | 64 | 64 |

| Prompt | CPU output | NPU output | First token matches | Semantically correct |
|---|---|---|---|---|
| "The capital of France is" | "a city of romance, art, fashion, and cuisine. Paris is a must-visit destination..." | "a city of romance, art, fashion, and cuisine. Paris is a must-visit destination..." | ✅ YES | ✅ Both correct |
| "Python is a programming language that" | "is widely used in various fields such as web development, scientific computing..." | "is widely used in various fields such as web development, scientific computing..." | ✅ YES | ✅ Both correct |
| "Artificial intelligence is" | "a rapidly evolving field that has the potential to transform various aspects of our lives..." | "a rapidly evolving field that has the potential to transform various aspects of..." | ✅ YES | ✅ Both correct |

Note: The CPU and NPU runs use different inference frameworks (transformers vs. vLLM) whose generation implementations differ in detail, so token choices later in a sequence may diverge. Both runs use the same bfloat16 precision and model weights; for the three prompts above the first token is identical and the outputs are semantically equivalent. At temperature=0, the vLLM-Ascend output matches the CPU transformers baseline at the first-token level, which supports the numerical reliability of NPU inference on these prompts.
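
A comparison like the one above can be automated by locating the first position at which two generated token sequences differ. The sketch below is one illustrative way to do this, not the script used to produce the table; the function names are hypothetical, and the token IDs in the example are made up for demonstration.

```python
def first_divergence(a, b):
    """Return the index of the first differing position, or None if identical.

    If one sequence is a strict prefix of the other, the length of the
    shorter sequence is returned.
    """
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def first_token_matches(a, b):
    """True when both sequences are non-empty and share the same first token."""
    return bool(a) and bool(b) and a[0] == b[0]


# Hypothetical token IDs, for illustration only.
cpu_tokens = [264, 3363, 315, 30363]
npu_tokens = [264, 3363, 315, 1989]

print(first_token_matches(cpu_tokens, npu_tokens))
print(first_divergence(cpu_tokens, npu_tokens))
```

Reporting the divergence index alongside the first-token check gives a more informative picture than a yes/no column, because it shows how long the two backends stay in agreement.
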

| Feature | Status | Notes |
|---|---|---|
| Chat inference | ✅ | Verified on Ascend910 |
| Text completion | ✅ | Verified on Ascend910 |
| ACLGraph | ✅ | PIECEWISE compilation mode |
| bfloat16 | ✅ | Native precision |
| Single-card inference | ✅ | 1x Ascend910 |
| Multi-card inference (TP) | ✅ | Supported by vLLM-Ascend |
| Quantized inference | ✅ | W8A8/W8A16 optional |

| Operator | Ascend compatibility | Notes |
|---|---|---|
| RMSNorm | ✅ | vllm-ascend provides the optimized AscendRMSNorm implementation |
| SiLU + Gated MLP | ✅ | Hardware-accelerated via torch_npu.npu_swiglu() |
| RoPE | ✅ | vllm-ascend provides an Ascend-optimized RoPE |
| GQA Attention | ✅ | SFA sparse-attention backend |
| KV Cache | ✅ | Ascend PagedAttention |
| No CUDA dependency | ✅ | Uses only Torch/NPU native operators |
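
For readers unfamiliar with the operators in the table, the sketch below spells out the math behind three of them in plain Python. It is a reference for what the operators compute, not the vllm-ascend or torch_npu kernels. The `eps` default and the function names are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: divide by the root mean square of x, then scale per element."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Gated MLP activation: SiLU(gate) multiplied elementwise by up."""
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for_query_head(q_head, num_q_heads=32, num_kv_heads=8):
    """GQA: consecutive groups of query heads share one KV head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


print(rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0))
print(swiglu([0.0, 1.0], [2.0, 2.0]))
print([kv_head_for_query_head(q) for q in (0, 3, 4, 31)])
```

With 32 query heads and 8 KV heads, each KV head serves a group of 4 query heads, which is why the KV cache is a quarter of the size it would be with full multi-head attention.
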
.
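├── README.md # This document
├── inference_npu.py # NPU inference script
├── start_server.sh # One-step launch script
├── .gitignore # Git ignore rules
└── LICENSE

This model is released under the Meta Llama 3.1 Community License. See the LICENSE file for details.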