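This document records the quick-deployment and verification results for chad9291/qwen2.5-0.5b-gpu2 on an Ascend NPU (Ascend910) environment.

The model is a Qwen2.5 0.5B text-generation model. It is built on the HuggingFace transformers framework and can be loaded for inference in a few lines.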
Related download links:

| Component | Version |
|---|---|
| torch | 2.1.0 |
| torch_npu | 2.1.0 |
| transformers | >=4.37.0 |
| CANN | 8.5.RC1 |
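
Before running the example, it can help to confirm that the installed Python packages match the table above. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names passed to `importlib.metadata` are assumed to match the names in the table. CANN is a system toolkit rather than a Python package, so it is not covered here.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the version table; assumed to match the
# distribution names registered with pip.
REQUIRED = ("torch", "torch_npu", "transformers")


def installed_versions(packages=REQUIRED):
    """Return {package name: installed version string, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing the printed versions against the table by eye is usually enough for a quick check.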
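Install the dependencies:

```bash
pip install transformers torch
```

Load the model and run inference:

```python
import torch
import torch_npu  # registers the torch.npu backend; torch.npu is unavailable without it
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the first NPU if one is available, otherwise fall back to CPU.
device = torch.device("npu:0" if torch.npu.is_available() else "cpu")
model_name = "chad9291/qwen2.5-0.5b-gpu2"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
model = model.to(device).eval()

# Build a chat-formatted prompt and tokenize it.
messages = [{"role": "user", "content": "Explain what deep learning is in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding; gradients are not needed for inference.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(f"Output: {response}")
```

Numerical consistency of NPU vs. CPU logits:
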
| Metric | Value |
|---|---|
| Top-1 agreement | 4/4 |
| Max Logit Diff Ratio | 2.6e-05 |
| Avg KL Divergence | 0.0 |
| Result | PASS |
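
The table does not spell out how these metrics were computed, so the following is only a sketch of one plausible definition, written in plain Python for readability. It assumes that Top-1 agreement counts positions where the argmax matches, that "Max Logit Diff Ratio" is the largest absolute difference divided by the largest absolute reference logit, and that the KL divergence is taken between softmax distributions and averaged over positions. The function name `compare_logits` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compare_logits(ref_rows, test_rows, eps=1e-12):
    """Compare logits row by row: ref (e.g. CPU) against test (e.g. NPU)."""
    top1_matches = 0
    max_abs_diff = 0.0
    max_abs_ref = 0.0
    kl_sum = 0.0
    for ref, test in zip(ref_rows, test_rows):
        # Top-1: does the highest-scoring token index agree?
        if ref.index(max(ref)) == test.index(max(test)):
            top1_matches += 1
        # Track the largest element-wise gap and the largest reference magnitude.
        max_abs_diff = max(max_abs_diff, max(abs(a - b) for a, b in zip(ref, test)))
        max_abs_ref = max(max_abs_ref, max(abs(a) for a in ref))
        # KL(ref || test) between the two softmax distributions.
        p, q = softmax(ref), softmax(test)
        kl_sum += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    n = len(ref_rows)
    return {
        "top1": f"{top1_matches}/{n}",
        "max_diff_ratio": max_abs_diff / (max_abs_ref + eps),  # assumed definition
        "avg_kl": kl_sum / n,
    }


# Toy data standing in for real CPU and NPU logits.
cpu_logits = [[2.0, 1.0, 0.1], [0.3, 0.2, 3.0]]
npu_logits = [[2.0001, 1.0, 0.1], [0.3, 0.2001, 3.0]]
print(compare_logits(cpu_logits, npu_logits))
```

With real model outputs the same comparison would typically be done on tensors, for example on `logits` returned by a forward pass on each device after casting to float32.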

Inference performance:

| Metric | Value |
|---|---|
| Hardware | Ascend 910B |
| Average inference time | 1722.85 ms |
| Test conditions | generate 64 tokens, fp16 |
| Runs | 5 |
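
The measurement script is not part of this document, so the following is a hedged sketch of how an average latency like the one above could be obtained. The helper names `average_latency_ms` and `tokens_per_second` are hypothetical, and the single warm-up call is an assumption, since the table only states 5 runs.

```python
import time


def average_latency_ms(fn, runs=5, warmup=1, sync=None):
    """Average wall-clock time of fn() in milliseconds over `runs` timed calls.

    `sync` is an optional callable that blocks until queued device work is
    finished, so asynchronous kernels are included in the timing.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    if sync is not None:
        sync()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device before stopping the clock
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


def tokens_per_second(avg_ms, new_tokens):
    """Convert an average latency and a token count into throughput."""
    return new_tokens / (avg_ms / 1000.0)


if __name__ == "__main__":
    # Stand-in workload; replace with a call to model.generate(...).
    print(average_latency_ms(lambda: sum(range(100_000))))
```

With the model and inputs from the inference example, `fn` could be `lambda: model.generate(**inputs, max_new_tokens=64, do_sample=False)`, and `sync` could be `torch.npu.synchronize` if the installed torch_npu provides it. If every run produced all 64 tokens, 1722.85 ms corresponds to roughly 26.9 ms per token, or about 37 tokens per second.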

Note: both the tokenizer and the model are loaded with `trust_remote_code=True`.