This document describes how to quickly deploy Skywork/Skywork-Reward-V2-Qwen3-8B on an Ascend NPU and records the verification results.
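Skywork-Reward-V2-Qwen3-8B is a Bradley-Terry reward model built on Qwen3-8B. It is mainly used to assess the quality of a conversation and produce a corresponding reward score. The model was trained on 26 million preference pairs and achieves state-of-the-art results on several benchmarks, including RewardBench.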
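Under a Bradley-Terry model, only the difference between two reward scores matters. The probability that response A is preferred over response B is the sigmoid of `r_A - r_B`. The standard pairwise training loss is `-log sigmoid(r_chosen - r_rejected)`. The plain-Python sketch below illustrates both formulas. It is not taken from Skywork's training code, and the function names are hypothetical.

```python
import math


def preference_probability(score_chosen: float, score_other: float) -> float:
    """Bradley-Terry: P(chosen preferred over other) = sigmoid(r_chosen - r_other)."""
    diff = score_chosen - score_other
    # Numerically stable sigmoid: avoid exp() of a large positive number
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)
    return z / (1.0 + z)


def bt_pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Standard BT loss -log sigmoid(r_chosen - r_rejected), computed as a stable softplus."""
    x = -(score_chosen - score_rejected)
    # softplus(x) = log(1 + e^x) = max(x, 0) + log1p(e^-|x|)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Example: compare two reward scores returned by the model (values are illustrative)
print(preference_probability(2.0, -1.0))
print(bt_pairwise_loss(2.0, -1.0))
```

Because only score differences enter these formulas, a single reward score is best read relative to other responses to the same prompt, not as an absolute quality measure.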
Weights download (ModelScope): https://modelscope.cn/models/Skywork/Skywork-Reward-V2-Qwen3-8B
| Component | Version |
|---|---|
| NPU | Ascend910 |
| PyTorch | 2.9.0 |
| torch-npu | 2.9.0.post1+gitee7ba04 |
| transformers | 5.8.1 |
| Python | 3.11.14 |
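To confirm that a machine matches the environment in the version table, you can compare the installed package versions against it with `importlib.metadata`. This is a minimal sketch. It assumes torch_npu is installed under the PyPI distribution name `torch-npu`. Builds with a local version suffix (for example `+cpu`) will be reported as a mismatch even when the base version is the same.

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Expected versions copied from the environment table; keys are PyPI distribution names
# (assumption: torch_npu is installed as "torch-npu"), plus a special "python" entry.
EXPECTED = {
    "python": "3.11.14",
    "torch": "2.9.0",
    "torch-npu": "2.9.0.post1+gitee7ba04",
    "transformers": "5.8.1",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    if name == "python":
        return platform.python_version()
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED):
    """Map each name to (expected, found, exact_match)."""
    report = {}
    for name, want in expected.items():
        got = installed_version(name)
        report[name] = (want, got, got == want)
    return report


for name, (want, got, ok) in check_versions().items():
    print(f"{name:14s} expected {want:24s} found {got} {'OK' if ok else 'MISMATCH'}")
```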
Resources: 1 logical card

Model path: /opt/atomgit/model_adapt/4_Skywork_Skywork-Reward-V2-Qwen3-8B/model/Skywork/Skywork-Reward-V2-Qwen3-8B

Because this is a reward model, it can be run directly with HuggingFace Transformers; no vLLM service is required.
Inference command:

```bash
python3 inference.py
```

Core code of the inference script:
```python
import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "model/Skywork/Skywork-Reward-V2-Qwen3-8B"  # local weights directory
device = "npu:0"

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    num_labels=1,  # single scalar reward head
).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build the conversation and get the reward score
prompt = "What is 2+2?"
response = "2+2=4"
conv = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
formatted = tokenizer.apply_chat_template(conv, tokenize=False)
# Strip the template's leading BOS so it is not duplicated when the text is tokenized again
if tokenizer.bos_token is not None and formatted.startswith(tokenizer.bos_token):
    formatted = formatted[len(tokenizer.bos_token):]
inputs = tokenizer(formatted, return_tensors="pt").to(device)
with torch.no_grad():
    score = model(**inputs).logits[0][0].item()
print(f"Reward score: {score:.4f}")
```

Smoke test command:

```bash
python3 -c "
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch, torch_npu
model_path = 'model/Skywork/Skywork-Reward-V2-Qwen3-8B'
device = 'npu:0'
model = AutoModelForSequenceClassification.from_pretrained(model_path, dtype=torch.bfloat16, num_labels=1).to(device)
tok = AutoTokenizer.from_pretrained(model_path)
prompt = 'What is 2+2?'
resp = '2+2=4'
conv = tok.apply_chat_template([{'role':'user','content':prompt},{'role':'assistant','content':resp}], tokenize=False)
if tok.bos_token and conv.startswith(tok.bos_token):
    conv = conv[len(tok.bos_token):]
inp = tok(conv, return_tensors='pt').to(device)
with torch.no_grad():
    s = model(**inp).logits[0][0].item()
print(f'Reward score: {s:.4f}')
print('Smoke test PASSED')
"
```

Verification result: the model loaded correctly onto the Ascend910 NPU and produced normal inference output.
Performance test conditions: single-turn inference, bfloat16 precision, 50 iterations.
| Metric | Value |
|---|---|
| Mean latency | 42.79 ms |
| P50 latency | 42.76 ms |
| P99 latency | 43.75 ms |
| Throughput | 23.37 inferences/s |
| Total test time | 2.14 s |
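The benchmark script itself is not shown here. The sketch below shows one way to produce metrics of this shape: time `iters` calls with `time.perf_counter`, skip a few warm-up calls, then report the mean, nearest-rank P50/P99, throughput, and total time. The function names, the warm-up count, and the nearest-rank percentile method are assumptions, not the method behind the numbers above. On an NPU, pass a device synchronization callable as `sync` so that the timer does not stop before queued work finishes.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Optional


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile, q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(fn: Callable[[], object], iters: int = 50, warmup: int = 5,
              sync: Optional[Callable[[], None]] = None) -> List[float]:
    """Call fn `iters` times and return each call's latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are run but not timed
        fn()
    if sync is not None:
        sync()
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for outstanding device work before stopping the clock
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Aggregate per-call latencies into the metrics reported in the table."""
    total_s = sum(latencies_ms) / 1000.0
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_per_s": len(latencies_ms) / total_s,
        "total_s": total_s,
    }
```

Usage: with a hypothetical `score_once()` that wraps the tokenized forward pass from the inference code, something like `summarize(benchmark(score_once, iters=50, sync=torch.npu.synchronize))`. For sequential single-request runs, throughput is simply `iters / total_s`. That is why 50 calls at about 42.79 ms each give roughly 23.37 inferences/s and about 2.14 s in total.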
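Notes:

- The model is a `Qwen3ForSequenceClassification` (reward model), not a text-generation model, so it does not support the vLLM chat completions API.
- When building the input with `apply_chat_template`, strip the leading `bos_token` from the formatted text; otherwise the BOS token may be duplicated when the text is tokenized again.

NPU vs CPU accuracy comparison (CPU is the baseline, NPU is the target under verification):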
| Metric | Value |
|---|---|
| Test cases | 2 |
| Prediction agreement | 2/2 (100.0%) |
| Accuracy criterion | Max logits error, NPU vs CPU < 1.0% |
| Result | ✅ Pass (accuracy 100.0%) |
The accuracy evaluation source code and logs are in the eval/ directory.
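The evaluation in eval/ is not reproduced here. As a rough illustration of the "max logits error < 1.0%" criterion, the sketch below compares per-case reward logits from the NPU run against the CPU baseline using relative error. It makes two assumptions that this document does not confirm. First, a test case "agrees" when its relative error is below the tolerance. Second, the reported accuracy is the fraction of agreeing cases. The function name is hypothetical.

```python
from typing import Sequence


def compare_logits(cpu: Sequence[float], npu: Sequence[float],
                   tol: float = 0.01, eps: float = 1e-6) -> dict:
    """Compare NPU reward logits against the CPU baseline, case by case."""
    if len(cpu) != len(npu) or not cpu:
        raise ValueError("need the same, non-zero number of CPU and NPU scores")
    # Relative error per case; eps guards against a CPU score of exactly 0
    rel_errors = [abs(n - c) / max(abs(c), eps) for c, n in zip(cpu, npu)]
    agreed = sum(err < tol for err in rel_errors)
    return {
        "cases": len(rel_errors),
        "agreed": agreed,
        "accuracy": agreed / len(rel_errors),
        "max_rel_error": max(rel_errors),
        "passed": max(rel_errors) < tol,
    }
```

The CPU baseline scores would come from the same inference code run with `device = "cpu"`. The document does not specify which precision the CPU baseline used.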