deepset/tinyroberta-squad2 on Ascend 910B3

1. Introduction

This document records the migration, inference deployment, and accuracy evaluation of deepset/tinyroberta-squad2 on the Ascend 910B3 NPU.

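tinyroberta-squad2 is a RoBERTa-based extractive question answering model fine-tuned on the SQuAD 2.0 dataset, with about 81.53M parameters (6 Transformer layers, hidden size 768). The model takes a context passage and a question as input and outputs the start and end positions of the answer within the context.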

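The adaptation work covers:

Verifying the correctness of question answering inference on the NPU (Ascend 910B3)
Comparing NPU and CPU output accuracy, with a required error of < 1%
Providing a ready-to-use NPU inference script, inference.py
Providing an accuracy and performance evaluation script, eval.py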

2. Verification Environment

Component	Version
`Python`	`3.9.13`
`torch`	`2.8.0+cpu`
`torch_npu`	`2.8.0.post4`
`transformers`	`4.57.6`
`numpy`	`1.24.4`

NPU: Ascend 910B3 × 8 logical cards
Driver version: 25.5.2

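3. Model Adaptation and Deployment

3.1 Adaptation Notes

tinyroberta-squad2 uses the standard RoBERTa architecture, which the transformers library supports natively. NPU adaptation requires no changes to the model structure or weights; only the model and the input tensors need to be moved to the NPU device.

The verified adaptation flow: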

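import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Example inputs (taken from the smoke test in section 4)
question = "Who created Python?"
context = "Python was created by Guido van Rossum in 1991."

# Load the unmodified weights, then move the model to the NPU
model = AutoModelForQuestionAnswering.from_pretrained("deepset/tinyroberta-squad2")
model = model.npu()
model.eval()

tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")
inputs = tokenizer(question, context, return_tensors="pt")
# Move every input tensor to the NPU as well
inputs = {k: v.npu() for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    # Most likely start/end token positions of the answer span
    start = outputs.start_logits.cpu().argmax()
    end = outputs.end_logits.cpu().argmax()
    answer = tokenizer.decode(inputs["input_ids"][0][start:end+1])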
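
The flow above assumes an NPU is present. For a script that should also run on CPU-only hosts, device selection might look like the following minimal sketch. The helper name pick_device and the "auto" default are hypothetical, not taken from inference.py; torch.npu.is_available() is the availability check that torch_npu adds to torch.

```python
def pick_device(requested: str = "auto") -> str:
    """Return the requested device, or NPU if usable, else CPU (hypothetical helper)."""
    if requested != "auto":
        return requested
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the "npu" backend)

        if torch.npu.is_available():
            return "npu:0"
    except ImportError:
        # torch or torch_npu is not installed on this host: fall back to CPU.
        pass
    return "cpu"


device = pick_device()
print(device)
```

With a device string in hand, model.to(device) and v.to(device) can stand in for the .npu() calls, so the same code path serves both NPU and CPU runs.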

3.2 Environment Setup

pip install torch torch_npu transformers -i https://repo.huaweicloud.com/repository/pypi/simple/
export HF_ENDPOINT=https://hf-mirror.com

3.3 Using the Inference Script

# NPU inference
python inference.py --context "Python was created by Guido van Rossum in 1991." --question "Who created Python?"

# CPU inference
python inference.py --context "..." --question "..." --device cpu

# Batch inference (JSON file)
python inference.py --input samples.json --output results.json

# Interactive mode
python inference.py --interactive

4. Smoke Test

python inference.py --context "Python was created by Guido van Rossum in 1991." --question "Who created Python?"

Expected output:

QA Results (NPU)

--- QA #1 ---
Q: Who created Python?
A: Guido van Rossum

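5. Performance Reference

Test conditions: batch_size=8, max_seq_len=512, float32 precision, averaged over 20 consecutive runs.

Metric	CPU	NPU (Ascend 910B3)
Average inference time (8 QA pairs)	697.79 ms	11.03 ms
Average time per QA pair	87.22 ms	1.38 ms
Speedup	1x	63.28x
Parameters	81.53M	81.53M
Model size	311.0 MB	311.0 MB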
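
As a rough illustration of how such numbers can be obtained, the sketch below averages wall-clock time over repeated calls and then recomputes the derived rows of the table from the two measured batch times. The helper mean_latency_ms and its warm-up count are assumptions, not the contents of eval.py.

```python
import time
from typing import Callable


def mean_latency_ms(run_once: Callable[[], object], runs: int = 20, warmup: int = 3) -> float:
    """Average wall-clock time of run_once() over `runs` calls, in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up calls are excluded from the average (assumption)
    start = time.perf_counter()
    for _ in range(runs):
        # On an NPU, run_once should wait for the device to finish
        # (for example via torch.npu.synchronize()) before returning.
        run_once()
    return (time.perf_counter() - start) * 1000.0 / runs


# Derived rows of the table, recomputed from the rounded batch times.
cpu_ms, npu_ms, batch_size = 697.79, 11.03, 8
per_pair_cpu = cpu_ms / batch_size
per_pair_npu = npu_ms / batch_size
speedup = cpu_ms / npu_ms
print(f"{per_pair_cpu:.2f} ms, {per_pair_npu:.2f} ms, {speedup:.2f}x")
```

The rounded inputs give a speedup of about 63.26x; the 63.28x in the table was presumably computed from the unrounded timings.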

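6. Accuracy Evaluation

Evaluation Method

Load the model on the CPU and run inference to obtain the reference outputs (start/end logits)
Load the same weights on the NPU and run inference to obtain the NPU outputs
Compare the two sets of outputs and compute several accuracy metrics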
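
The comparison step can be sketched in plain Python as follows. The metric definitions are assumptions chosen to match the names in the results table (in particular, the probability differences are taken as absolute gaps between softmax probabilities, in percent); eval.py may define them differently and most likely operates on tensors rather than lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compare_logits(ref, out):
    """Compare reference (CPU) logits with device (NPU) logits for one sequence."""
    diffs = [a - b for a, b in zip(ref, out)]
    dot = sum(a * b for a, b in zip(ref, out))
    norm_ref = math.sqrt(sum(a * a for a in ref))
    norm_out = math.sqrt(sum(b * b for b in out))
    prob_diffs = [abs(p - q) for p, q in zip(softmax(ref), softmax(out))]
    return {
        "mse": sum(d * d for d in diffs) / len(diffs),
        "max_abs_err": max(abs(d) for d in diffs),
        "cosine": dot / (norm_ref * norm_out),
        # Assumed definition: absolute softmax-probability gap, in percent.
        "prob_mean_diff_pct": 100.0 * sum(prob_diffs) / len(prob_diffs),
        "prob_max_diff_pct": 100.0 * max(prob_diffs),
        # True when both devices pick the same most likely position.
        "argmax_match": ref.index(max(ref)) == out.index(max(out)),
    }


# Toy logits standing in for one sequence of CPU and NPU outputs (made-up values).
cpu_logits = [1.0, 5.0, 2.0]
npu_logits = [1.001, 4.998, 2.002]
metrics = compare_logits(cpu_logits, npu_logits)
print(metrics)
```

Applying compare_logits separately to the start and end logits of each QA pair gives one row of metrics per output, and the argmax check corresponds to span agreement.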

Evaluation Results

Evaluated on 8 question-answer pairs:

Metric	Start Logits	End Logits	Requirement	Result
MSE	1.58e-6	1.60e-6	-	-
Max Absolute Error	5.12e-3	6.44e-3	-	-
Cosine Similarity	1.00000024	1.00000000	> 0.99	✓
Prob Mean Diff	0.000105%	0.000072%	< 1%	✓ PASS
Prob Max Diff	0.014085%	0.008792%	< 1%	✓
Span Agreement	8/8 (100%)	8/8 (100%)	> 99%	✓

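Conclusion: the NPU accuracy error meets the requirement (< 1%), and the NPU inference results are consistent with the CPU results on all 8 question-answer pairs.

See eval_log.txt for the detailed evaluation log.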

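7. Notes

Weight files: local .safetensors / .bin files are used directly, with no modification required
Device selection: the scripts auto-detect the NPU by default and fall back to CPU if it is unavailable
Input length: the maximum sequence length per inference defaults to 512 tokens (adjustable via --max_length)
Answer truncation: if start_idx > end_idx, the script automatically swaps them to ensure a valid range
torch_npu version: make sure it matches the installed torch version
Single card: only one NPU card is currently used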
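
The answer-range note can be illustrated with a small sketch. The function name select_span is hypothetical and the logic is only an assumed reading of the swap rule described above, not the actual code in inference.py.

```python
def select_span(start_logits, end_logits):
    """Pick the most likely start/end token indices, swapping them if inverted."""
    start_idx = max(range(len(start_logits)), key=start_logits.__getitem__)
    end_idx = max(range(len(end_logits)), key=end_logits.__getitem__)
    if start_idx > end_idx:
        # Independent argmaxes can yield an inverted range; swap to keep it valid.
        start_idx, end_idx = end_idx, start_idx
    return start_idx, end_idx


# Made-up logits: the second call produces an inverted range that gets swapped.
normal = select_span([0.1, 2.0, 0.3], [0.2, 0.1, 3.0])
inverted = select_span([0.1, 0.2, 3.0], [2.0, 0.1, 0.3])
print(normal, inverted)
```

The returned pair is inclusive on both ends, so the answer tokens are token_ids[start_idx:end_idx + 1].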