maple77/bce-reranker-base_v1 on Ascend NPU

1. Introduction

This repository documents the adaptation and validation results for maple77/bce-reranker-base_v1 on the Huawei Ascend910 NPU.

bce-reranker-base_v1 is a Cross-encoder reranking model (Reranker) built on XLM-RoBERTa. It assigns a relevance score to each query-passage pair; a higher score means stronger relevance.

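Model architecture: XLMRobertaForSequenceClassification
Parameters: ~278M
vocab_size: 250002
hidden_size: 768
num_hidden_layers: 12
num_attention_heads: 12
Maximum sequence length: 514 (position-embedding size; the inference code truncates inputs to 512 tokens)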
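
As a sanity check on the parameter count, the figures listed above can be combined into a rough tally. The following is a minimal sketch, not taken from the repository: `intermediate_size=3072`, `type_vocab_size=1` and `num_labels=1` are assumptions based on the standard XLM-RoBERTa base layout and on the `num_labels=1` argument used in the inference code, and the function name `count_params` is hypothetical.

```python
def count_params(vocab_size=250002, hidden=768, layers=12, max_positions=514,
                 intermediate=3072, type_vocab=1, num_labels=1):
    """Estimate the parameter count of an XLM-RoBERTa sequence classifier.

    intermediate, type_vocab and num_labels are assumed values; the other
    defaults are the figures listed in this README.
    """
    # Word, position and token-type embeddings plus one LayerNorm (weight + bias)
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections (with biases) plus one LayerNorm
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections (with biases) plus one LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + 2 * hidden)
    encoder = layers * (attention + feed_forward)
    # Classification head: dense layer followed by the output projection
    head = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)
    return embeddings + encoder + head


if __name__ == "__main__":
    print(f"{count_params() / 1e6:.1f}M parameters")
```

Under these assumptions the tally lands at roughly 278M, in line with the figure above. Most of the weights sit in the 250002-entry embedding table rather than in the 12 encoder layers.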

2. Validation environment

Component	Version
NPU	Ascend910 (2 cards) / 25.5.2
PyTorch	2.x
torch_npu	2.x
transformers	4.35.0+
Python	3.x

NPU card: logical card 7 (Chip 0/1)
Model path: ~/maple77/bce-reranker-base_v1/model/maple77/bce-reranker-base_v1/
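
Before running anything, it can help to confirm that the installed packages match the table above. The sketch below is one possible check, not part of the repository: the helper names are hypothetical, and it assumes each package is registered with `importlib.metadata` under the name used in the table (the distribution name for `torch_npu` may differ on some installs).

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn '2.1.0.post3' or '4.35.0+npu' into a comparable tuple like (2, 1, 0)."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def report(package, minimum=None):
    """Describe the installed version of a package, or say it is missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    if minimum is None:
        return f"{package}: {installed}"
    status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
    return f"{package}: {installed} ({status})"


if __name__ == "__main__":
    # Only transformers has an explicit minimum in the table above.
    for name, minimum in [("torch", None), ("torch_npu", None),
                          ("transformers", "4.35.0")]:
        print(report(name, minimum))
```

The comparison only looks at the first three numeric components, which is enough for a minimum like 4.35.0 but is not a full version parser.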

3. Inference script

Because this model is a Cross-encoder Reranker rather than a Causal LM, it cannot be deployed with vLLM. Inference uses native PyTorch with torch_npu.

Verified inference command:

python3 ~/maple77/bce-reranker-base_v1/inference.py

Core inference code:

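import os

import torch
import torch_npu  # registers the NPU backend, including Tensor.npu() and Module.npu()
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model; "~" is expanded explicitly because
# from_pretrained does not expand it
model_path = os.path.expanduser(
    "~/maple77/bce-reranker-base_v1/model/maple77/bce-reranker-base_v1")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=1)
model = model.npu()
model.eval()

# Example pair (the same one that appears in the smoke test output)
query = "What is the capital of China?"
passage = "Beijing is the capital of the People's Republic of China."

# Encode the query-passage pair
inputs = tokenizer(query, passage, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
inputs = {k: v.npu() for k, v in inputs.items()}

# Run inference; the single logit is the relevance score
with torch.no_grad():
    outputs = model(**inputs)
    score = outputs.logits.squeeze().item()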
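
The code above scores a single pair, while a reranker is normally applied to several candidate passages for one query. The following is a minimal sketch of how that could look, assuming a `model` and `tokenizer` loaded as shown above; `score_pairs` and `rank_by_score` are hypothetical helper names and are not part of inference.py.

```python
def score_pairs(model, tokenizer, query, passages, max_length=512):
    """Score every (query, passage) pair in one batched forward pass."""
    import torch  # imported lazily so rank_by_score below has no dependencies

    # Pair the same query with each passage; padding aligns the batch
    inputs = tokenizer([query] * len(passages), passages, padding=True,
                       truncation=True, max_length=max_length,
                       return_tensors="pt")
    # Move the inputs to whichever device the model already lives on
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    # One logit per pair: flatten to a plain Python list of floats
    return logits.view(-1).float().cpu().tolist()


def rank_by_score(passages, scores):
    """Return (passage, score) tuples sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]
```

A typical call would be `rank_by_score(passages, score_pairs(model, tokenizer, query, passages))`. Splitting scoring from ranking keeps the sorting logic testable without an NPU.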

4. Smoke test

python3 ~/maple77/bce-reranker-base_v1/inference.py

Expected output:

Loading tokenizer...
Loading model...
Moving model to NPU...
Running inference...
Query: What is the capital of China?
Passage: Beijing is the capital of the People's Republic of China.
Relevance Score: 0.5448
Inference successful!
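
If the smoke test needs to run unattended, the log above can be checked programmatically. This is a sketch only: it assumes the script prints the `Relevance Score:` and `Inference successful!` lines in the format shown above, and `parse_smoke_output` is a hypothetical helper.

```python
import re


def parse_smoke_output(text):
    """Return the relevance score if the log reports success, else None."""
    if "Inference successful!" not in text:
        return None
    match = re.search(r"Relevance Score:\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    import os
    import subprocess

    # Path taken from the command above; "~" is expanded for subprocess
    script = os.path.expanduser("~/maple77/bce-reranker-base_v1/inference.py")
    result = subprocess.run(["python3", script], capture_output=True, text=True)
    score = parse_smoke_output(result.stdout)
    print("smoke test passed" if score is not None else "smoke test failed", score)
```

The check deliberately does not compare against a fixed score, since the exact value may vary with software versions.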

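5. Performance reference

Test conditions: input length of about 30 tokens (query + passage), one query-passage pair per forward pass, 100 inference iterations on the NPU.

Metric	Value
Hardware	Ascend910
Mean latency	6.84 ms
P50 latency	6.76 ms
P90 latency	7.16 ms
P95 latency	7.27 ms
P99 latency	7.87 ms
Minimum latency	6.52 ms
Maximum latency	7.87 ms

Detailed performance results are in eval/performance.json.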
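
For readers who want to reproduce figures of this kind, the sketch below shows one way to collect and summarize per-iteration latencies. It is not the code in eval/performance_eval.py: the warmup count, the nearest-rank percentile method and all function names are assumptions made for illustration.

```python
import math
import time


def measure_latencies_ms(run_once, iterations=100, warmup=10, sync=None):
    """Time run_once() repeatedly and return per-iteration latencies in ms.

    sync is an optional callable that blocks until the device is idle; on
    an NPU this would typically be torch.npu.synchronize (an assumption).
    """
    for _ in range(warmup):  # warmup runs are not recorded
        run_once()
    if sync is not None:
        sync()
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Compute the same metrics as the table above."""
    values = sorted(latencies_ms)
    summary = {"mean": sum(values) / len(values),
               "min": values[0], "max": values[-1]}
    for p in (50, 90, 95, 99):
        summary[f"p{p}"] = percentile(values, p)
    return summary
```

Other percentile definitions (for example linear interpolation) give slightly different tail values for the same 100 samples, so the method should be stated alongside any reported numbers.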

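7. Deliverables

File	Description
`README.md`	This document
`inference.py`	NPU inference script
`eval/accuracy_eval.py`	Accuracy evaluation source code
`eval/performance_eval.py`	Performance evaluation source code
`eval/accuracy.json`	Accuracy results
`eval/performance.json`	Performance results
`eval/run_log.txt`	Full run log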

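8. Notes

This model is a Cross-encoder Reranker, not a Causal LM, and is not supported by vLLM
Load it with AutoModelForSequenceClassification from transformers
Both the model and the input tensors must be moved to the NPU (.npu())
torch_npu and transformers must be installed
A higher score means stronger query-passage relevance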

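Ascend NPU accuracy evaluation

NPU inference validation:

Metric	Value
Test cases	5
Prediction agreement	5/5 (100.0%)
Accuracy verdict	✅ Pass (accuracy 100.0%)

The accuracy evaluation source code and logs are in the eval/ directory.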
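
This README does not describe how prediction agreement is computed; eval/accuracy_eval.py is the authoritative source. Purely as an illustration, one plausible definition compares each NPU score with a reference score (for example from a CPU run) and counts a case as matching when the two differ by no more than a tolerance. The function name, the tolerance value and the scores in the usage example are all hypothetical.

```python
def prediction_agreement(reference_scores, npu_scores, atol=1e-2):
    """Count cases whose NPU score is within atol of the reference score.

    Returns (matches, total, percentage). The tolerance is an assumed
    value, not one taken from the repository.
    """
    if len(reference_scores) != len(npu_scores):
        raise ValueError("score lists must have the same length")
    total = len(reference_scores)
    matches = sum(1 for ref, got in zip(reference_scores, npu_scores)
                  if abs(ref - got) <= atol)
    percentage = 100.0 * matches / total if total else 0.0
    return matches, total, percentage


if __name__ == "__main__":
    # Hypothetical scores, for illustration only
    reference = [0.54, 0.12, 0.80, 0.33, 0.05]
    measured = [0.54, 0.12, 0.80, 0.33, 0.05]
    matches, total, pct = prediction_agreement(reference, measured)
    print(f"{matches}/{total} ({pct:.1f}%)")
```

An alternative definition would compare rankings or thresholded labels instead of raw scores, which is less sensitive to small numerical differences between devices.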