BAAI/bge-reranker-v2-minicpm-layerwise on Ascend NPU

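1. Introduction

This document records the adaptation and verification results for BAAI/bge-reranker-v2-minicpm-layerwise on Huawei Ascend910 NPUs.

bge-reranker-v2-minicpm-layerwise is a multilingual reranker model based on MiniCPM-2B. It supports layerwise output: users can take the score from an intermediate layer or from the final layer, trading speed against accuracy.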

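Model architecture: LayerWiseMiniCPMForCausalLM
Parameters: 2.72B
Precision: bfloat16
Supported languages: multilingual (best results on Chinese and English)
Inference framework: PyTorch + torch_npu
Layers: 40 (scoring is supported from layer 8 through layer 40)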
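
Since scoring is only supported for layers 8 through 40, it can help to check a requested layer list before calling the model. The following is a minimal sketch; the helper name `validate_cutoff_layers` and the two constants are hypothetical and not part of the model's code.

```python
from typing import Iterable, List

# Bounds taken from the specification above: scoring is supported for layers 8 to 40.
MIN_CUTOFF_LAYER = 8
MAX_CUTOFF_LAYER = 40


def validate_cutoff_layers(layers: Iterable[int]) -> List[int]:
    """Return the layers as a sorted, de-duplicated list, or raise ValueError."""
    result = sorted(set(layers))
    if not result:
        raise ValueError("cutoff_layers must contain at least one layer")
    bad = [n for n in result if not MIN_CUTOFF_LAYER <= n <= MAX_CUTOFF_LAYER]
    if bad:
        raise ValueError(
            f"layers outside [{MIN_CUTOFF_LAYER}, {MAX_CUTOFF_LAYER}]: {bad}"
        )
    return result
```

For example, `validate_cutoff_layers([12])` returns `[12]`, while `validate_cutoff_layers([7])` raises `ValueError`.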

2. Verification Environment

Component	Version
`NPU`	Ascend910 (9362)
`torch`	2.9.0
`torch-npu`	2.9.0.post1
`transformers`	4.48.3
`Python`	3.11.14

NPU: 2 logical devices (NPU 0 was already in use, so verification ran on NPU 1)
Model path: /opt/atomgit/models/bge-reranker-v2-minicpm-layerwise
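
To compare a local environment against the versions listed above, a small check with the standard library can be used. This is only a sketch: the helper names are hypothetical, and it assumes the packages are registered under the distribution names shown in the table (`torch`, `torch-npu`, `transformers`).

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# Versions copied from the table above; distribution names are an assumption.
EXPECTED_VERSIONS = {
    "torch": "2.9.0",
    "torch-npu": "2.9.0.post1",
    "transformers": "4.48.3",
}


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(
    expected: Dict[str, str], found: Dict[str, Optional[str]]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, found)} for every package whose version differs."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    found = installed_versions(EXPECTED_VERSIONS)
    for name, (want, got) in version_mismatches(EXPECTED_VERSIONS, found).items():
        print(f"{name}: expected {want}, found {got}")
```

A mismatch does not necessarily mean the model will fail; the table only records the combination that was verified.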

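3. Model Loading and Inference

Because the model relies on custom code (trust_remote_code) and the environment has no internet access, the model is loaded by importing the custom modules directly: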

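import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoTokenizer

model_path = "/opt/atomgit/models/bge-reranker-v2-minicpm-layerwise"  # from section 2

# Load the custom modules by hand. The paths are relative, so run this from the
# directory that contains both files. The relative import in the modeling file
# is replaced because the config class is already defined by the first exec().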
exec(open("configuration_minicpm_reranker.py").read())
exec(open("modeling_minicpm_reranker.py").read().replace(
    "from .configuration_minicpm_reranker import LayerWiseMiniCPMConfig", "# handled"
))

config = LayerWiseMiniCPMConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LayerWiseMiniCPMForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, config=config
)
model = model.to("npu:1").eval()

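# Inference: given a query and a passage, get the relevance score
query = "what is panda?"
passage = "The giant panda is a bear species endemic to China."
text = query + "\n" + passage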
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to("npu:1") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs, cutoff_layers=[12], output_hidden_states=True)

logits = outputs.logits
score = logits[0][0, -1].item() if isinstance(logits, tuple) else logits[0, -1].item()

4. Smoke Test

# Score query-passage relevance
pairs = [
    ("what is panda?", "The giant panda is a bear species endemic to China."),  # relevant
    ("what is panda?", "hi"),  # irrelevant
]

for query, passage in pairs:
    text = query + "\n" + passage
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {k: v.to("npu:1") for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs, cutoff_layers=[12], output_hidden_states=True)
    logits = outputs.logits
    score = logits[0][0, -1].item() if isinstance(logits, tuple) else logits[0, -1].item()
    print(f"Score: {score:.4f}")

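Results:

The model loaded onto the NPU successfully; load time was 7.14 seconds
The relevant passage scored higher (14.19) than the irrelevant passage (-3.05)
All test cases were ranked correctly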
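
In a reranking pipeline, scores like these are used to order candidate passages for a query. A minimal sketch of that step, using the two scores reported above; the helper name `rank_passages` is hypothetical:

```python
from typing import List, Sequence, Tuple


def rank_passages(
    passages: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each passage with its score and sort from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    return sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)


# Scores for the query "what is panda?" as reported in the results above.
ranked = rank_passages(
    ["hi", "The giant panda is a bear species endemic to China."],
    [-3.05, 14.19],
)
for passage, score in ranked:
    print(f"{score:8.2f}  {passage}")
```

Only the relative order matters here, so the raw scores can be sorted directly without any normalization.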

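5. Performance Reference

Test conditions: a single query-passage pair, sequence length 128, cutoff_layers=[12].

Metric	Value
Model parameters	2.72B
Model load time	7.14 s
Average time per inference	0.0134 s
Throughput	74.40 inferences/s
Number of runs	5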
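
The document does not say how these timings were collected. One way to take such a measurement is sketched below; the `benchmark` helper, its warm-up pass, and the optional `sync` hook are assumptions made for illustration, not the procedure actually used.

```python
import time
from typing import Callable, Optional, Tuple


def benchmark(
    fn: Callable[[], object],
    runs: int = 5,
    warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[float, float]:
    """Return (average seconds per call, calls per second) over `runs` timed calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):  # untimed calls to absorb one-off setup costs
        fn()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()  # wait for queued device work before stopping the clock
    average = (time.perf_counter() - start) / runs
    throughput = 1.0 / average if average > 0 else float("inf")
    return average, throughput
```

On an NPU, `fn` would wrap the forward pass and `sync` would presumably be `torch.npu.synchronize`, because device work is queued asynchronously and an unsynchronized timer can under-report latency. Throughput is simply the reciprocal of the average latency, which matches the table to within rounding (1 / 0.0134 s is about 74.6 per second).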

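6. Accuracy Evaluation

Test conditions: 3 query-passage groups, each with 1 relevant passage and 1 irrelevant passage.

Test case	Relevant passage score	Irrelevant passage score	Ranked correctly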
what is panda?	14.1875	-3.0469	PASS
what is the capital of France?	10.1875	5.6250	PASS
Tell me about machine learning	12.0000	-7.0312	PASS

Overall accuracy: PASS (all test cases were ranked correctly)
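
The PASS criterion above is a pairwise comparison: a case passes when the relevant passage scores strictly higher than the irrelevant one. A short sketch that recomputes the verdicts from the scores in the table; the function names are hypothetical:

```python
from typing import List, Tuple

# (query, relevant passage score, irrelevant passage score) from the table above.
RESULTS: List[Tuple[str, float, float]] = [
    ("what is panda?", 14.1875, -3.0469),
    ("what is the capital of France?", 10.1875, 5.6250),
    ("Tell me about machine learning", 12.0000, -7.0312),
]


def pairwise_accuracy(results: List[Tuple[str, float, float]]) -> float:
    """Fraction of cases where the relevant passage outscores the irrelevant one."""
    correct = sum(1 for _, relevant, irrelevant in results if relevant > irrelevant)
    return correct / len(results)


def smallest_margin(results: List[Tuple[str, float, float]]) -> Tuple[str, float]:
    """Return the query with the narrowest score gap, together with that gap."""
    query, relevant, irrelevant = min(results, key=lambda r: r[1] - r[2])
    return query, relevant - irrelevant


for query, relevant, irrelevant in RESULTS:
    verdict = "PASS" if relevant > irrelevant else "FAIL"
    print(f"{query}: margin={relevant - irrelevant:.4f} {verdict}")
print(f"pairwise accuracy: {pairwise_accuracy(RESULTS):.2f}")
```

The margin is worth reporting alongside the verdict: the France case passes with a gap of 4.5625, far narrower than the other two, which a bare PASS column hides.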

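7. Notes

The model uses custom code, so loading requires trust_remote_code=True or importing the custom modules directly
Without internet access, load the model with local_files_only=True or by direct import
cutoff_layers=[12] is recommended as a good balance between speed and accuracy
The valid range for cutoff_layers is [8, 40]
The configuration uses head_type=simple and head_multi=True, which gives one scoring head per layer
When running on NPU, choose an appropriate device (npu:0 or npu:1) and keep an eye on device memory usage
To map raw scores into the (0, 1) range, apply the sigmoid function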
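
The sigmoid mapping from the last note can be sketched in plain Python as follows. The branch on the sign of the input is a common numerical-stability precaution added here; it is not taken from the model's code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z lies in (0, 1)
    return z / (1.0 + z)


# Raw scores from the "what is panda?" test case.
for raw in (14.1875, -3.0469):
    print(f"{raw:+.4f} -> {sigmoid(raw):.6f}")
```

Sigmoid is monotonic, so it does not change the ranking; it only makes scores easier to threshold or to compare across queries. With tensors, `torch.sigmoid` applies the same mapping element-wise.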