gcw_IDzXRVNw/mxbai-rerank-xsmall-v1-ascend

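mxbai-rerank-xsmall-v1 Ascend NPU Deployment Guide

Project Overview

mxbai-rerank-xsmall-v1 is a document reranking (reranker) model developed by Mixedbread AI and built on the DebertaV2 architecture. It scores retrieved documents by relevance to a query, which improves the accuracy of search and RAG systems. It is the small-size variant and is suited to resource-constrained environments.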

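Features

Inference acceleration on Ascend NPU
CPU vs NPU precision comparison test (scores match within atol=1e-4; max relative error 0.19%)
DebertaV2 sequence classifier
About 36x speedup over CPU (36.86x measured in the precision test)
High-accuracy document reranking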

Requirements

Hardware: Huawei Ascend 910 series NPU
CANN: 8.0.RC1 or later
PyTorch: 2.0+ with torch_npu
Docker: container named test-modelagent
transformers: 4.38+

Directory Structure

mxbai-rerank-xsmall-v1-ascend/
├── inference.py          # Inference test script
├── log.txt               # Test log
├── README.md             # This document
├── test_sample.txt       # Test sample
├── inference_result.json # Inference results
└── precision_result.json # Precision test results

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

3. Prepare the model files

The model files are located in /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1/mixedbread-ai/mxbai-rerank-xsmall-v1/:

model.safetensors - model weights (about 142 MB)
config.json - model configuration
tokenizer.json / tokenizer_config.json - tokenizer files
spm.model - SentencePiece model
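
Before loading the model, it can help to confirm that all of the files listed above are present. The following is a minimal sketch, not part of inference.py; the helper name `missing_model_files` is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names as listed in this guide; adjust if your checkpoint differs.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "spm.model",
]


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    model_dir = "/data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1/mixedbread-ai/mxbai-rerank-xsmall-v1"
    missing = missing_model_files(model_dir)
    print("All model files found" if not missing else f"Missing: {missing}")
```

An empty list means every expected file was found.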

4. Install dependencies

pip install transformers torch_npu -i https://pypi.huaweicloud.com/repository/pypi/simple/
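
After installation, a quick check can confirm that PyTorch sees the NPU. This is a hedged sketch: the helper `pick_device` is hypothetical, and it assumes that importing torch_npu registers the `npu` backend so that `torch.npu.is_available()` can be called. It falls back to CPU if the import fails.

```python
def pick_device():
    """Return "npu:0" if torch_npu loads and an NPU is visible, else "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (assumed to register the "npu" backend)
    except (ImportError, OSError):
        # torch or torch_npu is not installed, or the CANN libraries failed to load
        return "cpu"
    return "npu:0" if torch.npu.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

If this prints `cpu` inside the container, re-run the `set_env.sh` step before retrying.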

Usage

Method 1: Normal Inference Mode

Run the inference script for document reranking:

cd /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend/

python3 inference.py --mode inference

Method 2: Precision Test Mode (CPU vs NPU)

Run the precision comparison test to verify that NPU results are consistent with CPU results:

cd /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend/

python3 inference.py --mode precision_test

Method 3: Full Test (Inference + Precision)

cd /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend/

python3 inference.py --mode all

Command-Line Arguments

Argument	Description	Default
`--mode`	Test mode: inference, precision_test, or all	`all`
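
The source of inference.py is not reproduced in this guide, so the following is only an illustrative sketch of how a `--mode` flag with these three values and this default could be declared with argparse. The function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser exposing the --mode flag described in the table above."""
    parser = argparse.ArgumentParser(description="mxbai-rerank-xsmall-v1 NPU test")
    parser.add_argument(
        "--mode",
        choices=["inference", "precision_test", "all"],
        default="all",
        help="Which test to run (default: all)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["--mode", "precision_test"])
print(args.mode)
```

Restricting the flag with `choices` makes argparse reject unknown modes with a usage error instead of running nothing silently.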

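Test Verification

Precision Test Results

Metric	Measured	Threshold	Status
Max relative error	0.1898%	< 1.00%	PASS
Max absolute error	7.81e-03	-	-
CPU inference time	1.495s	-	-
NPU inference time	0.041s	-	-
Speedup	36.86x	> 1x	PASS
Score consistency	Match (atol=1e-4)	-	PASS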
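
The metrics in this table can be computed along the following lines. This is a sketch under assumptions: the exact formulas used by inference.py are not shown in this guide, so relative error is taken here as |cpu - npu| / |cpu|, and the sample values are made up for illustration.

```python
def compare_outputs(cpu_values, npu_values, rel_threshold=0.01, eps=1e-12):
    """Compare two equal-length lists of floats element by element.

    Relative error is defined against the CPU value (an assumption).
    """
    abs_errors = [abs(c - n) for c, n in zip(cpu_values, npu_values)]
    rel_errors = [e / max(abs(c), eps) for e, c in zip(abs_errors, cpu_values)]
    max_rel = max(rel_errors)
    return {
        "max_abs_error": max(abs_errors),
        "max_rel_error": max_rel,
        "passed": max_rel < rel_threshold,
    }


def speedup(cpu_seconds, npu_seconds):
    """Ratio of CPU time to NPU time; values above 1 mean the NPU is faster."""
    return cpu_seconds / npu_seconds


# Hypothetical logits, not taken from the test log
result = compare_outputs([4.0, -2.0], [4.004, -2.002])
print(result)
print(f"{speedup(1.495, 0.041):.2f}x")
```

Dividing the rounded times from the table gives about 36.46x; the logged 36.86x was presumably computed from the unrounded timings.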

Performance Data

Operation	Time
NPU inference (3 documents)	0.656s
Precision test, CPU	1.495s
Precision test, NPU	0.041s

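Reranking Example

Query: "Who wrote 'To Kill a Mockingbird'?"

Rank	Relevance score	Document excerpt
1	0.9946	'To Kill a Mockingbird' is a novel by Harper Lee...
2	0.9839	Harper Lee, an American novelist widely known...
3	0.5010	The novel 'Moby-Dick' was written by Herman Melville...

Result: the model correctly identifies Harper Lee as the author of 'To Kill a Mockingbird'; the document stating this receives the highest relevance score.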
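
The ranking follows directly from sorting the per-document scores in descending order. The short sketch below uses the three raw scores from the `Scores:` line of the test log, which are listed in input order rather than rank order.

```python
# Scores in input order, copied from the test log
scores = [0.99462890625, 0.5009765625, 0.98388671875]

# Indices of the documents from most to least relevant
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

for rank, idx in enumerate(order, 1):
    print(f"{rank}. document[{idx}] score={scores[idx]:.4f}")
```

The second input document (the 'Moby-Dick' one) drops to last place, matching the table.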

Test Log

============================================================
mxbai-rerank-xsmall-v1 NPU Test
Model: mixedbread-ai/mxbai-rerank-xsmall-v1
Output: /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend
============================================================

============================================================
mxbai-rerank-xsmall-v1 Inference Test (NPU)
============================================================
Device: npu:0
Model: /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1/mixedbread-ai/mxbai-rerank-xsmall-v1
Loading tokenizer...
Loading model...
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 5362.81it/s]
Model loaded successfully
Query: Who wrote 'To Kill a Mockingbird'?
Documents: 3
Input shape: torch.Size([3, 48])
Logits shape: torch.Size([3, 1])
Scores: [0.99462890625, 0.5009765625, 0.98388671875]
Inference time: 0.656s
Reranked results:
  1. [score=0.9946] 'To Kill a Mockingbird' is a novel by Harper Lee published i...
  2. [score=0.9839] Harper Lee, an American novelist widely known for her novel ...
  3. [score=0.5010] The novel 'Moby-Dick' was written by Herman Melville and fir...

Inference result saved to /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend/inference_result.json

============================================================
Precision Test (CPU vs NPU)
============================================================
Using device: npu:0
Loading tokenizer...
Loading model on CPU...
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 4532.44it/s]
Loading model on npu:0...
Loading weights: 100%|██████████| 202/202 [00:00<00:00, 4531.76it/s]
Running inference on CPU...
Running inference on NPU...
CPU inference time: 1.495s
NPU inference time: 0.041s
Speedup: 36.86x
Max absolute error: 7.812500e-03
Max relative error: 0.1898% (threshold: 1.0%)
CPU score: 0.983887
NPU score: 0.983887
Scores match (atol=1e-4): True
Status: PASS

Precision result saved to /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend/precision_result.json

============================================================
Creating Test Sample
============================================================
Saved test sample: /data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1-ascend/test_sample.txt

============================================================
Test Complete!
============================================================

Python API Usage Examples

Basic Reranking

import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "/data/ysws/agentsp/5-16/mxbai-rerank-xsmall-v1/mixedbread-ai/mxbai-rerank-xsmall-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

# Move the model to the NPU and switch to eval mode (disables dropout)
model = model.to("npu:0").eval()

query = "Who wrote 'To Kill a Mockingbird'?"
documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960.",
    "The novel 'Moby-Dick' was written by Herman Melville.",
    "Harper Lee wrote 'To Kill a Mockingbird' and was born in 1926."
]

# Tokenize every (query, document) pair as one padded batch
pairs = [[query, doc] for doc in documents]
inputs = tokenizer(pairs, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to("npu:0") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Map logits to relevance scores in (0, 1), then sort from most to least relevant
scores = outputs.logits.squeeze(-1).sigmoid()
sorted_indices = torch.argsort(scores, descending=True).tolist()

for rank, idx in enumerate(sorted_indices, 1):
    print(f"{rank}. {documents[idx]} (score: {scores[idx].item():.4f})")

Use in a RAG System

def rerank_documents(query, retrieved_docs, top_k=3):
    """Return the top_k (document, score) pairs, most relevant first.

    Relies on the module-level tokenizer and model loaded in the basic example.
    """
    pairs = [[query, doc] for doc in retrieved_docs]
    inputs = tokenizer(pairs, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {k: v.to("npu:0") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    scores = outputs.logits.squeeze(-1).sigmoid()
    sorted_indices = torch.argsort(scores, descending=True)[:top_k].tolist()

    return [(retrieved_docs[i], scores[i].item()) for i in sorted_indices]
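
When the retriever returns many candidates, scoring them in fixed-size batches keeps memory use bounded while still benefiting from batched inference. The sketch below is illustrative only: `iter_batches` and `score_in_batches` are hypothetical helpers, and `score_fn` stands in for any callable that tokenizes one batch of pairs, runs the model, and returns a list of floats. A dummy scorer is used here so the example runs without an NPU.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_in_batches(pairs, score_fn, batch_size=32):
    """Apply score_fn to each batch of pairs and concatenate the scores in order."""
    scores = []
    for batch in iter_batches(pairs, batch_size):
        scores.extend(score_fn(batch))
    return scores


# Dummy scorer for demonstration: scores a pair by its document length
def dummy_score_fn(batch):
    return [float(len(doc)) for _query, doc in batch]


pairs = [["q", "a"], ["q", "bbb"], ["q", "cc"], ["q", "dddd"], ["q", ""]]
print(score_in_batches(pairs, dummy_score_fn, batch_size=2))
```

The batch size of 32 is an arbitrary default; a suitable value depends on sequence length and available NPU memory.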

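Model Architecture

Architecture: DebertaV2ForSequenceClassification
Encoder layers: 12
Hidden size: 384
Attention heads: 6
Feed-forward size: 1536
Vocabulary size: 128100

Component	Description
embeddings	DebertaV2 word embeddings
encoder	12-layer Transformer encoder
pooler	Pooling layer that feeds the classification head
classifier	Sequence classification head (outputs the relevance logit, converted to a score with sigmoid)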

Inference Configuration

Key parameters extracted from config.json:

{
  "model_type": "deberta-v2",
  "hidden_size": 384,
  "num_hidden_layers": 12,
  "num_attention_heads": 6,
  "intermediate_size": 1536,
  "vocab_size": 128100,
  "max_position_embeddings": 512,
  "attention_probs_dropout_prob": 0.1,
  "hidden_dropout_prob": 0.1
}
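
A few derived quantities follow from these values by simple arithmetic. The sketch below parses the same JSON and computes the per-head dimension, the feed-forward expansion ratio, and the size of the word-embedding matrix. The last figure assumes that the embedding width equals `hidden_size`, which this excerpt of the config does not state explicitly.

```python
import json

CONFIG_JSON = """
{
  "model_type": "deberta-v2",
  "hidden_size": 384,
  "num_hidden_layers": 12,
  "num_attention_heads": 6,
  "intermediate_size": 1536,
  "vocab_size": 128100,
  "max_position_embeddings": 512
}
"""

config = json.loads(CONFIG_JSON)

# Each attention head works on an equal slice of the hidden vector
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Feed-forward expansion factor relative to the hidden size
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

# Word-embedding matrix size (assumes embedding width == hidden_size)
embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim, ffn_ratio, embedding_params)
```

The dropout probabilities in the config apply only during training; calling `model.eval()` disables dropout at inference time.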

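FAQ

Q: What if the precision test fails?

A: Check that the NPU driver is installed correctly. In the precision test above, the CPU and NPU outputs of this DebertaV2 model agree closely, with a maximum relative error of about 0.19%.

Q: How can reranking be made faster?

A: Batching can significantly improve throughput. In the precision test, NPU inference took 0.041s versus 1.495s on CPU.

Q: Does the model support multiple languages?

A: This model targets English document reranking. For multilingual support, see the other models available from Mixedbread AI.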

References

Original model: https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1
Mixedbread AI: https://mixedbread.com
DebertaV2 paper: https://arxiv.org/abs/2006.03654
HuggingFace Transformers: https://huggingface.co/transformers

License

This project is released under the Apache-2.0 license.