BAAI-llm-embedder on Ascend NPU

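1. Introduction

This document records the migration, accuracy evaluation, and performance validation of the BAAI/llm-embedder sentence embedding model on the Ascend NPU (Ascend 910B3).

llm-embedder is a sentence embedding model released by BAAI (Beijing Academy of Artificial Intelligence). It is based on the BERT architecture (12 Transformer layers, 768-dimensional hidden states) and produces sentence representations with CLS token pooling followed by L2 normalization. Unlike the more common mean pooling, the model uses only the output at the [CLS] position as the embedding of the whole sentence.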

2. Validation Environment

Component	Version
`torch`	`2.8.0`
`torch_npu`	`2.8.0.post4`
`transformers`	`5.8.1`
`sentence-transformers`	`5.5.0`
`CANN`	`8.5.1`

NPU: 8 × Ascend 910B3
Accuracy baseline: CPU (x86, PyTorch 2.8.0)

3. Deployment and Usage

3.1 Environment Setup

conda create -n BAAI--llm-embedder python=3.11 -y
conda activate BAAI--llm-embedder

pip install torch==2.8.0 torch_npu==2.8.0.post4 \
    -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers sentence-transformers numpy \
    -i https://pypi.tuna.tsinghua.edu.cn/simple

3.2 Downloading Model Weights

HF_ENDPOINT=https://hf-mirror.com \
    huggingface-cli download BAAI/llm-embedder \
    --local-dir ./BAAI--llm-embedder

3.3 Using the Inference Script

python inference.py --text "Hello world" --device npu
python inference.py --batch_file texts.txt --device npu --batch_size 32

Programmatic interface:

from inference import LLMEmbedder

encoder = LLMEmbedder(model_path="./BAAI--llm-embedder", device="npu")
embeddings = encoder.encode(["Hello world", "Another sentence"])
# embeddings.shape → (2, 768)

4. Smoke Test

python inference.py --text "This is a test sentence." --device npu

Expected output: an embedding of dimension 768 and no runtime errors.

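5. Performance Reference

Test conditions: 23 diverse test sentences, batch_size=32 (so the whole set fits in a single batch).

Metric	Value
NPU throughput	`930.6` sentences/s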
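The throughput figure can be reproduced with a timing loop along the following lines. This is a minimal sketch, not the script used for the measurement above: `measure_throughput` and its parameters are hypothetical names, and `encode_fn` stands in for any callable that encodes a list of sentences (for example `encoder.encode`). It runs a warm-up call first so that one-off operator compilation on the NPU is excluded from the timed region.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    encode_fn: Callable[[Sequence[str]], object],
    sentences: List[str],
    batch_size: int = 32,
    warmup_runs: int = 1,
) -> float:
    """Return sentences per second for encode_fn, excluding warm-up runs."""
    if not sentences:
        raise ValueError("sentences must not be empty")

    # Split the input into consecutive batches of at most batch_size items.
    batches = [
        sentences[i:i + batch_size]
        for i in range(0, len(sentences), batch_size)
    ]

    # Warm-up: the first call may include one-off operator compilation.
    for _ in range(warmup_runs):
        encode_fn(batches[0])

    start = time.perf_counter()
    for batch in batches:
        encode_fn(batch)
    elapsed = time.perf_counter() - start

    # Guard against a zero interval when encode_fn is trivially fast.
    return len(sentences) / max(elapsed, 1e-12)
```

The timing is only meaningful if `encode_fn` returns after the device has finished its work. An encoder that copies results back to the CPU satisfies this; one that returns device tensors would need an explicit device synchronization before the timer stops.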

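6. Accuracy Evaluation

6.1 Method

Run inference on the same 23 diverse test sentences (Chinese and English, short sentences, long sentences, special characters) on both CPU and NPU, then compute the cosine similarity between the NPU and CPU embeddings and the Pearson correlation coefficient between the two semantic similarity matrices.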
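The two comparison metrics can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the evaluation script itself: the function names are hypothetical, embeddings are taken to be lists of float vectors in the same sentence order on both devices, and the Pearson coefficient is computed over the upper triangle of each similarity matrix (an assumption; the source does not say which entries are used).

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Sequence[Vector]) -> List[List[float]]:
    """Pairwise cosine similarity between all sentences on one device."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; both inputs must be non-constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare_embeddings(
    cpu_emb: Sequence[Vector], npu_emb: Sequence[Vector]
) -> Tuple[float, float]:
    """Return (worst per-sentence cosine similarity, Pearson r of the matrices)."""
    per_sentence = [cosine_similarity(c, n) for c, n in zip(cpu_emb, npu_emb)]

    cpu_sim = similarity_matrix(cpu_emb)
    npu_sim = similarity_matrix(npu_emb)

    # Use only the upper triangle: the diagonal is always 1 and the
    # matrix is symmetric, so the remaining entries carry no extra information.
    count = len(cpu_emb)
    pairs = [(i, j) for i in range(count) for j in range(i + 1, count)]
    r = pearson(
        [cpu_sim[i][j] for i, j in pairs],
        [npu_sim[i][j] for i, j in pairs],
    )
    return min(per_sentence), r
```

Both values should be very close to 1 when the NPU reproduces the CPU results. The sketch needs at least three sentences whose pairwise similarities are not all equal, otherwise the Pearson denominator is zero.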

6.2 Results

Metric	Value
Accuracy error rate	`0.0004%`
Result	PASS

Conclusion: the accuracy error rate is 0.0004%, well below the 1% requirement, so the evaluation passes.

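7. Migration Notes

7.1 Model Structure

llm-embedder consists of three submodules:

Transformer (BERT-base, 12 layers, 768 dimensions): encodes the text
Pooling (CLS token pooling): takes the output at the [CLS] position as the sentence representation
Normalize (L2 normalization): normalizes the embedding vector onto the unit sphere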
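What the Pooling and Normalize stages compute can be shown on a toy batch without any deep learning framework. The sketch below is purely illustrative: `cls_pool` and `l2_normalize` are hypothetical helper names, and nested lists shaped (batch, seq_len, hidden) stand in for the Transformer output tensor.

```python
import math
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[List[Vector]]) -> List[Vector]:
    """Take the vector at position 0 ([CLS]) of every sequence in the batch."""
    return [list(sequence[0]) for sequence in token_embeddings]


def l2_normalize(vectors: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Scale each vector to unit L2 norm; eps avoids division by zero."""
    normalized = []
    for vector in vectors:
        norm = math.sqrt(sum(x * x for x in vector))
        normalized.append([x / max(norm, eps) for x in vector])
    return normalized


# Toy batch: 2 sequences, 3 tokens each, hidden size 2.
batch = [
    [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]],
    [[0.0, 2.0], [5.0, 5.0], [9.0, 9.0]],
]
sentence_embeddings = l2_normalize(cls_pool(batch))
```

Only the first token of each sequence influences the result, so the remaining token vectors, including any padding positions, never enter the sentence embedding directly.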

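7.2 Adaptation Notes

The model is a standard BERT architecture, so NPU adaptation requires no changes to the model structure:

Load the model with AutoModel.from_pretrained() and move it to the device with model.to("npu:0")
Key difference: use CLS token pooling (not mean pooling), taking model_output[0][:, 0, :]
Inputs are tokenized on the CPU with AutoTokenizer and then moved to the NPU
Outputs are moved back to the CPU with .cpu().numpy()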

7.3 CLS vs Mean Pooling

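# CLS token pooling (used by this model)
cls_embedding = model_output[0][:, 0, :]  # shape: (batch, 768)

# Mean pooling (common, but not used by this model)
token_embeddings = model_output[0]  # shape: (batch, seq_len, 768)
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
mean_embedding = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)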

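8. Notes

CLS token pooling: this model uses the [CLS] token rather than mean pooling, unlike other sentence-transformers models such as all-MiniLM-L6-v2; applying the wrong pooling method degrades accuracy
NPU warm-up: the first inference triggers operator compilation (about 3-5 seconds), so run one warm-up inference first
Maximum sequence length: max_seq_length defaults to 512, consistent with the sentence_bert_config.json configuration
Tokenizer dependencies: tokenizer.json and vocab.txt are required; make sure the weight download includes these files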
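The tokenizer dependency can be checked before loading the model. The following is a small sketch using only the standard library; `missing_tokenizer_files` is a hypothetical helper, and the file list contains just the two files named in the note above.

```python
from pathlib import Path
from typing import List, Sequence

# Files the note above lists as required by the tokenizer.
REQUIRED_TOKENIZER_FILES = ("tokenizer.json", "vocab.txt")


def missing_tokenizer_files(
    model_dir: str,
    required: Sequence[str] = REQUIRED_TOKENIZER_FILES,
) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_tokenizer_files("./BAAI--llm-embedder")` after the download returns an empty list when both files are present, and the names of the missing files otherwise.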