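mxbai-edge-colbert-v0-32M is a 32M-parameter ColBERT-style sentence embedding model from mixedbread-ai. It maps sentences to a 384-dimensional dense vector space and is suited to tasks such as semantic search, clustering, and similarity computation. The model is based on the ModernBERT architecture and is optimized for efficient edge deployment.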
    mxbai-edge-colbert-v0-32m-ascend/
    ├── inference.py             # inference test script
    ├── log.txt                  # test log
    ├── README.md                # this document
    ├── test_sample.txt          # test samples
    ├── inference_result.json    # inference results
    └── precision_result.json    # precision test results

    docker exec -it test-modelagent bash

    source /usr/local/Ascend/ascend-toolkit/set_env.sh

The model files are located under `/data/ysws/agentsp/5-18-2/mxbai-edge-colbert-v0-32m/mixedbread-ai/mxbai-edge-colbert-v0-32m/`.

    pip install transformers torch_npu

    cd /data/ysws/agentsp/5-18-2/mxbai-edge-colbert-v0-32m-ascend/
    python3 inference.py

    cd /data/ysws/agentsp/5-18-2/mxbai-edge-colbert-v0-32m-ascend/
    python3 inference.py precision_test

| Metric | Measured | Threshold | Status |
|---|---|---|---|
| Cosine similarity | 1.000000 | ≈ 1.0 | PASS |
| Max relative error | 0.0191% | < 1% | PASS |
| NPU speedup | 3.69x | > 3x | PASS |

| Measurement | Value |
|---|---|
| CPU inference time | 0.053s |
| NPU inference time | 0.014s |
| Speedup | 3.69x |

| Input sentence | Embedding dimension | First 5 values |
|---|---|---|
| "Hello, how are you today?" | 384 | [0.0047, -0.0162, 0.0935, 0.0047, 0.0041] |

Result: the cosine similarity between the CPU and NPU outputs is 1.000000, so the two are nearly identical.
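The precision figures reported here (cosine similarity, maximum absolute error, maximum relative error) can be reproduced for any pair of embedding vectors with a few lines of NumPy. The sketch below is not taken from `inference.py`; the function names, the `cosine_floor` value, and in particular the definition of relative error (maximum absolute error divided by the largest reference magnitude) are assumptions made for illustration.

```python
import numpy as np


def compare_embeddings(reference, candidate):
    """Compare two embedding vectors of the same shape (e.g. CPU vs NPU output)."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()

    cosine = float(ref @ cand / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    max_abs_err = float(np.max(np.abs(ref - cand)))
    # Assumption: relative error is scaled by the largest reference magnitude.
    max_rel_err = max_abs_err / float(np.max(np.abs(ref)))
    return {"cosine": cosine, "max_abs_err": max_abs_err, "max_rel_err": max_rel_err}


def passes(metrics, cosine_floor=0.9999, rel_err_ceiling=0.01):
    """Hypothetical pass criteria: cosine close to 1.0 and relative error below 1%."""
    return metrics["cosine"] >= cosine_floor and metrics["max_rel_err"] < rel_err_ceiling


# Toy vectors standing in for a reference output and a slightly perturbed one.
ref = np.array([0.5, -0.25, 0.1])
cand = ref + np.array([1e-4, -5e-5, 0.0])

metrics = compare_embeddings(ref, cand)
ok = passes(metrics)
print(metrics, "PASS" if ok else "FAIL")
```

With real outputs, `reference` would be the CPU embedding and `candidate` the NPU embedding, both moved to host memory first.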
The full test log follows:
============================================================
mxbai-edge-colbert-v0-32M NPU Test
Output: /data/ysws/agentsp/5-18-2/mxbai-edge-colbert-v0-32m-ascend
============================================================
============================================================
mxbai-edge-colbert-v0-32M Inference Test (NPU)
============================================================
Device: npu:0
Model: /data/ysws/agentsp/5-18-2/mxbai-edge-colbert-v0-32m/mixedbread-ai/mxbai-edge-colbert-v0-32m
Loading tokenizer...
Loading model...
Loading weights: 100%|██████████| 62/62 [00:00<00:00, 7313.31it/s]
Input text: ['Hello, how are you today?']
Input shape: torch.Size([1, 9])
Embedding shape: torch.Size([1, 384])
Embedding sample (first 5): [0.004708211403340101, -0.016210922971367836, 0.09353338181972504, 0.004691631533205509, 0.00407747644931078]
Inference time: 0.300s
============================================================
Precision Test (CPU vs NPU)
============================================================
Loading model on CPU...
Loading weights: 100%|██████████| 62/62 [00:00<00:00, 4594.87it/s]
Running inference on CPU...
Loading model on NPU...
Loading weights: 100%|██████████| 62/62 [00:00<00:00, 5485.29it/s]
Running inference on NPU...
CPU inference time: 0.053s
NPU inference time: 0.014s
Speedup: 3.69x
Cosine similarity: 1.000000
Max absolute error: 9.805104e-05
Max relative error: 1.911812e-04 (0.0191%)
Status: PASS
============================================================
Creating Test Sample
============================================================
Saved test sample
1. Hello, how are you today?
2. The weather is nice today.
3. I am very happy to see you.
============================================================
Test Complete!

    import torch
    import torch_npu  # Ascend adapter; makes the "npu" device available to PyTorch
    from transformers import AutoTokenizer, AutoModel

    MODEL_DIR = "/data/ysws/agentsp/5-18-2/mxbai-edge-colbert-v0-32m/mixedbread-ai/mxbai-edge-colbert-v0-32m"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModel.from_pretrained(MODEL_DIR)
    model = model.to("npu:0")
    model.eval()

    def mean_pooling(model_output, attention_mask):
        # Average the token embeddings, ignoring padding positions.
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    sentences = ["Hello, how are you today?"]
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {k: v.to("npu:0") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    embeddings = mean_pooling(outputs, inputs['attention_mask'])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    print(f"Embeddings shape: {embeddings.shape}")  # torch.Size([1, 384])

    from sklearn.metrics.pairwise import cosine_similarity

    # Compute the similarity between two sentences
    emb1 = embeddings[0].cpu().numpy()
    # ... (compute emb2, the second sentence's embedding, the same way)
    similarity = cosine_similarity([emb1], [emb2])[0][0]
    print(f"Cosine similarity: {similarity:.4f}")

    sentences = [
        "First sentence for embedding",
        "Second sentence for comparison",
        "Third sentence in the batch"
    ]
    # return_tensors="pt" is required so that the values are tensors with a .to() method
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {k: v.to("npu:0") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    embeddings = mean_pooling(outputs, inputs['attention_mask'])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    print(f"Batch embeddings shape: {embeddings.shape}")  # torch.Size([3, 384])

| Component | Description |
|---|---|
| embeddings | ModernBERT token embeddings (vocab_size=50370) |
| encoder | 10-layer Transformer encoder |
| pooling | Mean pooling over token embeddings |
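To make the pooling step concrete, here is a minimal NumPy sketch of masked mean pooling followed by L2 normalization, run on toy data rather than real model output. It mirrors the logic of the `mean_pooling` helper used in this README but is an independent illustration; the function names `mean_pool` and `l2_normalize` and the toy tensors are assumptions. Because the pooled vectors are normalized to unit length, cosine similarity reduces to a plain dot product.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Masked mean over the token axis.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


# Toy batch: 2 "sentences", 3 token slots each, hidden size 2.
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],  # third slot is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)   # the padded slot does not affect row 0
unit = l2_normalize(pooled)
similarity = unit @ unit.T         # cosine similarity matrix of the batch
print(pooled)
print(similarity)
```

The large values in the padded slot are deliberate: they show that masked positions contribute nothing to the pooled vector.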
Key parameters extracted from config.json:

    {
      "hidden_size": 384,
      "intermediate_size": 576,
      "num_attention_heads": 6,
      "num_hidden_layers": 10,
      "vocab_size": 50370,
      "model_type": "modernbert",
      "classifier_pooling": "mean"
    }

A: Check that the NPU driver is installed correctly, and make sure the CANN environment script (`set_env.sh`) has been sourced.
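A: Batching can significantly improve throughput. In addition, the first inference call incurs a compilation overhead, so subsequent calls are faster (in the log above, the first inference took 0.300s, versus 0.014s for the NPU run in the precision test).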
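One way to keep the one-time overhead out of latency figures is to time the first call separately and average only the later calls. The helper below is a generic sketch, not code from `inference.py`: `time_inference`, its parameters, and the `FakeModel` stand-in are hypothetical, and the stand-in simulates compilation with a sleep so the example runs without an NPU.

```python
import time


def time_inference(run_once, extra_warmup=0, repeats=5, sync=None):
    """Return (first_call_seconds, mean_steady_seconds) for a zero-argument callable.

    sync, if given, is called after each run so that asynchronous device work
    is finished before the clock is read.
    """
    def timed():
        start = time.perf_counter()
        run_once()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    first = timed()                    # includes any one-time overhead
    for _ in range(extra_warmup):      # optional additional untimed runs
        timed()
    steady = [timed() for _ in range(repeats)]
    return first, sum(steady) / len(steady)


class FakeModel:
    """Stand-in for a model whose first call pays a one-time cost."""

    def __init__(self):
        self.compiled = False

    def __call__(self):
        if not self.compiled:
            time.sleep(0.05)           # simulated one-time compilation
            self.compiled = True
        time.sleep(0.001)              # simulated steady-state inference


first, steady = time_inference(FakeModel())
print(f"first call: {first:.3f}s, steady state: {steady:.4f}s")
```

With a real model, `run_once` would wrap the forward pass, and `sync` would be whatever device synchronization call the installed torch_npu version provides (assumed here to be `torch.npu.synchronize`).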
A: It can be used for:
This project is licensed under Apache-2.0.