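NeoBERT Ascend NPU Deployment Guide

Project Overview

NeoBERT is a high-performance BERT variant developed by nomic-ai. It incorporates several optimizations: the SwiGLU activation function, RMSNorm normalization, and rotary position embeddings (RoPE). The model maps text into a 768-dimensional dense vector space and is suited to natural language understanding, text classification, and embedding extraction.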

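Features

  • Inference acceleration on Ascend NPU
  • SwiGLU activation function (native PyTorch implementation, replacing xformers)
  • RMSNorm combined with RoPE for improved training stability
  • CPU vs. NPU precision comparison test (cosine-similarity-based error below 1%)
  • 28 Transformer layers with a 768-dimensional hidden size
  • Maximum sequence length of 4096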

Requirements

  • Hardware: Huawei Ascend 910 series NPU
  • CANN: 8.0.RC1 or later
  • PyTorch: 2.8.0+ with torch_npu
  • Docker: container name test-modelagent
  • transformers: 4.8+

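Directory Structure

NeoBERT-ascend/
├── inference.py          # Inference test script
├── neobert_module/       # Adapted model code
│   ├── model.py         # Main model code (SwiGLU replaced with a native PyTorch implementation)
│   └── rotary.py        # RoPE rotary position embedding implementation
├── log.txt               # Test log
└── README.md             # This document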

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh
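
After sourcing the environment script, it can help to confirm that PyTorch actually sees the NPU before running anything heavier. The following is a minimal sketch, assuming that importing torch_npu makes torch.npu.is_available() usable; pick_device is a hypothetical helper and is not part of inference.py.

```python
def pick_device(preferred: str = "npu:0") -> str:
    """Return `preferred` if an Ascend NPU is usable, otherwise "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the "npu" backend)

        available = torch.npu.is_available()
    except Exception:
        # torch or torch_npu is missing, or the CANN runtime failed to load
        return "cpu"
    return preferred if available else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

If this prints cpu inside the container, the most likely causes are a missing torch_npu install or set_env.sh not having been sourced in the current shell.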

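3. Prepare the model files

The model files are located in the /data/ysws/agentsp/5-15/NeoBERT/ directory:

  • model.safetensors - model weights
  • config.json - model configuration
  • tokenizer.json / vocab.txt - tokenizer files
  • tokenizer_config.json - tokenizer configuration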
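
A quick pre-flight check can catch an incomplete model directory before the model is loaded. This is a sketch only: missing_model_files is a hypothetical helper, and it assumes that either tokenizer.json or vocab.txt is sufficient for the tokenizer, which the list above does not state explicitly.

```python
from pathlib import Path

REQUIRED = ("model.safetensors", "config.json", "tokenizer_config.json")
# Assumption: one of these two tokenizer files is enough.
TOKENIZER_ALTERNATIVES = ("tokenizer.json", "vocab.txt")


def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    if not any((root / name).is_file() for name in TOKENIZER_ALTERNATIVES):
        missing.append(" or ".join(TOKENIZER_ALTERNATIVES))
    return missing


if __name__ == "__main__":
    problems = missing_model_files("/data/ysws/agentsp/5-15/NeoBERT")
    print("All model files present" if not problems else f"Missing: {problems}")
```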

4. Key adaptation points

xformers dependency issue: NeoBERT natively uses xformers.ops.SwiGLU, but that package has compatibility problems in the CANN environment. It has been replaced with a native PyTorch implementation:

# Original code (xformers)
from xformers.ops import SwiGLU
self.ffn = SwiGLU(config.hidden_size, config.intermediate_size)

# Adapted code (native PyTorch)
self.ffn_w1 = nn.Linear(config.hidden_size, intermediate_size, bias=False)
self.ffn_w3 = nn.Linear(config.hidden_size, intermediate_size, bias=False)
self.ffn_w2 = nn.Linear(intermediate_size, config.hidden_size, bias=False)
self.ffn_silu = nn.SiLU()
# forward: w2(silu(w1(x)) * w3(x))
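
To make the forward pass concrete, the snippet above can be wrapped in a stand-alone module. This is a minimal sketch rather than the code in neobert_module/model.py: the class name SwiGLUFeedForward is hypothetical, and the layer names simply mirror the adapted snippet.

```python
import torch
import torch.nn as nn


class SwiGLUFeedForward(nn.Module):
    """Hypothetical stand-alone SwiGLU block mirroring the adapted layers."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.ffn_w1 = nn.Linear(hidden_size, intermediate_size, bias=False)  # gate projection
        self.ffn_w3 = nn.Linear(hidden_size, intermediate_size, bias=False)  # value projection
        self.ffn_w2 = nn.Linear(intermediate_size, hidden_size, bias=False)  # output projection
        self.ffn_silu = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # w2(silu(w1(x)) * w3(x)): the SiLU-gated branch scales the value branch
        return self.ffn_w2(self.ffn_silu(self.ffn_w1(x)) * self.ffn_w3(x))


if __name__ == "__main__":
    ffn = SwiGLUFeedForward(hidden_size=768, intermediate_size=3072)
    x = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
    print(ffn(x).shape)  # torch.Size([2, 16, 768])
```

Because the block uses only nn.Linear and nn.SiLU, it runs on any backend that PyTorch supports and has no dependency on xformers.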

Usage

Method 1: Normal Inference Mode

Run the inference script to extract sentence embeddings:

cd /data/ysws/agentsp/5-15/NeoBERT-ascend/

python3 inference.py --mode inference --device npu:0

Method 2: Precision Test Mode (CPU vs NPU)

Run the precision comparison test to verify that NPU results are consistent with CPU results:

cd /data/ysws/agentsp/5-15/NeoBERT-ascend/

python3 inference.py --mode precision_test

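Command-Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --mode | Test mode: inference or precision_test | inference |
| --device | Device to run on | npu:0 (auto-detected) |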
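
The two arguments can be expressed with argparse as follows. This is a sketch of an equivalent parser, not the actual code in inference.py; build_parser is a hypothetical name, and the auto-detection mentioned in the table is not reproduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two documented options and their documented defaults."""
    parser = argparse.ArgumentParser(description="NeoBERT Ascend NPU test")
    parser.add_argument(
        "--mode",
        choices=["inference", "precision_test"],
        default="inference",
        help="Test mode",
    )
    parser.add_argument("--device", default="npu:0", help="Device to run on")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--mode", "precision_test"])
    print(args.mode, args.device)  # precision_test npu:0
```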

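Test Validation

Precision Test Results

| Metric | Measured | Threshold | Status |
| --- | --- | --- | --- |
| Cosine similarity | 0.9989 | > 0.99 | PASS |
| Angular error | 0.11% | < 1.00% | PASS |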
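
The two metrics appear to be linked by angular error = (1 - cosine similarity) x 100%: a cosine similarity of 0.9989 gives 0.11%, and the 0.999565 reported in the precision log gives 0.0435%. The sketch below encodes that relationship. It is inferred from the reported numbers, not taken from inference.py, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_error_percent(cos_sim):
    # Assumption inferred from the reported figures: (1 - cosine) as a percentage
    return (1.0 - cos_sim) * 100.0


def passes(cos_sim, threshold_percent=1.0):
    """True when the angular error is below the threshold (1% by default)."""
    return angular_error_percent(cos_sim) < threshold_percent


if __name__ == "__main__":
    print(f"{angular_error_percent(0.9989):.2f}%")  # 0.11%
    print(passes(0.9989))  # True
```

Under this definition the two thresholds in the table are equivalent: an angular error below 1.00% is the same condition as a cosine similarity above 0.99.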

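Performance Data

| Operation | Time |
| --- | --- |
| CPU inference time (1 sentence) | 0.634s |
| NPU inference time (1 sentence) | 0.287s |
| NPU speedup | ~2.2x |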
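
Timings like these are sensitive to one-time setup cost on the first call, so a small harness with untimed warm-up runs gives steadier numbers. The following is a sketch with hypothetical helper names (time_call, speedup); the workload is a stand-in, and on an NPU the timed function should also wait for the device to finish, for example with torch.npu.synchronize(), which is assumed to be available once torch_npu is imported.

```python
import time


def time_call(fn, warmup=1, repeats=5):
    """Average wall-clock seconds of fn() after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def speedup(cpu_seconds, npu_seconds):
    """How many times faster the NPU run is than the CPU run."""
    return cpu_seconds / npu_seconds


if __name__ == "__main__":
    # Stand-in workload; replace with a function that runs model inference.
    elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
    print(f"Average time: {elapsed:.6f}s")
    print(f"Speedup from the table: {speedup(0.634, 0.287):.1f}x")  # 2.2x
```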

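Example Inference Result

| Input sentence | Output shape | Inference time |
| --- | --- | --- |
| "This is a test sentence..." | [1, 13, 30522] | 0.287s |

Test Logs

Inference-mode log (log.txt)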

2026-05-15 14:11:12,148 - INFO - ============================================================
2026-05-15 14:11:12,148 - INFO - NeoBERT NPU 推理测试
2026-05-15 14:11:12,148 - INFO - ============================================================
2026-05-15 14:11:12,148 - INFO - Model dir: /data/ysws/agentsp/5-15/NeoBERT
2026-05-15 14:11:12,148 - INFO - Output dir: /data/ysws/agentsp/5-15/NeoBERT-ascend
2026-05-15 14:11:12,148 - INFO - NPU available: True
2026-05-15 14:11:12,149 - INFO - NPU device count: 8
2026-05-15 14:11:13,762 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:11:13,764 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:11:13,764 - INFO - ============================================================
2026-05-15 14:11:13,764 - INFO - Inference Test on npu:0
2026-05-15 14:11:13,764 - INFO - ============================================================
2026-05-15 14:11:18,426 - INFO - Device: npu:0
2026-05-15 14:11:18,426 - INFO - Loading tokenizer...
2026-05-15 14:11:18,967 - INFO - Tokenizer loaded: BertTokenizer
2026-05-15 14:11:18,967 - INFO - Loading model...
2026-05-15 14:11:23,705 - INFO - Model weights loaded
2026-05-15 14:11:24,957 - INFO - Model loaded successfully
2026-05-15 14:11:24,957 - INFO - Processing 3 sentences...
2026-05-15 14:11:24,982 - INFO - Input IDs shape: torch.Size([3, 16])
2026-05-15 14:11:25,414 - INFO - Inference time: 0.433s
2026-05-15 14:11:25,415 - INFO - Logits shape: torch.Size([3, 16, 30522])
2026-05-15 14:11:25,595 - INFO - Sample logits[0,0,:5]: [-12.375, -12.375, -12.375, -12.375, -12.375]
2026-05-15 14:11:25,641 - INFO - ============================================================
2026-05-15 14:11:25,642 - INFO - INFERENCE RESULT
2026-05-15 14:11:25,642 - INFO - ============================================================
2026-05-15 14:11:25,642 - INFO - Output shape: torch.Size([3, 16, 30522])
2026-05-15 14:11:25,642 - INFO - Inference time: 0.433s
2026-05-15 14:11:25,642 - INFO - ============================================================
2026-05-15 14:11:25,642 - INFO - Test Complete!
2026-05-15 14:11:25,642 - INFO - ============================================================

Precision-test-mode log (log_precision.txt)

2026-05-15 14:11:49,807 - INFO - ============================================================
2026-05-15 14:11:49,807 - INFO - NeoBERT NPU 推理测试
2026-05-15 14:11:49,808 - INFO - ============================================================
2026-05-15 14:11:49,808 - INFO - Model dir: /data/ysws/agentsp/5-15/NeoBERT
2026-05-15 14:11:49,808 - INFO - Output dir: /data/ysws/agentsp/5-15/NeoBERT-ascend
2026-05-15 14:11:49,808 - INFO - NPU available: True
2026-05-15 14:11:49,809 - INFO - NPU device count: 8
2026-05-15 14:11:51,463 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:11:51,464 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:11:51,464 - INFO - ============================================================
2026-05-15 14:11:51,464 - INFO - Precision Test: CPU vs NPU (threshold: 1.0%)
2026-05-15 14:11:51,464 - INFO - ============================================================
2026-05-15 14:11:56,229 - INFO - Loading tokenizer...
2026-05-15 14:11:56,799 - INFO - Loading model...
2026-05-15 14:12:06,804 - INFO - Running inference on CPU...
2026-05-15 14:12:07,494 - INFO - Running inference on NPU...
2026-05-15 14:12:08,026 - INFO - Logits CPU dtype: torch.float32, shape: torch.Size([1, 13, 30522])
2026-05-15 14:12:08,026 - INFO - Logits NPU dtype: torch.float32, shape: torch.Size([1, 13, 30522])
2026-05-15 14:12:08,027 - INFO - Sample CPU logits[0,0,:5]: [-12.592223 -12.588418 -12.59118  -12.587829 -12.588731]
2026-05-15 14:12:08,027 - INFO - Sample NPU logits[0,0,:5]: [-12.490069 -12.485993 -12.488589 -12.485324 -12.48616 ]
2026-05-15 14:12:08,033 - INFO - CPU inference time: 0.689s
2026-05-15 14:12:08,033 - INFO - NPU inference time: 0.288s
2026-05-15 14:12:08,034 - INFO - Max relative error: 4.429321e-02 (4.4293%)
2026-05-15 14:12:08,034 - INFO - Mean relative error: 2.666351e-02 (2.6664%)
2026-05-15 14:12:08,034 - INFO - Mean cosine similarity: 0.999565 (0.0435% angular error)
2026-05-15 14:12:08,034 - INFO - PASS: True (threshold: 1.0%, cosine similarity: 0.999565)
2026-05-15 14:12:08,158 - INFO - ============================================================
2026-05-15 14:12:08,158 - INFO - PRECISION TEST RESULT
2026-05-15 14:12:08,158 - INFO - ============================================================
2026-05-15 14:12:08,158 - INFO - Relative error: 4.345179e-04
2026-05-15 14:12:08,158 - INFO - CPU time: 0.689s
2026-05-15 14:12:08,158 - INFO - NPU time: 0.288s
2026-05-15 14:12:08,158 - INFO - PASS: True
2026-05-15 14:12:08,158 - INFO - ============================================================
2026-05-15 14:12:08,158 - INFO - Test Complete!
2026-05-15 14:12:08,158 - INFO - ============================================================

The complete test logs are saved in log.txt and log_precision.txt, respectively.

Python API Usage Examples

Basic inference

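import os
import sys

import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from safetensors.torch import load_file
from transformers import AutoTokenizer

OUTPUT_DIR = "/data/ysws/agentsp/5-15/NeoBERT-ascend"
MODEL_DIR = "/data/ysws/agentsp/5-15/NeoBERT"
DEVICE = "npu:0"

# Make the adapted model code importable
sys.path.insert(0, os.path.join(OUTPUT_DIR, "neobert_module"))
from model import NeoBERTLMHead, NeoBERTConfig

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
config = NeoBERTConfig.from_pretrained(MODEL_DIR, trust_remote_code=True)
model = NeoBERTLMHead(config=config)

# NeoBERTLMHead(config=...) only builds the architecture with random weights,
# so the checkpoint is loaded explicitly. Assumption: the keys in
# model.safetensors match the adapted module names; check the report below.
state_dict = load_file(os.path.join(MODEL_DIR, "model.safetensors"))
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"Missing keys: {missing}; unexpected keys: {unexpected}")

model = model.to(DEVICE).eval()

sentences = ["This is a test sentence", "Each sentence is processed"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

print(f"Logits shape: {outputs.logits.shape}")  # torch.Size([2, seq_len, 30522])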

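Precision validation

import numpy as np

# outputs_cpu / outputs_npu: results of running the same inputs through the
# model on the CPU and on the NPU (they are not defined in this snippet)
logits_cpu = outputs_cpu.logits.cpu().numpy()
logits_npu = outputs_npu.logits.cpu().numpy()

# Flatten both outputs and compute the cosine similarity
flat_cpu = logits_cpu.flatten()
flat_npu = logits_npu.flatten()
cosine_sim = np.dot(flat_cpu, flat_npu) / (np.linalg.norm(flat_cpu) * np.linalg.norm(flat_npu))
print(f"Cosine similarity: {cosine_sim:.6f}")  # passes if > 0.99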

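Model Architecture

  • Architecture type: NeoBERT (BERT variant + SwiGLU + RMSNorm + RoPE)
  • Encoder: 28 Transformer layers
  • Hidden size: 768
  • Attention heads: 12
  • Parameters: ~22.7M
  • Maximum length: 4096
  • Output: logits (vocab_size=30522)

| Component | Description |
| --- | --- |
| embeddings | Token embedding layer (vocab_size=30522) |
| layers | 28 Transformer encoder layers |
| SwiGLU | Feed-forward network (w1, w3, w2 + SiLU) |
| RMSNorm | Per-layer / per-attention normalization |
| RoPE | Rotary position embedding |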
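
For readers unfamiliar with the last two components, the sketch below shows RMSNorm and RoPE on plain Python lists. It illustrates the general techniques only: the function names are hypothetical, the RoPE variant rotates consecutive pairs with a base of 10000, and neobert_module/rotary.py may use a different pairing or base.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then scale element-wise by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle position * base**(-i/d)."""
    d = len(x)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


if __name__ == "__main__":
    out = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
    print([round(v, 4) for v in out])  # [0.8485, 1.1314]
    print(rope([1.0, 0.0, 1.0, 0.0], position=3))
```

Because RoPE is a pure rotation, it leaves vector norms unchanged, and the dot product of a rotated query and key depends only on the difference between their positions.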

Inference Parameter Configuration

Key parameters extracted from config.json:

{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "max_length": 4096,
  "vocab_size": 30522,
  "dim_head": 64,
  "norm_eps": 1e-05
}
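
These values can be cross-checked against each other and against the output shapes reported earlier. The sketch below embeds the same JSON as a string; head_dim_is_consistent and logits_shape are hypothetical helpers written for illustration.

```python
import json

CONFIG_TEXT = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "max_length": 4096,
  "vocab_size": 30522,
  "dim_head": 64,
  "norm_eps": 1e-05
}
"""


def head_dim_is_consistent(cfg: dict) -> bool:
    """The attention heads should tile the hidden size exactly."""
    return cfg["num_attention_heads"] * cfg["dim_head"] == cfg["hidden_size"]


def logits_shape(cfg: dict, batch_size: int, seq_len: int) -> list:
    """Expected LM-head output shape: one score per vocabulary entry per token."""
    if seq_len > cfg["max_length"]:
        raise ValueError("sequence exceeds max_length")
    return [batch_size, seq_len, cfg["vocab_size"]]


if __name__ == "__main__":
    cfg = json.loads(CONFIG_TEXT)
    print(head_dim_is_consistent(cfg))  # True
    print(logits_shape(cfg, batch_size=1, seq_len=13))  # [1, 13, 30522]
```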

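FAQ

Q: Which metric should be used for the precision test?

A: For models that combine SwiGLU, RMSNorm, and RoPE, use cosine similarity rather than maximum relative error as the primary metric. Element-wise relative error can be inflated by a few values close to zero, whereas cosine similarity reflects agreement across the whole output. In the precision test log above, the maximum relative error reaches 4.43% while the mean cosine similarity is 0.999565, which passes the 1% threshold.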
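
A small made-up example shows how the two metrics can diverge. The vectors below are illustrative only, not model outputs, and the helper names are hypothetical; the reference values are assumed to be non-zero.

```python
import math


def max_relative_error(ref, test):
    """Largest |test - ref| / |ref| over all elements (ref must be non-zero)."""
    return max(abs(t - r) / abs(r) for r, t in zip(ref, test))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    ref = [10.0, -8.0, 0.001]
    test = [10.01, -8.01, 0.0015]  # tiny absolute differences everywhere
    print(f"Max relative error: {max_relative_error(ref, test):.1%}")
    print(f"Cosine similarity:  {cosine_similarity(ref, test):.6f}")
```

The third element differs by only 0.0005 in absolute terms, yet it dominates the maximum relative error, while the cosine similarity stays very close to 1.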

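Q: How can inference speed be improved?

A: Batching can significantly increase throughput. The first inference also incurs compilation overhead, so subsequent runs are faster. In the measurements above, NeoBERT inference on the NPU is about 2.2x faster than on the CPU.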
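
One simple way to batch is to split the input sentences into fixed-size chunks and tokenize each chunk with padding, as in the basic inference example. The helper below is a hypothetical sketch of the chunking step only; a suitable batch size depends on sequence length and available NPU memory and is not specified here.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


if __name__ == "__main__":
    sentences = [f"sentence {i}" for i in range(5)]
    for batch in batched(sentences, batch_size=2):
        # Each batch would be passed to tokenizer(batch, padding=True, ...)
        print(len(batch), batch)
```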

References

  • Original model: https://huggingface.co/nomic-ai/neo_lm_standard
  • HuggingFace Transformers: https://huggingface.co/transformers
  • SwiGLU activation function: https://arxiv.org/abs/2002.05202
  • RMSNorm: https://arxiv.org/abs/1910.07467

License

This project is licensed under Apache-2.0.