nemotron-speech-streaming-en-0.6b Ascend NPU Deployment Guide

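Overview

nemotron-speech-streaming-en-0.6b is a streaming speech recognition model developed by NVIDIA. It uses the FastConformer-CacheAware-RNNT architecture, has 600M parameters, and supports low-latency streaming transcription of English.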

Features

Inference acceleration on Ascend NPU
CPU vs NPU precision comparison test (< 1% error)
Native streaming inference (cache-aware)
Punctuation and capitalization support
LibriSpeech test-clean WER: 2.31%

Requirements

Hardware: Huawei Ascend 910 series NPU
CANN: 8.0.RC1 or later
PyTorch: 2.0+ with torch_npu
Docker: container named test-modelagent

Directory Structure

/data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b-ascend/
├── inference.py          # inference and precision test script
├── log.txt               # test log
├── README.md             # this document
├── test_audio_0.wav      # test audio samples
├── test_audio_1.wav
├── test_audio_2.wav
└── fusion_result.json    # fusion results

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

3. Prepare the model file

Place the model file in the /data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b/ directory:

nemotron-speech-streaming-en-0.6b.nemo - NeMo model file (about 2.4 GB)

4. Install dependencies

pip install nemo-toolkit nemo_toolkit_asr -i https://repo.huaweicloud.com/repository/pypi/simple/ --trusted-host repo.huaweicloud.com

Usage

Option 1: Standard inference mode

Run the inference script to transcribe audio:

cd /data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b-ascend/

# Use the default test audio (random noise)
python3 inference.py

# Use a specific audio file
python3 inference.py --audio_path /path/to/your/audio.wav

# Select the device
python3 inference.py --device npu:0

Option 2: Precision test mode (CPU vs NPU)

Run the precision comparison test to verify that the NPU results are consistent with the CPU results:

cd /data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b-ascend/

# Run the full precision test
python3 inference.py --precision_test

# Set the number of tensors to test
python3 inference.py --precision_test --num_tensors 20

# Combine transcription of a file with the precision test
python3 inference.py --audio_path /path/to/audio.wav --precision_test

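Command-Line Arguments

Argument	Description	Default
`--model_path`	Path to the model file	`/data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b/nemotron-speech-streaming-en-0.6b.nemo`
`--audio_path`	Path to the input audio file (16 kHz WAV)	Auto-generated test audio
`--device`	Device to run on	`npu:0`
`--precision_test`	Run in precision test mode	`False`
`--num_tensors`	Number of tensors used in the precision test	`20`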
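
The flags in the table map naturally onto a standard argparse definition. The following is a minimal sketch of what such a parser could look like, not the actual contents of inference.py. The function name build_parser is hypothetical, and representing the "auto-generated test audio" default as None is an assumption.

```python
import argparse

# Default taken from the arguments table.
DEFAULT_MODEL_PATH = (
    "/data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b/"
    "nemotron-speech-streaming-en-0.6b.nemo"
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Transcription and CPU vs NPU precision test"
    )
    parser.add_argument("--model_path", default=DEFAULT_MODEL_PATH,
                        help="Path to the .nemo model file")
    # Assumption: None stands for "generate test audio automatically".
    parser.add_argument("--audio_path", default=None,
                        help="Path to a 16 kHz WAV file")
    parser.add_argument("--device", default="npu:0",
                        help="Device to run on")
    # store_true gives the documented default of False.
    parser.add_argument("--precision_test", action="store_true",
                        help="Run in precision test mode")
    parser.add_argument("--num_tensors", type=int, default=20,
                        help="Number of tensors used in the precision test")
    return parser
```

Calling build_parser().parse_args(["--precision_test", "--num_tensors", "5"]) would yield precision_test=True and num_tensors=5, with the remaining fields left at the defaults from the table.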

Test Verification

Precision test results

Metric	Measured	Threshold	Status
Max error (sum)	1.95e-03	< 1.00e+00	PASS
Max error (mean)	2.98e-08	< 1.00e-04	PASS
Max error (std)	5.96e-08	< 1.00e-03	PASS
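
One plausible reading of these metrics is that the sum, mean, and standard deviation of each tensor are computed on both devices and the absolute differences are compared against the thresholds. The sketch below illustrates that reading in pure Python. It is an assumption about the method, not the code in inference.py, and the names stat_errors and check are hypothetical.

```python
import statistics

# Thresholds taken from the results table.
THRESHOLDS = {"sum": 1.0, "mean": 1e-4, "std": 1e-3}


def stat_errors(cpu_values, npu_values):
    """Absolute difference of sum, mean and std between two value lists."""
    return {
        "sum": abs(sum(cpu_values) - sum(npu_values)),
        "mean": abs(statistics.fmean(cpu_values) - statistics.fmean(npu_values)),
        "std": abs(statistics.pstdev(cpu_values) - statistics.pstdev(npu_values)),
    }


def check(errors):
    """Map each statistic to PASS or FAIL using the thresholds above."""
    return {
        name: "PASS" if errors[name] < limit else "FAIL"
        for name, limit in THRESHOLDS.items()
    }


# Toy data: the "NPU" copy deviates from the "CPU" copy by a tiny offset.
cpu = [0.10, -0.20, 0.30, 0.40]
npu = [value + 1e-6 for value in cpu]
errors = stat_errors(cpu, npu)
print(check(errors))
```

Note that the sum error grows with the number of elements in a tensor, which may be why its threshold is much looser than the thresholds for the mean and the standard deviation.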

Performance data

Operation	Time
Model loading	25.37s
Inference (1 s of audio)	8.29s
CPU reference computation (20 tensors)	17.66s
NPU tensor readback (20 tensors)	0.08s
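
For a streaming model, a common way to read the inference number is the real-time factor (RTF), that is, processing time divided by audio duration. The short sketch below applies that definition to the measurement in the table. The helper name real_time_factor is hypothetical, and the table does not say whether the measured time includes one-off warm-up costs.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Values from the performance table: 8.29 s to process 1 s of audio.
rtf = real_time_factor(8.29, 1.0)
print(f"RTF: {rtf:.2f}")
```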

Test log

The full test log is saved in log.txt.

Python API Examples

Basic inference

import numpy as np
import torch
import torch_npu  # importing torch_npu registers the "npu" device with PyTorch
from nemo.collections.asr.models import EncDecRNNTBPEModel

model_path = "/data/ysws/agentsp/5-14/nemotron-speech-streaming-en-0.6b/nemotron-speech-streaming-en-0.6b.nemo"

# Restore the model from the .nemo checkpoint and move it to the NPU.
model = EncDecRNNTBPEModel.restore_from(model_path)
model = model.to("npu:0")
model.eval()

# One second of low-amplitude random noise at 16 kHz, used as a smoke-test input.
audio = np.random.randn(16000).astype(np.float32) * 0.01
audio_tensor = torch.from_numpy(audio).float().to("npu:0")

with torch.no_grad():
    hypothesis = model.transcribe(audio_tensor)[0]

transcription = hypothesis.text
print(f"Transcription: {transcription}")

Using a real audio file

import numpy as np
import scipy.io.wavfile as wavfile
import torch

# Reuses the `model` loaded in the basic inference example.
sample_rate, audio_data = wavfile.read("your_audio.wav")
# Scale to [-1.0, 1.0); this divisor assumes 16-bit PCM samples.
audio_data = audio_data.astype(np.float32) / 32768.0

audio_tensor = torch.from_numpy(audio_data).float().to("npu:0")

with torch.no_grad():
    hypothesis = model.transcribe(audio_tensor)[0]

print(f"Transcription: {hypothesis.text}")

Audio Requirements

Sample rate: 16 kHz
Format: WAV (mono)
Duration: at least 80 ms
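
These three requirements can be checked before a file is passed to the model. The sketch below uses only the standard-library wave module, which reads uncompressed PCM WAV files. The function name check_wav is hypothetical and is not part of inference.py.

```python
import wave

REQUIRED_RATE = 16000   # Hz, from the requirements above
MIN_DURATION_S = 0.080  # 80 ms, from the requirements above


def check_wav(path):
    """Return a list of problems found; an empty list means the file qualifies."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    if rate != REQUIRED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    if duration < MIN_DURATION_S:
        problems.append(f"duration is {duration * 1000:.0f} ms, expected at least 80 ms")
    return problems
```

A call such as check_wav("your_audio.wav") returns an empty list for a conforming file. Files that fail the check would need to be resampled or downmixed with a separate tool first.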

Model Structure

Architecture: FastConformer-CacheAware-RNNT
Encoder: 24-layer Cache-Aware FastConformer
Decoder: RNNT (Recurrent Neural Network Transducer)
Parameters: 600M
Tokenizer: SentencePieceTokenizer (1024 tokens)

Component	Description
preprocessor	Audio preprocessing (fbank features)
encoder	FastConformer encoder (24 layers)
decoder	RNNT decoder
joint	RNNT joint network

Per-Tensor Precision Details

Tensor name	Sum Error	Mean Error	Std Error
preprocessor.featurizer.window	1.53e-05	2.98e-08	0.00e+00
preprocessor.featurizer.fb	0.00e+00	1.46e-11	1.16e-10
encoder.pre_encode.out.weight	6.10e-05	0.00e+00	0.00e+00
encoder.layers.0.feed_forward1.linear1.weight	1.95e-03	4.66e-10	0.00e+00

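FAQ

Q: The precision test fails. What should I check?

A: Check that the NPU driver is installed correctly and that the CANN environment script has been sourced in the current shell.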
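
A quick way to run the checks from this answer is a small diagnostic script. The sketch below uses only the standard library. It assumes that set_env.sh exports the ASCEND_TOOLKIT_HOME variable, which may differ between CANN versions, and the function name check_environment is hypothetical.

```python
import importlib.util
import os


def check_environment():
    """Collect basic signs that the Ascend software stack is visible to Python."""
    return {
        # Assumption: sourcing set_env.sh exports ASCEND_TOOLKIT_HOME.
        "cann_env_sourced": "ASCEND_TOOLKIT_HOME" in os.environ,
        # find_spec locates a package without importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_npu_installed": importlib.util.find_spec("torch_npu") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If all three entries report ok, a reasonable next step is to import torch and torch_npu in the same shell and call torch.npu.is_available() to confirm that the driver exposes a device.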

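Q: Why is the transcription empty?

A: This is expected when the input is random noise, as with the default test audio: the model cannot recognize any meaningful text in it. Use real speech audio to get a correct transcription.

Q: How can I improve recognition quality?

A: Make sure the input audio is sampled at 16 kHz and that the recording is clear, with as little background noise as possible.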

References

Original model: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
NeMo framework: https://github.com/NVIDIA/NeMo
Streaming inference script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

License

This project is released under the NVIDIA Open Model License.