nemotron-speech-streaming-en-0.6b on Ascend NPU

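1. Introduction

This document records the adaptation and validation of nemotron-speech-streaming-en-0.6b on Huawei Ascend NPUs.

nemotron-speech-streaming-en-0.6b is a cache-aware FastConformer-RNNT streaming speech recognition model from the NVIDIA Nemotron family. It is built on the EncDecRNNTBPEModel architecture and supports English speech recognition. The model has about 618M parameters (618,084,865) and supports streaming inference with chunk sizes of 80 ms, 160 ms, 560 ms, and 1120 ms. Because it is an encoder-decoder ASR model, it cannot be served through vLLM, so this adaptation was validated by running inference directly with NeMo and torch_npu.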

2. Validation Environment

Component	Version
`CANN`	`8.5.1`
`torch-npu`	`2.9.0`
`torch`	`2.9.0`
`nemo-toolkit`	`2.7.3`
`transformers`	`4.57.6`
`soundfile`	`0.13.1`

NPU: 1 x Ascend910B2
Model parameters: 618,084,865 (about 618M)
Inference method: direct inference with NeMo + torch_npu

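3. Environment Setup

3.1 Install Dependencies

pip install nemo-toolkit hydra-core omegaconf lightning fiddle soundfile torch_npu

3.2 Triton Compatibility

In this environment triton does not support NPU devices (triton 3.6.0 cannot find a GPU driver), so two files need to be patched:

Patch 1 - triton/backends/__init__.py: in _discover_backends(), wrap the loading of each backend in try-except so that backends that fail to load are skipped.

Patch 2 - triton/runtime/driver.py: change _create_driver() to return None instead of raising an exception when there is no active driver.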
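
The two patches follow a common "tolerate missing backends" pattern. The sketch below illustrates that pattern in isolation; `discover_backends` and `create_driver` are hypothetical stand-ins written for this example and are not the actual triton source, whose internals may differ between versions.

```python
from typing import Callable, Dict, List, Optional


def discover_backends(loaders: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Load every backend that can be loaded and skip the ones that fail."""
    loaded = {}
    for name, load in loaders.items():
        try:
            loaded[name] = load()
        except Exception as exc:  # e.g. a GPU driver that is absent on an NPU host
            print(f"skipping backend {name!r}: {exc}")
    return loaded


def create_driver(active_drivers: List[object]) -> Optional[object]:
    """Return the first active driver, or None (instead of raising) when there is none."""
    if not active_drivers:
        return None
    return active_drivers[0]
```

Callers of the patched driver function then need to handle a None result, which is why this approach only suits environments where triton kernels are never actually launched.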

3.3 nv_one_logger Mock

nv_one_logger is an internal NVIDIA telemetry package that is not published on PyPI. This repository provides a mock module (the nv_one_logger_mock/ directory); add it to PYTHONPATH before use:

export PYTHONPATH=/path/to/nemotron-npu-adaptation/nv_one_logger_mock:$PYTHONPATH

3.4 One-Step Environment Setup

source setup_env.sh
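
After the environment is set up, a quick import check can confirm that the required packages (including the mock) are visible to Python. This is a minimal sketch and not part of the repository; the list of module names is an assumption based on the dependencies named above.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names for the packages listed in this document.
    required = ["torch", "torch_npu", "nemo", "soundfile", "nv_one_logger"]
    for name, ok in check_modules(required).items():
        print(f"{name:15s} {'ok' if ok else 'MISSING'}")
```

If nv_one_logger is reported as missing, PYTHONPATH most likely does not include the mock directory.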

4. Inference Validation

4.1 Running Inference

# Basic inference (NPU)
python inference.py --model_path /path/to/nemotron-speech-streaming-en-0.6b.nemo --device npu:0

# Inference with CPU precision comparison
python inference.py --model_path /path/to/nemotron-speech-streaming-en-0.6b.nemo --device npu:0 --precision_compare

# CPU inference
python inference.py --model_path /path/to/nemotron-speech-streaming-en-0.6b.nemo --device cpu
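
For reference, the three flags used above could be declared with argparse roughly as follows. This is a sketch of one plausible parser, not the actual contents of inference.py; the defaults and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Run nemotron-speech-streaming-en-0.6b inference")
    parser.add_argument("--model_path", required=True,
                        help="path to the .nemo checkpoint")
    # Assumed default; the commands above always pass --device explicitly.
    parser.add_argument("--device", default="npu:0",
                        help="target device, e.g. npu:0 or cpu")
    parser.add_argument("--precision_compare", action="store_true",
                        help="also run on CPU and compare the outputs")
    return parser


args = build_parser().parse_args(["--model_path", "model.nemo", "--device", "cpu"])
print(args.model_path, args.device, args.precision_compare)  # model.nemo cpu False
```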

4.2 Forward-Pass Validation with Synthetic Audio

A 1-second, 16 kHz synthetic sine wave is used to validate the full encoder-decoder flow:

import numpy as np
sample_rate = 16000  # Hz
duration = 1.0       # seconds
num_samples = int(sample_rate * duration)
t = np.linspace(0, duration, num_samples, dtype=np.float32)
# 440 Hz sine wave at half amplitude
audio_np = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

Results:

Preprocessed shape: [1, 128, 101]
Encoder output shape: [1, 1024, 14]
The forward pass succeeds with no operator errors.
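
The reported time dimensions can be sanity-checked with simple frame arithmetic. The sketch below assumes a 10 ms hop (160 samples at 16 kHz) with a centered STFT, and three stride-2 subsampling stages that each map L frames to floor(L/2) + 1. These values are assumptions chosen because they reproduce the reported 101 and 14 frames; they were not read from the model configuration.

```python
def mel_frames(num_samples: int, hop: int = 160) -> int:
    """Frame count of a centered STFT (assumed hop of 160 samples)."""
    return num_samples // hop + 1


def subsampled_frames(frames: int, stages: int = 3) -> int:
    """Apply the assumed per-stage rule L -> floor(L / 2) + 1."""
    for _ in range(stages):
        frames = frames // 2 + 1
    return frames


print(mel_frames(16000))       # 101
print(subsampled_frames(101))  # 14
```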

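4.3 Transcribe Method Validation

import soundfile as sf
import torch

# audio_np and sample_rate come from section 4.2; model is the loaded NeMo model
test_audio_path = '/tmp/test_synthetic_nemotron.wav'
sf.write(test_audio_path, audio_np, sample_rate)

with torch.no_grad():
    transcripts = model.transcribe([test_audio_path], batch_size=1)
    print(f"Transcription: {transcripts[0].text}")

Results:

The transcribe method runs correctly.
The synthetic audio transcribes to an empty string (expected, since it contains no speech).
The full inference flow passes: model loading -> forward pass -> decoding and transcription.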

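4.4 Performance Benchmark

The preprocessor + encoder were benchmarked on the synthetic audio with 3 warmup runs followed by 10 timed runs:

Metric	Value
Mean forward-pass time	`85.13 ms`
Timed runs	`10`
Input length	`1 s @ 16 kHz`
NPU	`Ascend910B2`

Note: this benchmark measures the preprocessor + encoder forward pass only and does not include the full RNNT beam-search decoding pass.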
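
A warmup-then-measure loop of this kind can be sketched as below. It is an illustration rather than the benchmark code used for the table; `benchmark` and its `sync` argument are names introduced here. On an accelerator, execution is asynchronous, so a synchronization call (for example `torch.npu.synchronize`) would normally be passed as `sync` to make the timings meaningful.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warmup work has finished before timing starts
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device to finish this iteration
        times.append((time.perf_counter() - start) * 1000.0)
    return sum(times) / len(times)


# Example with a CPU-only workload; no synchronization is needed here.
print(f"{benchmark(lambda: sum(range(100_000))):.2f} ms")
```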

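5. Accuracy Evaluation

5.1 NPU vs CPU Accuracy Comparison

The same synthetic input (a 1-second, 16 kHz sine wave) is run through separate model instances on the CPU and the NPU, and the preprocessor and encoder output tensors are compared:

Metric	Value
Preprocessor MAE	`0.0000075487`
Preprocessor Max Error	`0.0001325607`
Encoder MSE	`1e-10`
Encoder MAE	`0.0000083414`
Max Absolute Error	`0.0000735112`
Relative Error	`0.055203%`
Verdict	PASSED (< 1% error)

The relative error between the NPU and CPU encoder outputs is only 0.055%, well below the 1% accuracy threshold.

Output range comparison:

Device	Min	Max
NPU	-0.226714	0.263296
CPU	-0.226745	0.263342

CPU vs CPU consistency check: MAE = 0.0 and max error = 0.0 (identical outputs), which confirms that the comparison method itself introduces no error.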
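
The metrics in the table can be computed along the following lines. This is a sketch, not the comparison code from inference.py. In particular, the document does not define "Relative Error"; the version here (mean absolute error divided by the mean absolute value of the reference) is an assumption.

```python
import numpy as np


def compare_outputs(test, ref):
    """Error metrics between a test tensor (e.g. NPU) and a reference (e.g. CPU)."""
    test = np.asarray(test, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    diff = np.abs(test - ref)
    return {
        "mae": float(diff.mean()),
        "mse": float((diff ** 2).mean()),
        "max_abs": float(diff.max()),
        # Assumed definition: MAE relative to the mean magnitude of the reference.
        "rel_pct": float(diff.mean() / np.abs(ref).mean() * 100.0),
    }


metrics = compare_outputs([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

Passing the same array as both arguments gives zero for every metric, which mirrors the CPU vs CPU consistency check.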

5.2 GPU Reference Accuracy (from the HuggingFace Model Card)

The following WER (Word Error Rate) figures were published by NVIDIA for GPU inference with a 1.12-second chunk size:

Dataset	WER (%)
LibriSpeech test-clean	2.32
LibriSpeech test-other	4.84
AMI (ihm)	11.73
Earnings22	12.52
Gigaspeech	9.66
SPGI Speech	2.97
TEDLIUM	3.5
VoxPopuli (en)	7.97

Source: HuggingFace model card - nvidia/nemotron-speech-streaming-en-0.6b
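
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, deletions) between a hypothesis and a reference, divided by the number of reference words. The sketch below is a plain implementation for illustration; it applies no text normalization, which published benchmarks typically do, so it will not reproduce the numbers above exactly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; the reference must contain at least one word."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
```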

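5.3 Direct Comparison with GPU

Note: this environment has no NVIDIA GPU, so a direct NPU vs GPU inference comparison could not be run. The GPU figures above come from the benchmark data on NVIDIA's official HuggingFace model card, obtained with the same model weights and the NeMo framework on GPU.

Given the NPU vs CPU result (0.055% relative error), NPU inference accuracy can reasonably be expected to match GPU accuracy, although WER has not been measured directly on the NPU.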

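6. Notes

NeMo format: transformers 4.57.6 does not natively support the nemotron_speech_streaming architecture, so the HuggingFace-format model.safetensors cannot be loaded directly through AutoModel. Prefer the NeMo .nemo format for inference.
CUDA Graphs: NeMo's RNNT decoder uses CUDA graphs to speed up the decoding loop. On the NPU it falls back to the regular mode, so decoding is slightly slower than on GPU, but correctness is not affected. The "No conditional node support for Cuda" warning in the log is expected.
triton dependency: NeMo's context biasing submodule depends on triton (a GPU kernel framework), which must be patched in an NPU environment to skip GPU driver detection.
NPU device index: the physical card number shown by npu-smi may differ from the logical index used by torch.npu. In this environment the logical index is 0, so scripts should use npu:0.
Data types: the NPU does not yet support the double type; double tensors are automatically cast to float, which does not affect inference results.
Cache-Aware Streaming: the model supports cache-aware streaming inference, and the chunk size (80 ms / 160 ms / 560 ms / 1120 ms) can be configured through the att_context_size parameter. On the NPU the streaming inference logic is the same as on GPU.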
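
The four chunk sizes listed for cache-aware streaming are consistent with a simple relation between the attention right context and the chunk duration. The sketch below assumes that att_context_size is a [left, right] pair counted in encoder frames and that one encoder frame spans 80 ms. The right-context values 0, 1, 6, and 13 are the ones that reproduce the documented chunk sizes under that assumption and should be checked against the model card before use.

```python
FRAME_MS = 80  # assumed encoder frame duration (10 ms hop x 8x subsampling)


def chunk_ms(right_context: int, frame_ms: int = FRAME_MS) -> int:
    """Chunk duration covered by the current frame plus its right context."""
    return (right_context + 1) * frame_ms


for right in (0, 1, 6, 13):
    print(f"right context {right:2d} -> {chunk_ms(right)} ms")
```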

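7. Adaptation Files

File	Description
`inference.py`	NPU inference script with argparse options and precision comparison
`setup_env.sh`	One-step environment setup script
`nv_one_logger_mock/`	Mock replacement for the nv_one_logger module
`README.md`	This document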
