gcw_AVRCax4T/wenet-u2pp-conformer-wenetspeech-onnx-online-20220506

WeNet U2++ Conformer (WenetSpeech) - Ascend NPU Adaptation

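Model Description

This repository contains an NPU-adapted version of the WeNet U2++ Conformer streaming speech recognition model. The model was originally trained on the WenetSpeech dataset and exported to ONNX.

Model: wenet-u2pp-conformer-wenetspeech-onnx-online-20220506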
Architecture: U2++ Conformer (unified two-pass: Conformer encoder + CTC + bidirectional attention decoder)
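Task: Mandarin Chinese automatic speech recognition (ASR)
Type: streaming (online) speech recognition
Framework: ONNX → torch_npu (Ascend NPU)
Vocabulary: 5,538 tokens (Chinese characters + SentencePiece subwords)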

NPU Adaptation Summary

Metric	Result
NPU platform	Ascend910 (2x NPU)
CANN version	8.5.1
Adaptation method	onnx2torch → torch_npu
Encoder MAPE	0.8653% ✅ (< 1%)
CTC MAPE	0.0213% ✅ (< 1%)
Decoder MAPE	0.0034% ✅ (< 1%)
Token accuracy	100.00% (argmax match)
Cosine similarity	>0.999999 (all models)

Quick Start

Requirements

pip install torch torch_npu torchaudio onnx onnxruntime onnx2torch soundfile scipy

Download the Model

pip install modelscope
modelscope download --model manyeyes/wenet-u2pp-conformer-wenetspeech-onnx-online-20220506 --local_dir ./model

Inference

# CPU inference (onnxruntime baseline)
python inference.py --backend cpu --wav test_wavs/0.wav

# NPU inference (torch_npu via onnx2torch)
python inference.py --backend npu --wav test_wavs/0.wav

# Evaluate NPU vs CPU accuracy
python evaluate_npu.py

Sample Output

Test audio 0.wav (3.55 s):

Greedy text:  朱立楠在上书战的上
Beam+Attn:    朱立楠在上书战的上
RTF (CPU):    0.2462
RTF (NPU):    0.2833
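
The source does not define RTF; the sketch below assumes the usual meaning, processing time divided by audio duration, so values below 1 mean faster than real time. The function name is hypothetical, and the 0.8729 s input is the CPU total from the Performance table, used only as an illustration.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative values: 3.55 s of audio processed in 0.8729 s.
rtf = real_time_factor(0.8729, 3.55)
print(f"RTF: {rtf:.4f}")
```

With these inputs the result is about 0.246, close to the reported CPU figure. The small gap is plausible if the script times the full pipeline rather than summing the three model timings.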

Architecture

Audio (16kHz) → FBank [T, 80] → Streaming Conformer Encoder
                                     ↓
                              [T_enc, 512]
                                ↙        ↘
                CTC [T_enc, 5538]    Attention Decoder
                         ↓                 ↓
                   CTC Greedy      Decoder Rescore
                         ↓                 ↓
                       Combined → Final Text
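
The "CTC Greedy" branch of the diagram can be sketched as follows: take the argmax token per encoder frame, collapse consecutive repeats, then drop blanks. This is a minimal illustration, not the repository's `inference.py`. The blank id of 0 is an assumption (check `tokens.txt`), and the frame ids and two-entry vocabulary are made up for the example.

```python
from typing import Dict, List, Optional, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse consecutive repeated ids, then remove blank ids."""
    out: List[int] = []
    prev: Optional[int] = None
    for tok in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank_id:
            out.append(tok)
        prev = tok
    return out


# Hypothetical per-frame argmax ids and a toy vocabulary (not the real tokens.txt).
frame_ids = [0, 7, 7, 0, 0, 9, 9, 9, 0, 7]
id_to_token: Dict[int, str] = {7: "你", 9: "好"}

token_ids = ctc_greedy_decode(frame_ids)
text = "".join(id_to_token[i] for i in token_ids)
print(token_ids, text)
```

A blank between two identical ids keeps both, which is why the trailing 7 is emitted again here.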

Model Components

Component	Input	Output	Size
Encoder	chunk [1,T,80] + caches	[1,T_enc,512] + new caches	343 MB
CTC	hidden [1,T_enc,512]	[1,T_enc,5538]	11 MB
Decoder	hyps + encoder_out	[NBEST,L,5538]	157 MB

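NPU Adaptation Method

The ONNX models are converted to PyTorch FX GraphModules with onnx2torch and then executed on the Ascend NPU through torch_npu. This approach:

Preserves the original model weights and computation graph
Runs on the NPU without ATC model conversion (per-model timings are listed under Performance)
Matches the CPU reference argmax output exactly (100% token accuracy)
Keeps MAPE below 1% for all three sub-models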

Conversion Flow

ONNX Model → onnx2torch.convert() → torch.fx.GraphModule → .to("npu") → torch_npu inference
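
The flow above might look like the following in code. This is a sketch under stated assumptions, not the repository's actual script: `backend_to_device` and `load_onnx_as_torch` are hypothetical helper names, `"npu:0"` assumes the first NPU is the target, and the heavy dependencies are imported lazily so the file still imports on machines without torch_npu.

```python
def backend_to_device(backend: str) -> str:
    """Map a --backend value ("cpu" or "npu") to a torch device string."""
    devices = {"cpu": "cpu", "npu": "npu:0"}  # "npu:0": assumed first NPU
    if backend not in devices:
        raise ValueError(f"unsupported backend: {backend!r}")
    return devices[backend]


def load_onnx_as_torch(onnx_path: str, device: str = "cpu"):
    """Convert an ONNX file to a torch.fx.GraphModule and move it to `device`."""
    import onnx
    from onnx2torch import convert

    if device.startswith("npu"):
        # Importing torch_npu registers the "npu" device with PyTorch.
        import torch_npu  # noqa: F401

    onnx_model = onnx.load(onnx_path)
    module = convert(onnx_model)  # onnx2torch returns a torch.fx.GraphModule
    return module.eval().to(device)


# Example call (the path is an assumption based on --local_dir ./model):
#   encoder = load_onnx_as_torch("model/encoder.onnx", backend_to_device("npu"))
```

Inputs passed to the converted module need to be moved to the same device before the forward call.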

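Performance

Model	CPU (s)	NPU (s)	Speedup
Encoder	0.4749	0.7772	0.61x
CTC	0.0051	0.0026	1.97x
Decoder	0.3929	0.2247	1.75x
Total	0.8729	1.0045	0.87x

Note: the encoder's NPU time includes onnx2torch graph optimization overhead. Converting the model to OM format with ATC may improve it further. Speedup is CPU time divided by NPU time, so values below 1x mean the NPU run was slower.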
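
As a quick check of the table, the speedup column can be recomputed as CPU time divided by NPU time. The timings below are copied from the table; the `speedup` helper is a hypothetical name used only for this illustration.

```python
# Timings in seconds copied from the performance table: (CPU, NPU).
timings = {
    "encoder": (0.4749, 0.7772),
    "ctc": (0.0051, 0.0026),
    "decoder": (0.3929, 0.2247),
}


def speedup(cpu_s: float, npu_s: float) -> float:
    """Speedup of NPU over CPU; values below 1 mean the NPU was slower."""
    return cpu_s / npu_s


total_cpu = sum(cpu for cpu, _ in timings.values())
total_npu = sum(npu for _, npu in timings.values())

for name, (cpu_s, npu_s) in timings.items():
    print(f"{name}: {speedup(cpu_s, npu_s):.2f}x")
print(f"total: {speedup(total_cpu, total_npu):.2f}x")
```

The encoder, decoder, and total ratios round to the table's values. The CTC ratio comes out at 1.96x from the rounded timings rather than the reported 1.97x, presumably because the table was computed from unrounded measurements.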

Accuracy Verification

Full accuracy comparison between CPU (onnxruntime) and NPU (torch_npu) outputs:

Model	Max abs. error	Mean abs. error	MAPE	Cosine similarity	Token accuracy
Encoder	0.01127	0.00048	0.8653%	0.99999945	100.00%
CTC	0.04132	0.00461	0.0213%	0.99999997	100.00%
Decoder	0.00240	0.00035	0.0034%	1.00000000	100.00%
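
The metrics in this table could be computed along the following lines. This is a sketch rather than the contents of `evaluate_npu.py`: the exact MAPE formula (including the `eps` guard) and the use of last-axis argmax for token accuracy are assumptions, `compare_outputs` is a hypothetical name, and the arrays are synthetic stand-ins for real model outputs.

```python
import numpy as np


def compare_outputs(ref: np.ndarray, test: np.ndarray, eps: float = 1e-8) -> dict:
    """Compare a reference (CPU) array with a test (NPU) array of equal shape."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    abs_err = np.abs(ref - test)
    cosine = float(
        ref.ravel() @ test.ravel()
        / (np.linalg.norm(ref) * np.linalg.norm(test) + eps)
    )
    return {
        "max_abs_err": float(abs_err.max()),
        "mean_abs_err": float(abs_err.mean()),
        # Assumed MAPE definition: mean of |error| / |reference|, in percent.
        "mape_pct": float((abs_err / (np.abs(ref) + eps)).mean() * 100.0),
        "cosine": cosine,
        # Percentage of positions whose argmax over the last axis agrees.
        "token_acc_pct": float((ref.argmax(-1) == test.argmax(-1)).mean() * 100.0),
    }


# Synthetic stand-ins for CPU and NPU outputs of shape [frames, vocab].
rng = np.random.default_rng(0)
cpu_out = rng.normal(size=(20, 50))
npu_out = cpu_out + rng.normal(scale=1e-4, size=cpu_out.shape)

metrics = compare_outputs(cpu_out, npu_out)
for key, value in metrics.items():
    print(f"{key}: {value:.8f}")
```

MAPE is sensitive to reference values near zero, which may explain why the encoder's MAPE is the largest of the three even though its absolute errors are small.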

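Files

File	Description
`inference.py`	Main inference script (CPU + NPU backends)
`evaluate_npu.py`	NPU vs. CPU accuracy evaluation script
`evaluation_results.json`	Full accuracy and performance metrics
`encoder.onnx`	Original encoder ONNX model
`decoder.onnx`	Original decoder ONNX model
`ctc.onnx`	Original CTC ONNX model
`tokens.txt`	Vocabulary (5,538 tokens)
`configuration.json`	Model metadata
`test_wavs/`	Test audio samples (4 WAV files)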

Citation

@inproceedings{wenet2021,
  title={WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit},
  author={Yao, Zhuoyuan and Wu, Di and Wang, Xiong and Zhang, Binbin and Yu, Fan and Yang, Chao and Peng, Zhendong and Chen, Xiaoyu and Xie, Lei and Lei, Xin},
  booktitle={Proc. Interspeech},
  year={2021}
}

License

This model is released under the Apache License 2.0.