weixin_62994174/ofa_mmspeech_asr_aishell1_base_zh

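OFA-MMSpeech ASR (AIShell1 Base) - Ascend NPU Adaptation

Model Introduction

MMSpeech is a speech pre-training model developed in-house by DAMO Academy. Built on the OFA architecture, it makes full use of unlabeled text to substantially reduce the character error rate (CER). On the standard AIShell1 benchmark, it lowers CER on the dev/test sets by a relative 48.3%/42.4%, reaching 1.6%/1.9% against the previous SOTA of 3.1%/3.3%.

This repository is the Ascend NPU adaptation of iic/ofa_mmspeech_asr_aishell1_base_zh. Its inference accuracy and performance have been validated on the Huawei Ascend 910 NPU.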

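Item	Details
Original model	iic/ofa_mmspeech_asr_aishell1_base_zh
Model type	Non-autoregressive ASR (Encoder-Decoder)
Parameters	210M
Pre-training data	AISHELL-2 (unlabeled speech) + M6-Corpus (unlabeled text)
Fine-tuning data	AISHELL-1 (labeled data)
Target hardware	Huawei Ascend 910 NPU
Framework	PyTorch + torch_npu
Accuracy	Identical to the CPU baseline (CER = 0.0%)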

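NPU Adaptation Notes

What Was Adapted

Weight conversion: converts the separate Q/K/V attention weights of the HuggingFace format into the merged in_proj format required by PyTorch MultiheadAttention
Audio preprocessing: implements Cepstral Mean and Variance Normalization (CMVN) to match the fairseq preprocessing pipeline
Vocabulary alignment: adds the MMSpeech-specific <blank> token and 30000 <audio_N> tokens so that the tokenizer vocabulary matches the model output dimension (51134)
Decoding constraint: sets constraint_range to restrict decoder output to the text vocabulary range (4-21134), which prevents audio-code tokens from being generated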
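
The preprocessing and decoding items above can be illustrated with a minimal, dependency-free sketch. It is not the code shipped in this repository. It assumes per-utterance CMVN statistics (the source does not say whether statistics are per-utterance or global) and treats constraint_range as the half-open interval [4, 21134). The function names `cmvn` and `constrain_to_text` are hypothetical.

```python
import math
from statistics import fmean, pstdev

VOCAB_SIZE = 51134                 # model output dimension stated above
TEXT_START, TEXT_END = 4, 21134    # constraint_range, assumed half-open here


def cmvn(frames, eps=1e-10):
    """Per-utterance CMVN: zero mean and unit variance per feature dimension."""
    columns = list(zip(*frames))                        # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [max(pstdev(col), eps) for col in columns]   # eps guards against zero variance
    return [
        [(x - m) / s for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


def constrain_to_text(logits, start=TEXT_START, end=TEXT_END):
    """Set scores outside [start, end) to -inf so only text tokens can be selected."""
    return [
        score if start <= idx < end else -math.inf
        for idx, score in enumerate(logits)
    ]
```

With real tensors the same masking is usually done by slicing the logits rather than looping, but the effect is the same: an audio-code index can never win the argmax.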

Dependencies

CANN >= 8.5.1
torch_npu >= 2.9.0
Python >= 3.10
modelscope >= 1.35.0
torchaudio >= 2.9.0
librosa >= 0.9.0, < 0.10.0

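Accuracy Validation

Test Environment

Item	Configuration
NPU model	Ascend 910 × 2
CANN version	8.5.1
torch_npu version	2.9.0.post1
Python version	3.11.14

Validation Results

Metric	CPU	NPU	Result
Output text	甚至出现交易几乎停滞的情况	甚至出现交易几乎停滞的情况	Identical
First-inference latency	5.548s	9.685s	-
CER (character error rate)	-	0.0%	✅ Pass (< 1%)
Exact match	-	True	✅ Pass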
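
For readers who want to reproduce the CER and exact-match rows, the following is a minimal sketch of a character-level CER based on Levenshtein distance. It assumes the CPU output serves as the reference and the NPU output as the hypothesis. How `benchmark.py` actually computes the metric is not shown in this README, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a reference character
                curr[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),   # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


cpu_text = "甚至出现交易几乎停滞的情况"
npu_text = "甚至出现交易几乎停滞的情况"
print(f"CER = {cer(cpu_text, npu_text):.1%}, exact match = {cpu_text == npu_text}")
# prints: CER = 0.0%, exact match = True
```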

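Performance Benchmark

NPU inference performance (10 runs after warm-up)

Metric	Value
Mean latency	0.2284s
P50 latency	0.2273s
P95 latency	0.2359s
Standard deviation	0.0056s
Min latency	0.2197s
Max latency	0.2360s
Output consistency	10/10 identical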
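
The statistics in this table can be derived from raw timings along the following lines. This is a sketch under assumptions: the percentile interpolation (linear, inclusive) and the sample standard deviation are choices made here, not documented behavior of `benchmark.py`, and `benchmark` and `summarize` are hypothetical helper names.

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=10):
    """Call fn a few times to warm up, then return the latencies of `runs` timed calls."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Mean, P50, P95, standard deviation, min and max of a list of latencies."""
    ordered = sorted(latencies)
    # quantiles(n=100) returns 99 cut points; index 49 is P50 and index 94 is P95.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
        "std": statistics.stdev(ordered),   # sample standard deviation (assumption)
        "min": ordered[0],
        "max": ordered[-1],
    }
```

Passing a closure such as `lambda: asr.infer_with_timing("test_audio.wav")` to `benchmark` would time the model call end to end. The warm-up matters here because the first inference is far slower than the steady state, as the validation table shows.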

Usage

1. Environment setup

pip install "modelscope>=1.35.0" "torchaudio>=2.9.0" "librosa>=0.9.0,<0.10.0" "torch_npu>=2.9.0"

2. Download the model

pip install modelscope
modelscope download --model iic/ofa_mmspeech_asr_aishell1_base_zh --local_dir ./ofa_mmspeech_asr_aishell1_base_zh

3. Convert the weights

cd ofa_mmspeech_asr_aishell1_base_zh
python3 convert_weights.py
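
To make the conversion step concrete, here is a rough sketch of a Q/K/V to in_proj merge. It is not the contents of `convert_weights.py`. The source key names (`q_proj`, `k_proj`, `v_proj`) and the layer prefix are assumptions for illustration. The target names `in_proj_weight` and `in_proj_bias`, and the q, k, v stacking order, follow `torch.nn.MultiheadAttention`. Plain Python lists stand in for tensors so the example runs without PyTorch.

```python
def concat_dim0(parts):
    """Stand-in for torch.cat(parts, dim=0) on plain Python lists."""
    return [row for part in parts for row in part]


def merge_qkv(state_dict, prefix):
    """Fuse the q/k/v projections under `prefix` into in_proj_weight and in_proj_bias."""
    merged = dict(state_dict)   # shallow copy so the input dict is left untouched
    for kind, fused_name in (("weight", "in_proj_weight"), ("bias", "in_proj_bias")):
        # Remove the three separate entries, keeping them in q, k, v order.
        parts = [merged.pop(f"{prefix}.{p}_proj.{kind}") for p in ("q", "k", "v")]
        merged[f"{prefix}.{fused_name}"] = concat_dim0(parts)
    return merged


# Toy 2x2 projections for a hypothetical layer prefix.
prefix = "encoder.layers.0.self_attn"
state = {
    f"{prefix}.q_proj.weight": [[1, 0], [0, 1]],
    f"{prefix}.k_proj.weight": [[2, 0], [0, 2]],
    f"{prefix}.v_proj.weight": [[3, 0], [0, 3]],
    f"{prefix}.q_proj.bias": [1, 1],
    f"{prefix}.k_proj.bias": [2, 2],
    f"{prefix}.v_proj.bias": [3, 3],
    f"{prefix}.out_proj.weight": [[9, 9], [9, 9]],
}
converted = merge_qkv(state, prefix)
```

With real checkpoints, `concat_dim0` would be replaced by `torch.cat(parts, dim=0)`, which turns three (embed_dim, embed_dim) matrices into one (3 * embed_dim, embed_dim) matrix.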

4. CPU inference

python3 inference.py --device cpu --audio test_audio.wav

5. NPU inference

python3 inference.py --device npu --audio test_audio.wav

6. CPU vs NPU accuracy comparison

python3 benchmark.py --compare --num_runs 10

API Example

from inference import OFAMMSpeechASR

# CPU inference
asr = OFAMMSpeechASR("ofa_mmspeech_asr_aishell1_base_zh", device="cpu")
text, time_used = asr.infer_with_timing("test_audio.wav")
print(f"CPU: {text} ({time_used:.3f}s)")

# NPU inference
asr = OFAMMSpeechASR("ofa_mmspeech_asr_aishell1_base_zh", device="npu")
text, time_used = asr.infer_with_timing("test_audio.wav")
print(f"NPU: {text} ({time_used:.3f}s)")

Test Audio

The test audio is the official ASR example provided by ModelScope:

https://modelscope.oss-cn-beijing.aliyuncs.com/demo/audios/asr_example_ofa.wav

Expected output: 甚至出现交易几乎停滞的情况

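Files

File	Description
`inference.py`	NPU/CPU inference script; supports single-run inference and performance benchmarking
`benchmark.py`	Accuracy comparison and performance evaluation script
`convert_weights.py`	Weight format conversion script (QKV → in_proj)
`benchmark_results.json`	Evaluation results (JSON format)
`pytorch_model_converted.bin`	Converted model weights
`test_audio.wav`	Test audio file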

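Evaluation

Results reported for the original model on the AISHELL-1 dev/test sets (character error rate, %):

Model	dev(w/o LM)	dev(with LM)	test(w/o LM)	test(with LM)
MMSpeech-Base-aishell1	2.4	2.1	2.6	2.3

Related Models

Model	Model Size	Pre-training	Fine-tuning
MMSpeech-Base	210M	Pre-trained Base	AIShell1 fine-tuned Base
MMSpeech-Large	609M	Pre-trained Large	AIShell1 fine-tuned Large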

Citation

If you find OFA-MMSpeech useful, please consider citing:

@article{zhou2022mmspeech,
  author    = {Zhou, Xiaohuan and
               Wang, Jiaming and
               Cui, Zeyu and
               Zhang, Shiliang and
               Yan, Zhijie and
               Zhou, Jingren and
               Zhou, Chang},
  title     = {MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition},
  journal   = {arXiv preprint arXiv:2212.00500},
  year      = {2022}
}

@article{wang2022ofa,
  author    = {Peng Wang and
               An Yang and
               Rui Men and
               Junyang Lin and
               Shuai Bai and
               Zhikang Li and
               Jianxin Ma and
               Chang Zhou and
               Jingren Zhou and
               Hongxia Yang},
  title     = {OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence
               Learning Framework},
  journal   = {CoRR},
  volume    = {abs/2202.03052},
  year      = {2022}
}

License

Apache License 2.0
