MedASR on Ascend NPU

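1. Introduction

This document records the adaptation, deployment, and verification of MedASR on Huawei Ascend NPUs. MedASR is a Conformer-based medical automatic speech recognition (ASR) model released by Google, pretrained specifically for medical dictation such as radiology reports.

It describes a complete NPU adaptation approach implemented as a runtime monkey-patch, so the original library code does not need to be modified. On the Ascend NPU, the model's logits match the CPU reference with a mean relative error below 1%, and the transcribed text is identical.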

2. Verification environment

Component	Version
`transformers`	`5.0.0`
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`librosa`	`0.11.0`
`soundfile`	`0.13.1`

NPU: 2 logical devices (Ascend910)
Model path: /opt/atomgit/medasr/weights
Operating system: Linux 5.10.0
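
To compare a local setup against the versions listed above, a small check along the following lines may help. This is a minimal sketch: the helper name `installed_versions` is hypothetical, and it uses only the standard-library `importlib.metadata`, so it reports what pip has recorded and does not probe the NPU driver or CANN.

```python
from importlib import metadata

# Versions recorded in the verification environment table above.
EXPECTED = {
    "transformers": "5.0.0",
    "torch": "2.9.0+cpu",
    "torch-npu": "2.9.0.post1+gitee7ba04",
    "librosa": "0.11.0",
    "soundfile": "0.13.1",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        status = "ok" if version == EXPECTED[name] else "differs"
        print(f"{name}: installed={version} expected={EXPECTED[name]} ({status})")
```

A version that differs is not necessarily a problem; the table only records the combination that was verified.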

3. Environment setup

3.1 Install dependencies

pip install transformers==5.0.0 torch librosa soundfile
# torch-npu must match the installed CANN version; it is preinstalled in this environment

3.2 Download the model weights

# Option 1: ModelScope
modelscope download --model google/medasr -d ./weights

# Option 2: AtomGit
python3 -m atomgit download hf_mirrors/google/medasr -d ./weights

4. Inference verification

4.1 Quick inference

python3 inference.py --model_path ./weights --audio ./weights/test_audio.wav --device npu

4.2 Inference example code

import torch
import librosa
from transformers import AutoModelForCTC, AutoProcessor

model_path = "./weights"
audio_path = "./weights/test_audio.wav"

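# torch.npu exists only when torch-npu is installed, so guard the attribute lookup
device = "npu" if hasattr(torch, "npu") and torch.npu.is_available() else "cpu"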
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCTC.from_pretrained(model_path).to(device)
model.eval()

speech, sr = librosa.load(audio_path, sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)[0].tolist()

# CTC decoding: merge repeated tokens and drop blanks (blank id is taken to be 0 here)
prev = None
ctc_ids = []
for pid in predicted_ids:
    if pid != 0 and pid != prev:
        ctc_ids.append(pid)
    prev = pid

transcription = processor.tokenizer.decode(ctc_ids, skip_special_tokens=True)
print(transcription)
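
The CTC post-processing loop in the example above can be factored into a small helper. The sketch below is an illustration only: `ctc_greedy_collapse` is a hypothetical name (the `ctc_decode()` function in inference.py is not reproduced here), and the default `blank_id=0` simply mirrors the hard-coded `0` in the example, which is assumed to be the blank index of this tokenizer.

```python
from typing import Iterable, List


def ctc_greedy_collapse(ids: Iterable[int], blank_id: int = 0) -> List[int]:
    """Collapse a frame-level argmax sequence: merge adjacent repeats, drop blanks."""
    collapsed = []
    prev = None
    for token_id in ids:
        # Keep a token only if it is not blank and differs from the previous frame.
        if token_id != blank_id and token_id != prev:
            collapsed.append(token_id)
        prev = token_id
    return collapsed


# A blank between two identical tokens keeps both; adjacent repeats are merged.
frames = [0, 5, 5, 0, 5, 7, 7, 0]
print(ctc_greedy_collapse(frames))  # [5, 5, 7]
```

The result can then be passed to `processor.tokenizer.decode(...)` exactly as in the example.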

4.3 Sample output

[EXAM TYPE] CT chest PE protocol {period} [INDICATION] 54-year-old female, shortness of breath, evaluate for PE {period} [TECHNIQUE] Standard protocol {period} [FINDINGS] {colon} Pulmonary vasculature {colon} The main PA is patent {period} There are filling defects in the segmental branches of the right lower lobe {comma} compatible with acute PE {period} No saddle embolus {period} Lungs {colon} No pneumothorax {period} Small bilateral effusions {comma} right greater than left {period} {new paragraph} [IMPRESSION] {colon} Acute segmental PE, right lower lobe {period}

5. Performance reference

Test conditions: test_audio.wav (43.80 s of audio); values are the mean over 10 consecutive inference runs.

Metric	Value
`Device`	`npu`
`Audio duration`	`43.80 s`
`Mean latency`	`22.29 ms`
`Median latency`	`22.19 ms`
`Std dev`	`0.27 ms`
`Mean RTF`	`0.0005`
`Throughput`	`44.85 audio/s`

Run the benchmark:

python3 benchmark.py --model_path ./weights --audio ./weights/test_audio.wav --device npu --warmup 3 --iterations 10
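
For reference, the derived figures in the table follow from the latencies and the audio duration: the real-time factor (RTF) is the mean latency divided by the audio duration, and throughput is the number of clips processed per second. The sketch below shows this arithmetic. It is not taken from benchmark.py; the function name `summarize_latencies` and the use of the sample standard deviation are assumptions.

```python
import statistics


def summarize_latencies(latencies_ms, audio_seconds):
    """Summarize per-run latencies (milliseconds) for one audio clip."""
    mean_ms = statistics.mean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(latencies_ms),
        # Sample standard deviation needs at least two measurements.
        "std_ms": statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0,
        # RTF: seconds of compute per second of audio.
        "rtf": (mean_ms / 1000.0) / audio_seconds,
        # Clips processed per second at the mean latency.
        "throughput_per_s": 1000.0 / mean_ms,
    }


# Illustrative latencies whose mean equals the reported 22.29 ms.
summary = summarize_latencies([22.0, 22.29, 22.58], audio_seconds=43.80)
print(f"RTF={summary['rtf']:.4f} throughput={summary['throughput_per_s']:.2f}/s")
```

With a mean latency of 22.29 ms this gives an RTF of about 0.0005 and roughly 44.86 clips per second; the reported 44.85 differs in the last digit, presumably because of rounding or a different averaging order.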

6. Accuracy evaluation

accuracy.py compares the NPU output against the CPU reference output, both logit by logit and as decoded text.

Metric	Value
Comparison	CPU vs NPU
Max absolute error	`8.03e-03`
Mean absolute error	`4.61e-04`
Mean relative error	`1.51e-03`
RMSE	`6.23e-04`
Exact text match	`True`
Word Error Rate	`0.00%`

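Conclusion: the mean relative error between NPU and CPU reference logits is below 1% (1.51e-03, about 0.15%), and the transcribed text is identical.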
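
The logit-level metrics in the table can be computed as sketched below. This is a plain-Python illustration, not the code in accuracy.py: the function name `error_metrics`, the flat-list inputs, and the relative-error definition `|npu - cpu| / (|cpu| + eps)` are assumptions.

```python
import math


def error_metrics(reference, candidate, eps=1e-8):
    """Element-wise error metrics between two equal-length sequences of floats."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(c - r) for r, c in zip(reference, candidate)]
    # Relative error is taken against the reference; eps avoids division by zero.
    rel_errors = [e / (abs(r) + eps) for e, r in zip(abs_errors, reference)]
    n = len(abs_errors)
    return {
        "max_abs": max(abs_errors),
        "mean_abs": sum(abs_errors) / n,
        "mean_rel": sum(rel_errors) / n,
        "rmse": math.sqrt(sum(e * e for e in abs_errors) / n),
    }


metrics = error_metrics([1.0, 2.0, -4.0], [1.1, 2.0, -4.2])
print({name: round(value, 4) for name, value in metrics.items()})
```

For model outputs, the CPU and NPU logits would first be moved to the CPU and flattened, for example with `.flatten().tolist()` on each tensor.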

Run the accuracy check:

python3 accuracy.py --model_path ./weights --audio ./weights/test_audio.wav
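
Word Error Rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is one common way to compute it and is not taken from accuracy.py; `word_error_rate` is a hypothetical name, and whitespace tokenization without any text normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # match or substitution
        prev_row = row
    return prev_row[-1] / len(ref)


print(word_error_rate("no saddle embolus", "no saddle embolus"))  # 0.0
```

Identical CPU and NPU transcripts give a WER of 0, which is consistent with the `0.00%` reported above.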

7. NPU adaptation notes

This project adapts to the NPU through a runtime monkey-patch and does not modify the transformers or torch library code.

Core adaptation logic (see inference.py):

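import torch

def apply_npu_monkey_patch():
    """Alias torch.cuda to torch.npu when an Ascend NPU is available."""
    # torch.npu exists only after torch-npu has registered itself with torch.
    if hasattr(torch, "npu") and torch.npu.is_available():
        torch.cuda = torch.npu
        torch.cuda.is_available = torch.npu.is_available
        if hasattr(torch.npu, "current_device"):
            torch.cuda.current_device = torch.npu.current_device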
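Calling apply_npu_monkey_patch() automatically at script startup redirects code written against torch.cuda to the NPU.

For inference, the model is moved with .to("npu") directly, relying on torch-npu's operator coverage to execute the computation graph on the NPU.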
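
The aliasing mechanism itself is ordinary Python attribute rebinding. The sketch below illustrates it with stand-in modules so that it runs without torch or torch-npu installed; `fake_torch` and `alias_cuda_to_npu` are hypothetical names used only for this illustration.

```python
import types

# Stand-ins for the real packages so the sketch runs anywhere.
fake_torch = types.ModuleType("fake_torch")
fake_torch.cuda = types.ModuleType("fake_torch.cuda")
fake_torch.cuda.is_available = lambda: False
fake_torch.npu = types.ModuleType("fake_torch.npu")
fake_torch.npu.is_available = lambda: True


def alias_cuda_to_npu(torch_like):
    """Rebind torch_like.cuda to torch_like.npu if an NPU is reported.

    Returns True when the alias was applied, False otherwise.
    """
    npu = getattr(torch_like, "npu", None)
    if npu is None or not npu.is_available():
        return False
    torch_like.cuda = npu
    return True


patched = alias_cuda_to_npu(fake_torch)
print(patched, fake_torch.cuda.is_available())  # True True
```

Because the attribute is rebound to the same module object, later `torch.cuda.<name>` lookups resolve in the NPU namespace. Names imported earlier with `from torch.cuda import ...` keep their original bindings, which is one reason to apply the patch at the very start of the script.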

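8. Notes

transformers version: MedASR's model_type is lasr_ctc, which is recognized only by transformers >= 5.0.0. If the installed transformers is older, run pip install transformers==5.0.0.
CTC decoding: MedASR outputs CTC logits, so CTC post-processing (merging repeated tokens and removing the blank token <epsilon>) must be done manually. inference.py wraps this in the ctc_decode() function.
Audio sample rate: the model expects input audio sampled at 16000 Hz; resample before inference with librosa.load(..., sr=16000).
NPU synchronization: NPU execution is asynchronous, so call torch.npu.synchronize() when benchmarking to get accurate timings.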
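
The synchronization note can be illustrated with a small timing helper. This is a sketch and not the code in benchmark.py: `time_call` is a hypothetical name, and the device synchronization function is passed in as a parameter so the helper also runs on machines without an NPU.

```python
import time


def time_call(fn, sync=None, warmup=3, iterations=10):
    """Return per-run latencies in milliseconds for fn().

    sync, if given, is called before starting and after finishing each run so
    that asynchronously queued device work is included in the measurement.
    """
    def run_once():
        if sync is not None:
            sync()  # drain work queued before the timer starts
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return (time.perf_counter() - start) * 1000.0

    for _ in range(warmup):
        run_once()  # warm-up runs are executed but not recorded
    return [run_once() for _ in range(iterations)]


latencies = time_call(lambda: sum(range(1000)), warmup=1, iterations=3)
print(len(latencies))  # 3
```

On the NPU this would be called as `time_call(lambda: model(**inputs), sync=torch.npu.synchronize)` inside `torch.no_grad()`. Without the synchronization calls, the timer may stop before the device has finished the work.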

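9. File list

File	Description
`inference.py`	Inference script with automatic CPU/NPU selection
`benchmark.py`	Performance benchmark script
`accuracy.py`	Accuracy verification script (compares CPU and NPU)
`readme.md`	This document
`logs/inference_npu.log`	Inference run log
`logs/benchmark_npu.log`	Benchmark log
`logs/accuracy_npu.log`	Accuracy verification log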