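Distil-Whisper Large-v3 Ascend Deployment Guide

Overview

This project deploys the Distil-Whisper large-v3 model on Huawei Ascend NPUs and runs automatic speech recognition (ASR) inference with Transformers + torch_npu. Model repository: https://ai.gitcode.com/hf_mirrors/distil-whisper/distil-large-v3. To run inference, place the code from this repository inside the model repository directory.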

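Model Information

Property	Value
Model name	distil-large-v3
Architecture	WhisperForConditionalGeneration
Parameters	756M
Encoder layers	32
Decoder layers	2
Hidden size	1280
Attention heads	20
Audio features	128 mel bins
Maximum audio length	30 s
Precision	FP16
Inference speed	6.3x relative to large-v3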

Environment Requirements

NPU: Atlas 910B3
CANN: 8.5.1+
Python: 3.11
PyTorch: 2.9.0+ with torch_npu
transformers: 4.39+
torchaudio: 2.9.0+ (matching the PyTorch version)
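
The version floors above can be checked from inside the container before running inference. The following is a minimal sketch, not part of this repository: the helper names (`release_tuple`, `meets_minimum`, `check_environment`) are hypothetical, and only the `torch` and `transformers` minimums listed above are encoded.

```python
import re
from importlib import metadata

# Minimum versions taken from the requirements list above.
MINIMUMS = {"torch": "2.9.0", "transformers": "4.39"}


def release_tuple(version):
    """Return the leading numeric release of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings on their numeric release segments only."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that "4.39" equals "4.39.0"
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Map each package to (installed version or None, whether the minimum is met)."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
        else:
            report[package] = (installed, meets_minimum(installed, minimum))
    return report


for package, (installed, ok) in check_environment().items():
    print(f"{package}: installed={installed} ok={ok}")
```

Local build suffixes such as `+cpu` are ignored by the comparison, so a build reported as `2.9.0+cpu` still satisfies the `2.9.0` floor.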

Quick Deployment

1. Create the container

docker run -itd \
  --name=test-distill \
  --privileged \
  --ipc=host \
  --net=host \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci4 \
  --device=/dev/davinci5 \
  --device=/dev/davinci6 \
  --device=/dev/davinci7 \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
  -v /usr/local/sbin:/usr/local/sbin:ro \
  -v /data/ysws/agentsp/distil-large-v3:/data/distil_model \
  -v /home:/home \
  -w /data/distil_model \
  quay.io/ascend/vllm-ascend:v0.18.0rc1 \
  /bin/bash

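2. Run inference

The container mounts the model directory at /data/distil_model, so the commands below use that path and assume inference.py has been placed there, as described in the overview.

Run with randomly generated audio (the default when --audio is omitted):

docker exec test-distill bash -c "source /usr/local/Ascend/ascend-toolkit/set_env.sh && cd /data/distil_model && python3 inference.py --model_path /data/distil_model"

Run on an audio file:

docker exec test-distill bash -c "source /usr/local/Ascend/ascend-toolkit/set_env.sh && cd /data/distil_model && python3 inference.py --model_path /data/distil_model --audio /path/to/audio.wav"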

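Inference Parameters

Parameter	Default	Description
`--model_path`	/data/ysws/agentsp/distil-large-v3	Model directory
`--audio`	None	Path to an audio file; random audio is used if omitted
`--duration`	30	Duration of the random audio, in seconds
`--precision_test`	False	Run the precision test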
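
The table above maps directly onto a command-line parser. This is a sketch of how such a parser could look with `argparse`; it is not the actual source of inference.py, and the function name `build_parser` and the choice of `float` for `--duration` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the four options documented in the table above."""
    parser = argparse.ArgumentParser(
        description="Distil-Whisper large-v3 ASR inference on Ascend NPU"
    )
    parser.add_argument(
        "--model_path",
        default="/data/ysws/agentsp/distil-large-v3",
        help="model directory",
    )
    parser.add_argument(
        "--audio",
        default=None,
        help="path to an audio file; random audio is used if omitted",
    )
    parser.add_argument(
        "--duration",
        type=float,  # assumption: the real script may use int
        default=30,
        help="duration of the random audio, in seconds",
    )
    parser.add_argument(
        "--precision_test",
        action="store_true",  # a flag, so the default is False
        help="run the precision test",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--audio", "/path/to/audio.wav"])
print(args.model_path, args.audio, args.duration, args.precision_test)
```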

Precision Test

docker exec test-distill bash -c "source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
cd /data/distil_model && \
python3 inference.py --model_path /data/distil_model --precision_test"

Precision Test Results

Metric	Measured	Threshold	Status
Max error (sum)	0.00e+00	< 1e-3	Pass
Max error (mean)	0.00e+00	< 1e-5	Pass
Max error (std)	0.00e+00	< 1e-5	Pass
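
This README does not show how inference.py computes these three metrics. One plausible reading, sketched here as an assumption, is that the sum, mean, and standard deviation of each tensor are computed on both devices and the largest absolute difference across all tensors is compared with the threshold. Tensors are represented as flat lists of floats, and all function names are hypothetical.

```python
import math
import statistics

# Thresholds taken from the results table above.
THRESHOLDS = {"sum": 1e-3, "mean": 1e-5, "std": 1e-5}


def summarize(values):
    """Reduce one flattened tensor to its sum, mean, and population std."""
    return {
        "sum": math.fsum(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }


def max_errors(reference_tensors, candidate_tensors):
    """Largest absolute difference of each statistic across all tensor pairs."""
    if len(reference_tensors) != len(candidate_tensors):
        raise ValueError("both sides must contain the same number of tensors")
    worst = {name: 0.0 for name in THRESHOLDS}
    for ref, cand in zip(reference_tensors, candidate_tensors):
        ref_stats, cand_stats = summarize(ref), summarize(cand)
        for name in worst:
            worst[name] = max(worst[name], abs(ref_stats[name] - cand_stats[name]))
    return worst


def passes(errors, thresholds=THRESHOLDS):
    """Per-metric pass/fail using the strict '<' comparison from the table."""
    return {name: errors[name] < thresholds[name] for name in thresholds}


cpu_outputs = [[1.0, 2.0, 3.0], [0.5, 0.25]]
npu_outputs = [[1.0, 2.0, 3.0], [0.5, 0.25]]
errors = max_errors(cpu_outputs, npu_outputs)
print(errors, passes(errors))
```

Note that statistics such as the sum and mean can agree even when individual elements differ, so an element-wise comparison is a stricter check.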

Performance Data

Operation	Time
CPU reference computation (20 tensors)	0.3943 s
NPU inference (20 tensors)	0.2734 s
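
The two timings above correspond to a ratio of roughly 1.44x. When reproducing such measurements, accelerator kernels usually run asynchronously, so the device should be synchronized before the clock is read. The sketch below assumes a caller-supplied `sync` callable (for example `torch.npu.synchronize` when torch_npu is loaded); the helper names `timed` and `speedup` are hypothetical.

```python
import time


def timed(fn, sync=None):
    """Run fn() once and return (result, elapsed seconds).

    sync, if given, is called before starting and before stopping the clock
    so that pending asynchronous device work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


def speedup(reference_seconds, candidate_seconds):
    """How many times faster the candidate is than the reference."""
    return reference_seconds / candidate_seconds


# Ratio of the CPU and NPU timings reported in the table above.
print(f"{speedup(0.3943, 0.2734):.2f}x")
```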

Performance Metrics

Metric	Value
Inference time for 30 s of audio	~7 s
Audio feature dimensions	128 x 3000
Output	Text transcription
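
The 128 x 3000 feature shape follows from the input format: 30 s at 16 kHz gives 480,000 samples, and with the Whisper feature extractor's hop length of 160 samples that yields 3000 frames of 128 mel bins. The snippet below works through this arithmetic and the real-time factor implied by the approximate 7 s figure; the function names are illustrative only.

```python
SAMPLE_RATE = 16_000  # Hz, as passed to the processor
HOP_LENGTH = 160      # samples per frame step in the Whisper feature extractor
N_MELS = 128          # mel bins for large-v3 family models


def feature_shape(duration_s):
    """(mel bins, frames) for a clip of the given length in seconds."""
    n_samples = int(duration_s * SAMPLE_RATE)
    return (N_MELS, n_samples // HOP_LENGTH)


def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_s / audio_s


print(feature_shape(30))
print(round(real_time_factor(7, 30), 2))
```

A real-time factor of about 0.23 means the 30 s clip is transcribed roughly four times faster than it plays.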

Key Configuration

Model loading

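import torch
import torch_npu  # registers the NPU backend so that .npu() is available
from transformers import AutoModelForSpeechSeq2Seq

# model_path is the local model directory (the --model_path argument).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # FP16 weights, as required on the NPU
    low_cpu_mem_usage=True,     # reduce peak host memory while loading
    use_safetensors=True,       # load the .safetensors checkpoint
)
model = model.npu()  # move the model to the NPU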

Audio processing

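# audio is a 1-D float waveform sampled at 16 kHz.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# Move the log-mel features to the NPU and cast to FP16 to match the model weights.
input_features = inputs.input_features.npu().half()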
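
The processor expects a mono waveform at 16 kHz. As a dependency-free sketch of producing one from a file, the function below reads a 16-bit PCM WAV with the standard library and returns floats in [-1, 1). It is an illustration, not the loader used by inference.py; the name `load_wav_mono16k` is hypothetical, and it rejects files that are not already 16 kHz rather than resampling them. A library such as librosa (`librosa.load(path, sr=16000)`) handles resampling and other formats.

```python
import sys
import wave
from array import array

TARGET_RATE = 16_000  # Hz, the sampling rate passed to the processor


def load_wav_mono16k(path):
    """Read a 16-bit PCM WAV at 16 kHz and return mono samples as floats."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != TARGET_RATE:
            raise ValueError(f"expected {TARGET_RATE} Hz, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is supported by this sketch")
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    samples = array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV PCM data is little-endian

    # Average the interleaved channels of each frame and scale to [-1, 1).
    return [
        sum(samples[i:i + channels]) / channels / 32768.0
        for i in range(0, len(samples), channels)
    ]
```

The returned list can be passed as `audio` to the processor call shown above.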

Speech Recognition Example

import numpy as np
import torch
import torch_npu  # registers the NPU backend so that .npu() is available
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3",
    torch_dtype=torch.float16,
)
model = model.npu()

processor = AutoProcessor.from_pretrained("distil-whisper/distil-large-v3")

# Placeholder input: 30 s of random noise at 16 kHz. Replace with a real 1-D float waveform.
audio = np.random.randn(16000 * 30).astype(np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.npu().half()

# Generate token ids without tracking gradients.
with torch.no_grad():
    generated_ids = model.generate(input_features)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

File Structure

distil-large-v3-ascend/
├── README.md       # This document
├── inference.py    # Inference script
└── log.txt         # Run log

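Known Limitations

The Whisper model must run in float16 precision on the NPU.
A warning is emitted when no attention_mask is provided; it does not affect functionality.
Installing librosa is recommended for processing real audio files.
Precision test: the NPU and CPU results are identical (error of 0).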
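
The attention_mask warning mentioned above can typically be avoided by asking the processor for a mask and passing it to `generate`. The sketch below assumes the Whisper feature extractor's `return_attention_mask` argument and the `attention_mask` keyword of `generate`; the helper name `transcribe_with_mask` and its `device` parameter are hypothetical, and this has not been verified against inference.py.

```python
def transcribe_with_mask(model, processor, audio, device="npu"):
    """Transcribe one 16 kHz waveform, passing an explicit attention mask.

    model and processor are the objects loaded as in the example above;
    device="npu" assumes torch_npu has been imported.
    """
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt",
        return_attention_mask=True,  # ask the feature extractor for the mask
    )
    # Features are cast to FP16 to match the model; the mask keeps its dtype.
    input_features = inputs.input_features.to(device).half()
    attention_mask = inputs.attention_mask.to(device)

    generated_ids = model.generate(input_features, attention_mask=attention_mask)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Usage, given `model`, `processor`, and `audio` from the speech recognition example: `print(transcribe_with_mask(model, processor, audio))`.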
