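This document records the adaptation and validation results for openai/whisper-medium on the Huawei Ascend910B4 NPU.
whisper-medium is OpenAI's automatic speech recognition (ASR) model, built on an encoder-decoder transformer architecture.
Because whisper-medium is an encoder-decoder model, which vLLM does not support, inference runs directly on transformers + torch_npu.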
Related download links:
| Component | Version |
|---|---|
| NPU | Ascend910B4 |
| torch | 2.9.0 |
| torch_npu | 2.9.0.post1+gitee7ba04 |
| transformers | 4.57.6 |
| CANN | 8.5.1 |
The test used 1 logical NPU card, with the model stored at `/opt/atomgit/models/whisper-medium`.

Download the model:

    # Option 1: HuggingFace mirror
    HF_ENDPOINT=https://hf-mirror.com huggingface-cli download openai/whisper-medium --local-dir ./whisper-medium
    # Option 2: ModelScope
    python3 -c "from modelscope.hub.snapshot_download import snapshot_download; snapshot_download('openai/whisper-medium', cache_dir='./whisper-medium')"

Run inference:

    python3 inference.py /path/to/audio.wav

Inference code:

    import torch
    import torch_npu  # registers the "npu" device with PyTorch
    import soundfile as sf
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    device = torch.device("npu:0")
    processor = WhisperProcessor.from_pretrained("/path/to/whisper-medium")
    model = WhisperForConditionalGeneration.from_pretrained(
        "/path/to/whisper-medium", dtype=torch.float32,
    ).to(device)
    model.eval()

    # Load audio (16kHz mono WAV recommended)
    audio, sr = sf.read("audio.wav")
    if audio.ndim > 1:
        # Multi-channel input: average the channels down to mono
        audio = audio.mean(axis=1)
    if sr != 16000:
        # Whisper expects 16 kHz input; resample anything else
        import librosa
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    # Convert the waveform to log-mel input features and move them to the NPU
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)

    with torch.no_grad():
        generated = model.generate(input_features, language="english", task="transcribe")

    transcription = processor.batch_decode(generated, skip_special_tokens=True)[0]
    print(transcription)

Key notes:
For both float32 and float16 precision, the same input was run on CPU and on NPU with transformers + torch_npu, and the output logits and generated results were compared.
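| Metric | Value |
|---|---|
| Max probability difference | 0.31% |
| Logits cosine similarity | 0.99999648 |
| Token prediction match rate | 100% |
| Generated sequence exact match | ✅ |
| Conclusion | PASS (probability difference < 1%) |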
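The metrics in the accuracy table can be computed along the following lines. This is a minimal sketch, not the contents of `accuracy_run.py`: the function names, the use of softmax probabilities for the "max probability difference", and the flattening of all logits into one vector for the cosine similarity are assumptions made for illustration. It uses plain Python lists so that it runs without torch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_logits(cpu_logits, npu_logits, threshold=0.01):
    """Compare per-step logit vectors from two devices.

    cpu_logits / npu_logits: lists of logit vectors, one per decoding step.
    threshold: assumed pass criterion on the max probability difference
    (0.01 mirrors the "< 1%" conclusion in the table).
    """
    max_prob_diff = 0.0
    matches = 0
    for cpu_step, npu_step in zip(cpu_logits, npu_logits):
        cpu_probs, npu_probs = softmax(cpu_step), softmax(npu_step)
        step_diff = max(abs(c - n) for c, n in zip(cpu_probs, npu_probs))
        max_prob_diff = max(max_prob_diff, step_diff)
        # Greedy token prediction is the argmax of each logit vector
        if cpu_step.index(max(cpu_step)) == npu_step.index(max(npu_step)):
            matches += 1

    flat_cpu = [x for step in cpu_logits for x in step]
    flat_npu = [x for step in npu_logits for x in step]
    return {
        "max_prob_diff": max_prob_diff,
        "cosine_similarity": cosine_similarity(flat_cpu, flat_npu),
        "token_match_rate": matches / len(cpu_logits),
        "passed": max_prob_diff < threshold,
    }


if __name__ == "__main__":
    # Toy logits standing in for real CPU and NPU model outputs
    cpu = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
    npu = [[2.001, 0.999, 0.1], [0.5, 2.499, 0.301]]
    print(compare_logits(cpu, npu))
```

With real model outputs, each tensor of logits would first be moved to the CPU and converted to nested lists (or the same arithmetic would be done directly with tensor operations).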
    python3 accuracy_run.py

Test conditions: 4 s audio input, single request, batch_size=1; statistics are taken over 20 consecutive inference runs.
| Metric | Value |
|---|---|
| Encoder latency | 37.4 ms |
| End-to-end latency (mean) | 185.6 ms |
| End-to-end latency (P50) | 183.9 ms |
| End-to-end latency (P99) | 204.6 ms |
| Real-time factor (RTF) | 0.0464 |
| Speedup (vs CPU) | ~21x |
| Peak memory | ~3.0 GB |
| NPU HBM usage | ~2.8 GB / 32 GB |
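The latency statistics and the real-time factor follow from the per-run timings. The sketch below shows one way to derive them; it is not taken from `accuracy_run_perf.py`, and the helper names and the nearest-rank percentile method are assumptions. RTF is taken as mean end-to-end latency divided by audio duration, which is consistent with the reported figures.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms, audio_seconds):
    """Summarize end-to-end latencies (in ms) for a clip of the given length (in s)."""
    mean_ms = statistics.fmean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        # RTF < 1 means the audio is processed faster than real time
        "rtf": (mean_ms / 1000.0) / audio_seconds,
    }


if __name__ == "__main__":
    # Placeholder timings; a real run would collect one value per inference
    print(summarize([185.6] * 20, audio_seconds=4.0))
```

For the numbers in the table, a mean latency of 185.6 ms on 4.00 s of audio gives 0.1856 / 4.00 = 0.0464.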
    python3 accuracy_run_perf.py

Accuracy output:

    CPU Transcription: oh
    NPU Transcription: oh
    Max prob diff: 0.3107%
    Cosine similarity: 0.99999648
    Token match: True

Performance output:

    Audio duration: 4.00s
    Encoder latency: 37.4ms
    End-to-end latency: 181.7ms
    Mean latency (20 runs): 185.6ms
    RTF: 0.0464
    Memory: ~3.0 GB RSS

`--dtype float16`: half-precision inference.
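A `--dtype` option like the one above could be wired into a script roughly as follows. This is a hypothetical sketch: the actual argument parsing of the scripts is not shown in this document, so the parser layout, the `audio` positional argument, and the float32 default (chosen to match the inference code) are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical CLI: one audio path plus an optional precision switch."""
    parser = argparse.ArgumentParser(description="Whisper inference on Ascend NPU (sketch)")
    parser.add_argument("audio", help="path to the input WAV file")
    parser.add_argument(
        "--dtype",
        choices=["float32", "float16"],
        default="float32",  # assumed default, matching the float32 inference code
        help="model precision; float16 selects half-precision inference",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # In a real script the name would be mapped to a torch dtype, for example
    # getattr(torch, args.dtype), and passed to from_pretrained(..., dtype=...).
    print(f"audio={args.audio} dtype={args.dtype}")
```

Restricting `choices` to the two precisions validated here makes an unsupported value fail at argument parsing rather than during model loading.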