distil-large-v3 on Ascend NPU

1. Introduction

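This document records the adaptation, inference, and accuracy verification results for distil-whisper/distil-large-v3 on Huawei Ascend NPUs (Ascend 910B4).

distil-large-v3 is a speech recognition model distilled from OpenAI Whisper large-v3. It uses an encoder-decoder Seq2Seq architecture with about 756M parameters. The adaptation in this verification relies on stock transformers plus torch_npu: no changes to the model structure and no custom operators are required, and inference runs once the model is loaded onto the NPU device.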

2. Verification Environment

Component	Version
CANN	`8.5.1`
PyTorch	`2.9.0+cpu`
torch-npu	`2.9.0.post1+gitee7ba04`
transformers	`4.57.6`
librosa	`0.11.0`
numpy	`1.26.4`

NPU: 1 logical device (Ascend 910B4, 32 GB HBM)
Model path: /opt/atomgit/distil-large-v3-npu-adaptation/model_cache

3. Model Download

Direct access to the official Hugging Face Hub is restricted, so downloading the weights through hf-mirror.com or ModelScope is recommended:

# Option 1: hf-mirror.com
export HF_ENDPOINT=https://hf-mirror.com
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='distil-whisper/distil-large-v3',
    local_dir='./model_cache',
    local_dir_use_symlinks=False,
)
"

4. Inference Scripts

4.1 Environment Check

# Check NPU availability
python -c "import torch; import torch_npu; print(torch.npu.is_available())"

# Show NPU device information
npu-smi info
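
The same availability check can be done from Python so that a script degrades gracefully when torch_npu is missing. The following is a minimal sketch; the helper name pick_device is hypothetical and not part of inference.py.

```python
import importlib.util


def pick_device():
    """Return "npu" if torch_npu is importable and an NPU is visible, else "cpu"."""
    # find_spec avoids importing torch_npu on machines where it is not installed.
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the torch.npu namespace)
    except ImportError:
        # The package is present but cannot be loaded, for example without CANN.
        return "cpu"
    return "npu" if torch.npu.is_available() else "cpu"


print(pick_device())
```

The returned string could be passed to the --device flag shown in the sections below.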

4.2 CPU Baseline Inference

python inference.py \
  --device cpu \
  --dtype float32 \
  --output results_cpu_float32.pkl

4.3 NPU Inference

# float16 (recommended: lower memory use, speed comparable to float32)
python inference.py \
  --device npu \
  --dtype float16 \
  --output results_npu_float16.pkl

# float32 (highest precision)
python inference.py \
  --device npu \
  --dtype float32 \
  --output results_npu_float32.pkl

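The script automatically performs the following steps:

Generate reproducible synthetic test audio (5 seconds, mixed sine waves plus noise)
Extract the log-mel spectrogram with AutoProcessor
Run the encoder forward pass (outputs the last hidden states)
Generate text with decoder generate() using greedy decoding
Extract the logits of the first decoder step for fine-grained comparison
Save the results to a pickle file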
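
The steps above can be sketched roughly as follows. This is not the actual inference.py: the function names, tone frequencies, noise level, and max_new_tokens value are assumptions chosen for illustration, and the first-step logits extraction and pickle output are omitted for brevity. Heavy imports are deferred so that the audio helper can be used without PyTorch installed.

```python
import math
import random

SAMPLE_RATE = 16_000  # Whisper-family models expect 16 kHz mono audio


def make_test_audio(seconds=5.0, seed=0):
    """Return a reproducible mix of sine tones plus low-level Gaussian noise."""
    rng = random.Random(seed)  # fixed seed keeps the noise identical across runs
    tones = ((220.0, 0.4), (440.0, 0.3), (880.0, 0.2))  # (Hz, amplitude), assumed
    n = int(seconds * SAMPLE_RATE)
    return [
        sum(a * math.sin(2.0 * math.pi * f * i / SAMPLE_RATE) for f, a in tones)
        + rng.gauss(0.0, 0.01)
        for i in range(n)
    ]


def run_inference(audio, model_dir, device="npu", dtype_name="float16"):
    """Encoder forward pass plus greedy generate() on the chosen device."""
    import numpy as np
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    if device == "npu":
        import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)

    dtype = getattr(torch, dtype_name)
    processor = AutoProcessor.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_dir, torch_dtype=dtype, local_files_only=True
    ).to(device).eval()

    # The processor turns raw samples into padded log-mel input_features.
    inputs = processor(
        np.asarray(audio, dtype=np.float32),
        sampling_rate=SAMPLE_RATE,
        return_tensors="pt",
    )
    features = inputs.input_features.to(device=device, dtype=dtype)

    with torch.no_grad():
        hidden = model.get_encoder()(features).last_hidden_state
        token_ids = model.generate(
            features, do_sample=False, num_beams=1, max_new_tokens=64
        )

    return {
        "encoder_hidden": hidden.float().cpu().numpy(),
        "token_ids": token_ids.cpu().tolist(),
        "text": processor.batch_decode(token_ids, skip_special_tokens=True)[0],
    }
```

A call such as run_inference(make_test_audio(), "./model_cache", device="cpu", dtype_name="float32") would correspond to the CPU baseline run.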

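5. Accuracy Verification

5.1 Evaluation Criteria

Accuracy is evaluated along several dimensions, so that near-zero values in the intermediate encoder features do not dominate the verdict by inflating the relative error: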

Dimension	Metric	Threshold
Decoder logits	Relative error	`< 1%`
Encoder hidden states	Cosine similarity	`> 0.999`
Generated output	Exact token match	Must match
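
As an illustration of how these three criteria could be checked, here is a dependency-free sketch. The element-wise mean definition of relative error and all function names are assumptions; compare.py may use a different formula (for example a norm-based one) and would normally operate on arrays rather than Python lists.

```python
import math

LOGITS_REL_ERR_MAX = 0.01  # threshold "< 1%"
ENCODER_COS_MIN = 0.999    # threshold "> 0.999"


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_relative_error(ref, test, eps=1e-12):
    """Mean of |test - ref| / (|ref| + eps); this exact formula is an assumption."""
    return sum(abs(t - r) / (abs(r) + eps) for r, t in zip(ref, test)) / len(ref)


def evaluate(ref, test):
    """Apply the three checks; ref and test are dicts of flat Python lists."""
    checks = {
        "decoder_logits": mean_relative_error(ref["logits"], test["logits"])
        < LOGITS_REL_ERR_MAX,
        "encoder_hidden": cosine_similarity(ref["encoder"], test["encoder"])
        > ENCODER_COS_MIN,
        "tokens": ref["tokens"] == test["tokens"],
    }
    # The overall verdict passes only if every individual check passes.
    checks["overall"] = all(checks.values())
    return checks
```

Using cosine similarity for the encoder output keeps the check insensitive to individual near-zero elements, while the token comparison guards the end result that matters for ASR.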

5.2 Verification Method

python compare.py \
  --cpu results_cpu_float32.pkl \
  --npu results_npu_float16.pkl

5.3 Verification Results

NPU float32 vs CPU float32

Metric	Value	Threshold	Result
Decoder logits relative error	0.013%	< 1%	PASS
Encoder cosine similarity	0.999875	> 0.999	PASS
Generated token consistency	Identical	Identical	PASS
Overall verdict	-	-	PASS

NPU float16 vs CPU float32

Metric	Value	Threshold	Result
Decoder logits relative error	0.040%	< 1%	PASS
Encoder cosine similarity	0.999870	> 0.999	PASS
Generated token consistency	Identical	Identical	PASS
Overall verdict	-	-	PASS

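5.4 Error Analysis

The raw relative error of the encoder hidden states (including near-zero values) is about 10% to 11%, but drops to around 1.5% once elements with |ref| < 0.1 are excluded.
The encoder's mean absolute error is small (mean diff ≈ 0.005, max diff ≈ 2.4) and the cosine similarity is > 0.999, which indicates that the feature directions are highly consistent.
The decoder logits agree to within 0.04% relative error and the generated tokens are identical, so NPU inference reproduces the CPU reference on this ASR test case.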
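
To see why near-zero reference values inflate the raw relative error, consider the small sketch below. The numbers are made up for illustration and are not the measured hidden states; the function name and the element-wise mean formula are likewise assumptions.

```python
def mean_relative_error(ref, test, min_abs_ref=0.0):
    """Mean |test - ref| / |ref| over elements with |ref| >= min_abs_ref.

    Exact zeros in ref are always skipped to avoid division by zero.
    """
    kept = [
        (r, t) for r, t in zip(ref, test) if abs(r) >= min_abs_ref and r != 0.0
    ]
    if not kept:
        raise ValueError("no reference values at or above the magnitude threshold")
    return sum(abs(t - r) / abs(r) for r, t in kept) / len(kept)


# Made-up values: the first element is near zero, the others are ordinary.
ref = [0.001, 0.5, -2.0, 4.0]
test = [0.002, 0.505, -2.02, 4.04]

raw = mean_relative_error(ref, test)                      # dominated by the 0.001 entry
masked = mean_relative_error(ref, test, min_abs_ref=0.1)  # near-zero entry excluded
print(f"raw={raw:.2%} masked={masked:.2%}")
```

Here an absolute difference of only 0.001 on the near-zero element counts as a 100% relative error and pulls the raw mean to roughly 26%, whereas the masked mean is about 1%. The same mechanism explains the gap between the raw and masked encoder figures reported above.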

6. Performance Reference

Test conditions: a single 5-second synthetic audio clip, batch_size=1.

Device	Precision	Encoder time	Generate time	Total time
CPU	float32	126.2 s	127.8 s	~254 s
NPU	float32	18.9 s	0.39 s	~19.3 s
NPU	float16	20.2 s	0.36 s	~20.6 s

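Compared with CPU float32, NPU float32 speeds up the encoder by about 6.7x and Generate by about 328x (about 355x with float16).
NPU float16 performs close to float32 but uses less device memory (~1.5GB vs ~3GB).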
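
The speedup ratios follow directly from the timing table, and the memory figures are consistent with the size of the weights alone. The short calculation below is a sanity check under that assumption (756M parameters at 2 or 4 bytes each, ignoring activations and runtime overhead); the variable and function names are illustrative.

```python
# (encoder seconds, generate seconds) taken from the timing table
timings = {
    ("cpu", "float32"): (126.2, 127.8),
    ("npu", "float32"): (18.9, 0.39),
    ("npu", "float16"): (20.2, 0.36),
}


def speedup(baseline, target):
    """Return (encoder, generate) speedup of target relative to baseline."""
    return tuple(b / t for b, t in zip(baseline, target))


cpu = timings[("cpu", "float32")]
enc32, gen32 = speedup(cpu, timings[("npu", "float32")])
enc16, gen16 = speedup(cpu, timings[("npu", "float16")])

PARAMS = 756e6  # approximate parameter count stated in the introduction


def weight_gb(bytes_per_param):
    """Weights-only memory estimate in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9


print(f"float32: encoder {enc32:.1f}x, generate {gen32:.0f}x, weights {weight_gb(4):.1f} GB")
print(f"float16: encoder {enc16:.1f}x, generate {gen16:.0f}x, weights {weight_gb(2):.1f} GB")
```

With a single 5-second clip these ratios are only indicative; they say nothing about throughput at larger batch sizes or longer audio.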

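7. Notes

Weight download: direct connections to the official Hugging Face Hub may time out; using the mirror via HF_ENDPOINT=https://hf-mirror.com is recommended.
Model loading: pass the local model directory to AutoModelForSpeechSeq2Seq.from_pretrained, or enable offline mode (for example local_files_only=True or HF_HUB_OFFLINE=1), to avoid automatic online checks.
Device migration: on the NPU, model.to("npu") must be called explicitly, and input_features must be moved to the same device.
Precision mode: float16 has been verified to meet the accuracy requirements on the NPU and is recommended for production deployment to save device memory.
Log directory: if the warning "can not create directory, directory: /home/atomgit/ascend/log" appears, it can be ignored, or the directory can be created manually.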
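
The offline-mode and log-directory notes can be handled in a small setup step before the model libraries are imported. This is a sketch under assumptions: the helper name prepare_offline_env is hypothetical, and HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE are the standard Hugging Face offline switches, which generally need to be set before transformers or huggingface_hub is imported.

```python
import os


def prepare_offline_env(log_dir):
    """Enable Hugging Face offline mode and make sure the log directory exists."""
    # Offline switches: stop transformers / huggingface_hub from contacting the Hub.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    # Creating the directory up front avoids the "can not create directory" warning.
    os.makedirs(log_dir, exist_ok=True)
    return log_dir
```

For the environment described here, the call would be prepare_offline_env("/home/atomgit/ascend/log"), made at the top of the script before importing transformers.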