Delicate02/WeSpeaker-ResNet34-LM-MLX-Ascend

WeSpeaker-ResNet34-LM-MLX on Ascend NPU

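1. Introduction

This document records the adaptation, inference verification, and performance tuning results for WeSpeaker-ResNet34-LM on Huawei Ascend NPUs.

The original model, aufklarer/WeSpeaker-ResNet34-LM-MLX, provides the WeSpeaker ResNet34-LM weights in MLX format (channels-last, with BatchNorm already fused into Conv2d). This document provides:

Weight conversion script: converts the MLX safetensors file into a PyTorch state_dict
PyTorch model implementation: ResNet34 without BatchNorm, plus Statistics Pooling
NPU inference script: end-to-end inference based on torch_npu
Accuracy / performance evaluation: comparison against a CPU baseline, verifying an error below 1%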

Model information:

Parameters: ~6.6M
Input: 16 kHz audio → 80-dim log-mel filterbank → [B, T, 80]
Output: 256-dim L2-normalized speaker embedding

2. Verification environment

Component	Version
CANN	`8.5.1`
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
Python	`3.11`

NPU: 1 logical device (Ascend 910 series)
Model path: /opt/atomgit/WeSpeaker-ResNet34-LM-MLX/pytorch_model.bin

3. Quick start

3.1 Environment setup

pip install torch==2.9.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.9.0.post1
pip install safetensors numpy scipy librosa

For torch-npu installation, refer to the official Ascend documentation.

3.2 Download weights and code

git clone https://gitcode.com/hf_mirrors/aufklarer/WeSpeaker-ResNet34-LM-MLX.git
cd WeSpeaker-ResNet34-LM-MLX

3.3 Weight conversion (MLX → PyTorch)

python3 convert_weights.py \
  --input model.safetensors \
  --output pytorch_model.bin

Conversion notes:

MLX Conv2d weight layout [O, H, W, I] → PyTorch [O, I, H, W] (transpose(0,3,1,2))
Linear weights use the same layout in both frameworks and need no conversion
74 tensors in total, about 26 MB
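
The layout change described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the contents of convert_weights.py: the function names are hypothetical, and treating every 4-D tensor as a Conv2d kernel is an assumption made for the example.

```python
import numpy as np


def mlx_conv_to_torch(weight: np.ndarray) -> np.ndarray:
    """Reorder a Conv2d kernel from MLX [O, H, W, I] to PyTorch [O, I, H, W]."""
    if weight.ndim != 4:
        raise ValueError(f"expected a 4-D conv kernel, got shape {weight.shape}")
    # Axis order (0, 3, 1, 2) selects O, then I, then H, then W from the MLX layout.
    return np.ascontiguousarray(weight.transpose(0, 3, 1, 2))


def convert_tensor(weight: np.ndarray) -> np.ndarray:
    """Assumption: 4-D tensors are conv kernels; Linear weights and biases pass through."""
    return mlx_conv_to_torch(weight) if weight.ndim == 4 else weight


# Hypothetical 3x3 kernel with 1 input channel and 32 output channels.
mlx_kernel = np.arange(32 * 3 * 3 * 1, dtype=np.float32).reshape(32, 3, 3, 1)
torch_kernel = convert_tensor(mlx_kernel)
print(torch_kernel.shape)

# A Linear weight (2-D) is returned unchanged.
linear_weight = np.zeros((256, 5120), dtype=np.float32)
linear_out = convert_tensor(linear_weight)
```

After conversion, element [o, i, h, w] of the PyTorch kernel equals element [o, h, w, i] of the MLX kernel, which is a convenient spot check when validating a converted checkpoint.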

3.4 Inference verification

Random-input demo:

python3 inference.py \
  --checkpoint pytorch_model.bin \
  --config config.json \
  --device npu \
  --warmup 5 \
  --iterations 20

Inference on audio files:

python3 inference.py \
  --checkpoint pytorch_model.bin \
  --config config.json \
  --device npu \
  --audio sample.wav \
  --audio2 sample2.wav

Inference on precomputed fbank features:

python3 inference.py \
  --checkpoint pytorch_model.bin \
  --config config.json \
  --device npu \
  --audio features.npy \
  --fbank

3.5 Accuracy evaluation

python3 benchmark_accuracy.py \
  --checkpoint pytorch_model.bin \
  --config config.json \
  --num_samples 100 \
  --output accuracy_result.json

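Evaluation method:

Randomly generate 100 sets of fbank features (50 to 400 frames each)
Run the same input through inference on both the CPU and the NPU
Compare the resulting embeddings by relative error and cosine similarity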
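
The comparison metrics can be sketched as follows. The exact definitions used by benchmark_accuracy.py are not stated here, so this example assumes the relative error is the L2 norm of the difference divided by the L2 norm of the CPU embedding; the function names and the sample vectors are illustrative only.

```python
import math


def relative_error(reference, candidate):
    """Assumed definition: ||candidate - reference||_2 / ||reference||_2."""
    diff = math.sqrt(sum((c - r) ** 2 for r, c in zip(reference, candidate)))
    norm = math.sqrt(sum(r * r for r in reference))
    return diff / norm


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference."""
    return max(abs(c - r) for r, c in zip(reference, candidate))


# Toy 3-dim "embeddings"; real embeddings are 256-dim.
cpu_emb = [0.6, 0.8, 0.0]
npu_emb = [0.6, 0.8, 0.001]

rel = relative_error(cpu_emb, npu_emb)
cos = cosine_similarity(cpu_emb, npu_emb)
mae = max_abs_error(cpu_emb, npu_emb)
print(f"relative error = {rel:.4%}, cosine = {cos:.6f}, max abs = {mae:.6f}")
```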

3.6 Performance evaluation

python3 benchmark_perf.py \
  --checkpoint pytorch_model.bin \
  --config config.json \
  --device npu \
  --output perf_result_npu.json
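
A latency measurement of this kind typically follows a warmup-then-measure loop. The sketch below is a generic harness, not the code of benchmark_perf.py: the `benchmark` function, its `sync` parameter, the population standard deviation, and the nearest-rank P99 are all assumptions made for illustration.

```python
import math
import statistics
import time


def benchmark(fn, warmup=5, iterations=50, sync=None):
    """Time fn() and return latency statistics in milliseconds.

    sync is an optional callable that blocks until queued device work is done.
    On an NPU this would plausibly be torch.npu.synchronize (assumption).
    """
    for _ in range(warmup):
        fn()  # warmup runs are executed but not recorded
    if sync is not None:
        sync()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # include device execution time, not only the launch
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    ordered = sorted(latencies_ms)
    p99_rank = max(1, math.ceil(0.99 * len(ordered)))  # nearest-rank percentile
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "std_ms": statistics.pstdev(latencies_ms),
        "p99_ms": ordered[p99_rank - 1],
    }


# Stand-in workload; a real run would call the model's forward pass instead.
stats = benchmark(lambda: sum(range(10_000)), warmup=2, iterations=20)
print(stats)
```

Without a synchronization call, an asynchronous device backend may return before the computation finishes, so the measured time would reflect only the dispatch cost.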

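4. Accuracy results

Metric	Value
Samples evaluated	`100`
Mean relative error	`0.0799%`
Max relative error	`0.1487%`
Mean cosine similarity	`1.000000`
Min cosine similarity	`0.999999`
Mean of per-sample max absolute error	`0.000153`
Max of per-sample max absolute error	`0.000330`

Conclusion: NPU inference results closely match the CPU baseline. The relative error stays below 1% (maximum 0.1487%), which meets the accuracy requirement.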

5. Performance reference

NPU performance data (warmup=5, iterations=50):

Batch	Frames	Mean(ms)	Std(ms)	P99(ms)	Throughput(samples/s)
1	100	3.939	0.114	4.359	253.86
1	200	3.888	0.051	4.011	257.23
1	400	3.848	0.042	3.948	259.86
4	200	3.882	0.128	4.386	1030.35
8	200	4.084	0.028	4.163	1958.98
16	200	5.808	0.029	5.906	2754.71

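Single-sample latency (200 frames) is about 3.9 ms
At batch=16, throughput reaches about 2755 samples/s
Across 100 to 400 frames at batch=1, frame count has little effect on latency (kernel launch overhead dominates)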
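
The throughput column is consistent with batch size divided by mean latency. The short check below recomputes it from the 200-frame rows of the table; it assumes that is how the benchmark derives throughput, and the small differences from the reported values come from the mean latency being rounded to three decimals.

```python
def throughput(batch_size, mean_latency_ms):
    """Samples per second, assuming throughput = batch / mean latency."""
    return batch_size / (mean_latency_ms / 1000.0)


def per_sample_latency_ms(batch_size, mean_latency_ms):
    """Amortized latency of one sample inside a batch."""
    return mean_latency_ms / batch_size


# (batch, mean latency in ms, reported throughput) for the 200-frame rows.
rows = [
    (1, 3.888, 257.23),
    (4, 3.882, 1030.35),
    (8, 4.084, 1958.98),
    (16, 5.808, 2754.71),
]

for batch, mean_ms, reported in rows:
    computed = throughput(batch, mean_ms)
    amortized = per_sample_latency_ms(batch, mean_ms)
    print(f"batch={batch:2d} computed={computed:8.1f} reported={reported:8.2f} "
          f"per-sample={amortized:.3f} ms")
```

Going from batch=1 to batch=16 raises the batch latency by only about half, which is why the amortized per-sample latency drops so sharply.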

6. Model architecture

Input: [B, T, 80] log-mel spectrogram (80 fbank, 16kHz)
  │
  ├─ Conv2d(1→32, k=3, p=1) + ReLU
  ├─ Layer1: 3× BasicBlock(32→32)
  ├─ Layer2: 4× BasicBlock(32→64, stride=2)
  ├─ Layer3: 6× BasicBlock(64→128, stride=2)
  ├─ Layer4: 3× BasicBlock(128→256, stride=2)
  │
  ├─ Statistics Pooling: mean + std over time → [B, 5120]
  ├─ Linear(5120→256)
  ├─ L2 Normalize
  │
  Output: [B, 256] speaker embedding

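No BatchNorm: the original model already fused the BN parameters into the Conv2d weight/bias during conversion
Statistics Pooling: computes mean and std over the time dimension (dim=2), then concatenates and flattens the result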
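
The pooling step and the 5120-dim size can be illustrated with NumPy. This sketch assumes the final feature map has layout [B, C, T', F'] with C=256 and F'=10 (80 mel bins halved three times), and that the standard deviation is the population form; the flattening order in model.py may differ, so treat this as a shape illustration rather than a drop-in replacement.

```python
import numpy as np


def statistics_pooling(x: np.ndarray) -> np.ndarray:
    """Pool [B, C, T', F'] over time (axis 2) and return [B, 2 * C * F']."""
    mean = x.mean(axis=2)  # [B, C, F']
    std = x.std(axis=2)    # [B, C, F'], population std (ddof=0) is an assumption
    stats = np.concatenate([mean, std], axis=1)  # [B, 2C, F']
    return stats.reshape(x.shape[0], -1)         # [B, 2 * C * F']


# Hypothetical Layer4 output: batch 2, 256 channels, 25 time frames, 10 frequency bins.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((2, 256, 25, 10)).astype(np.float32)

pooled = statistics_pooling(feature_map)
print(pooled.shape)
```

With these shapes the output has 2 × 256 × 10 = 5120 features per sample, matching the Linear(5120→256) input.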

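7. File descriptions

File	Description
`model.py`	PyTorch model definition (ResNet34 + StatsPool)
`convert_weights.py`	MLX safetensors → PyTorch bin conversion script
`inference.py`	Inference script (supports audio and fbank input)
`benchmark_accuracy.py`	Accuracy evaluation script (CPU vs NPU)
`benchmark_perf.py`	Performance evaluation script (multiple batch sizes and frame lengths)
`config.json`	Model configuration (layer counts, channel counts, embedding dimension, etc.)
`pytorch_model.bin`	Converted PyTorch weights
`accuracy_result.json`	Accuracy evaluation results
`perf_result_npu.json`	NPU performance evaluation results
`inference_log.txt`	Inference run log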

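8. Notes

Input length: because the model contains three stride-2 downsampling stages, the number of input time frames T must satisfy T >= 8; otherwise an error is raised at the Statistics Pooling stage. Short audio is automatically padded to 8 frames.
torch_npu permission warning: if the message /usr/local/Ascend/cann-8.5.1 owner does not match appears at runtime, it is an environment permission notice and does not affect inference correctness.
fbank computation: librosa.feature.melspectrogram is used by default, with parameters consistent with the original WeSpeaker implementation (n_fft=512, hop_length=160).
Memory usage: on a single device with batch=16 and 200 frames, NPU memory usage is about 200 MB, so the model is lightweight.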
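
The frame arithmetic behind the input-length note can be sketched as below. The example assumes each downsampling stage is a 3x3 convolution with padding 1 and stride 2, and that short inputs are zero-padded at the end; neither detail is confirmed by this document, and both function names are hypothetical.

```python
MIN_FRAMES = 8  # minimum T stated in the note above


def downsampled_length(length, stages=3, kernel=3, stride=2, padding=1):
    """Length along one axis after `stages` strided convolutions (assumed k=3, p=1)."""
    for _ in range(stages):
        length = (length + 2 * padding - kernel) // stride + 1
    return length


def pad_frames(features, min_frames=MIN_FRAMES):
    """Zero-pad a list of feature frames at the end so it has at least min_frames rows."""
    missing = max(0, min_frames - len(features))
    if missing == 0:
        return features
    dim = len(features[0]) if features else 80  # 80 mel bins by default
    return features + [[0.0] * dim for _ in range(missing)]


short_clip = [[0.1] * 80 for _ in range(3)]  # only 3 frames of 80-dim fbank
padded = pad_frames(short_clip)

print(len(padded))              # frames after padding
print(downsampled_length(200))  # time frames reaching the pooling layer for T=200
print(downsampled_length(80))   # frequency bins reaching the pooling layer
```

Under these assumptions a 200-frame input reaches the pooling layer with 25 time frames, and the 80 mel bins shrink to 10.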

9. License

The original WeSpeaker model is released under the MIT License.