Pyannote Speaker Diarization — NPU Deployment

1. Introduction

This document records the quick deployment and verification results of the pyannote/speaker-diarization model on a Huawei Ascend 910B4 NPU.

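speaker-diarization is the classic speaker diarization pipeline of pyannote.audio. This port replaces the remote HuggingFace models referenced by the original configuration (pyannote/segmentation@2022.07 and speechbrain/spkrec-ecapa-voxceleb) with the local sub-models of speaker-diarization-community-1, which enables offline NPU inference.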

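The NPU port relies on the following patches:

torchcodec replaced with soundfile (there is no CUDA NVRTC in the NPU environment)
FFT computation routed to the CPU (Ascend aclnnAbs does not support complex input)
CUDA device detection replaced with NPU device detection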

2. Verification Environment

Component	Version
`pyannote.audio`	`4.0.4`
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1`
`CANN`	`8.5.1`
`soundfile`	`0.13.1`
`Python`	`3.11.14`

NPU: Ascend 910B4, 1 logical card
Model path: models/ (depends on the community-1 sub-models)
Install the dependencies:

pip install pyannote.audio==4.0.4 torch_npu soundfile

3. Inference Script

3.1 Quick inference (command line)

python3 inference.py /path/to/audio.wav --model basic --output result.json

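Optional arguments:

Argument	Description	Default
`--model`	Model type: `community-1` / `basic`	`community-1`
`--device`	Device to run on: `auto` / `npu` / `cpu` / `cuda`	`auto`
`--output`	Path of the JSON result file	no output
`--num-speakers`	Force the number of speakers	automatic
`--min-speakers`	Minimum number of speakers	automatic
`--max-speakers`	Maximum number of speakers	automatic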
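The source of inference.py is not reproduced in this document, so the following is only a sketch of how the options in the table could be declared with argparse. The function name build_parser and the positional argument name audio are assumptions made for illustration; the flags, choices, and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the real inference.py)."""
    parser = argparse.ArgumentParser(description="Speaker diarization inference")
    parser.add_argument("audio", help="path to the input audio file")
    parser.add_argument("--model", choices=["community-1", "basic"], default="community-1")
    parser.add_argument("--device", choices=["auto", "npu", "cpu", "cuda"], default="auto")
    # None means "do not write a JSON file".
    parser.add_argument("--output", default=None, help="path of the JSON result file")
    # None means "let the pipeline decide".
    parser.add_argument("--num-speakers", type=int, default=None)
    parser.add_argument("--min-speakers", type=int, default=None)
    parser.add_argument("--max-speakers", type=int, default=None)
    return parser


# Same arguments as the command-line example in section 3.1.
args = build_parser().parse_args(["/path/to/audio.wav", "--model", "basic", "--output", "result.json"])
print(args.model, args.device, args.output, args.num_speakers)
```

Using None as the default for the speaker-count options keeps "not specified" distinct from any real count.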

3.2 Python API

from inference import apply_patches, load_pipeline, run_diarization

apply_patches()
pipeline = load_pipeline(model="basic", device="auto")

result = run_diarization(pipeline, "audio.wav")
print(f"Detected {result['num_speakers']} speakers")
for seg in result["diarization"]:
    print(f"  {seg['speaker']}: {seg['start']}s -> {seg['end']}s")
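As a small follow-up, the result dictionary can be post-processed directly. The sketch below sums the speaking time per speaker. It assumes that start and end are numeric values in seconds, as the print statement above suggests; the example dictionary is made-up data with the same keys, not real output.

```python
from collections import defaultdict


def speaking_time(result: dict) -> dict:
    """Sum segment durations per speaker (assumes numeric start/end in seconds)."""
    totals = defaultdict(float)
    for seg in result["diarization"]:
        totals[seg["speaker"]] += float(seg["end"]) - float(seg["start"])
    return dict(totals)


# Made-up result with the same keys as the output of run_diarization.
example = {
    "num_speakers": 2,
    "diarization": [
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
        {"speaker": "SPEAKER_01", "start": 2.5, "end": 4.0},
        {"speaker": "SPEAKER_00", "start": 4.0, "end": 5.0},
    ],
}

for speaker, seconds in sorted(speaking_time(example).items()):
    print(f"{speaker}: {seconds:.1f}s")
```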

4. Smoke Test

python3 -c "
from inference import apply_patches, load_pipeline
apply_patches()
pipeline = load_pipeline(model='basic', device='npu')
for name, inf in pipeline._inferences.items():
    for attr in ['model', 'model_']:
        if hasattr(inf, attr):
            dev = next(getattr(inf, attr).parameters()).device
            print(f'{name} on {dev}')
"

Results:

Segmentation model on npu:0 ✓
Embedding model on npu:0 (Fbank computation routed to the CPU) ✓
PLDA + VBx clustering works correctly ✓

5. Accuracy Evaluation

The DiarizationErrorRate metric is used to check that the NPU output is consistent with the CPU output:

python3 scripts/evaluate.py

Audio duration	NPU vs CPU DER	NPU inference time	CPU inference time	Speedup
10s	0.00%	0.052s	0.737s	14.2x
30s	0.00%	0.074s	7.234s	97.8x

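The NPU and CPU outputs are identical under the Diarization Error Rate metric (DER = 0.00%), so running on the NPU introduces no accuracy loss relative to the CPU.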
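To make the consistency check concrete, here is a simplified, dependency-free sketch of a frame-level DER in which the CPU output plays the role of the reference. It is a stand-in for illustration only, not the DiarizationErrorRate implementation that scripts/evaluate.py uses. It assumes non-overlapping segments, applies no collar, and searches speaker mappings by brute force, which is only practical for a handful of speakers. The segment lists are made-up.

```python
from itertools import permutations


def frame_labels(segments, duration, step=0.01):
    """Sample non-overlapping (speaker, start, end) segments onto a fixed grid."""
    n_frames = int(round(duration / step))
    labels = [None] * n_frames  # None marks non-speech
    for speaker, start, end in segments:
        first = int(round(start / step))
        last = min(n_frames, int(round(end / step)))
        for i in range(first, last):
            labels[i] = speaker
    return labels


def simple_der(reference, hypothesis, duration, step=0.01):
    """Frame-level DER under the best one-to-one speaker mapping."""
    ref = frame_labels(reference, duration, step)
    hyp = frame_labels(hypothesis, duration, step)
    ref_speech = sum(label is not None for label in ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    ref_speakers = sorted({label for label in ref if label is not None})
    hyp_speakers = sorted({label for label in hyp if label is not None})
    # Placeholders allow a hypothesis speaker to stay unmatched.
    candidates = ref_speakers + [("unmatched", i) for i in range(len(hyp_speakers))]
    best_errors = None
    for chosen in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, chosen))
        mapping[None] = None
        # A frame is wrong if it is a miss, a false alarm, or a confusion.
        errors = sum(r != mapping[h] for r, h in zip(ref, hyp))
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / ref_speech


# Made-up outputs: same segmentation, different label names.
cpu_out = [("SPEAKER_00", 0.0, 2.0), ("SPEAKER_01", 2.0, 4.0)]
npu_out = [("A", 0.0, 2.0), ("B", 2.0, 4.0)]
print(f"{simple_der(cpu_out, npu_out, duration=4.0):.2%}")
```

Because the score is taken under the best speaker mapping, two outputs that differ only in label names still give a DER of zero.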

6. Performance Reference

6.1 NPU inference performance (Ascend 910B4)

Test conditions: synthetic audio, a single pipeline instance, mean of 5 runs.

Metric	10s audio	30s audio
Mean inference time	0.052s	0.074s
RTF (real-time factor)	0.0052	0.0025
Throughput (x real time)	192x	405x
Standard deviation of inference time	0.001s	0.003s
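The RTF and throughput rows follow from the mean inference times: RTF is inference time divided by audio duration, and throughput is its reciprocal. A short sketch of that arithmetic, using the figures from the table:

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    return inference_s / audio_s


# (audio duration in seconds, mean inference time in seconds) from the table
measurements = [(10, 0.052), (30, 0.074)]

for audio_s, inference_s in measurements:
    rtf = real_time_factor(inference_s, audio_s)
    throughput = 1.0 / rtf  # seconds of audio processed per second of compute
    print(f"{audio_s}s audio: RTF={rtf:.4f}, throughput={throughput:.0f}x")
```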

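6.2 Performance analysis

On par with community-1: the basic pipeline uses the same sub-models as community-1, so its performance figures are essentially the same
Very low RTF: the RTF for 30s audio is 0.0025, i.e. about 2.5ms of inference per second of audio
High throughput: up to 405x real time, suitable for batch and streaming workloads
Stable performance: the standard deviation across runs is at most 0.003s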

7. Porting Notes

The port uses the same approach as speaker-diarization-community-1:

7.1 torchcodec replacement

torchcodec's AudioDecoder depends on CUDA NVRTC (libnvrtc.so), which is unavailable in the NPU environment. A soundfile-based decoder is used in its place.

7.2 FFT routing

compute_fbank in the WeSpeaker ResNet uses torch.fft.rfft, which produces complex64 tensors. The aclnnAbs operator on the Ascend NPU does not support complex input, so the FFT computation is routed to the CPU.
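The routing idea can be sketched as follows: move the frames to the CPU, run the FFT and the magnitude there, and send only the real-valued result back to the original device. This is a minimal illustration, not the actual compute_fbank patch. The helper name rfft_magnitude_on_cpu is hypothetical, and the example runs on CPU tensors so it does not need an NPU.

```python
import torch


def rfft_magnitude_on_cpu(frames: torch.Tensor, n_fft: int) -> torch.Tensor:
    """Compute |rfft(frames)| on the CPU and return it on the input's device."""
    origin = frames.device
    spectrum = torch.fft.rfft(frames.to("cpu"), n=n_fft)  # complex64 for float32 input
    magnitude = spectrum.abs()  # real-valued, so no complex tensor reaches the accelerator
    return magnitude.to(origin)


frames = torch.randn(4, 400)  # 4 frames of 400 samples each
magnitude = rfft_magnitude_on_cpu(frames, n_fft=512)
print(magnitude.shape, magnitude.dtype, magnitude.device)
```

Only the real-valued magnitude crosses back to the accelerator. The price is one transfer in each direction per call.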

7.3 Device detection

pyannote.audio.pipelines.utils.getter.get_devices() hard-codes torch.cuda.device_count(); the patch replaces it with torch.npu.device_count().
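A device probe in the same spirit is sketched below. It assumes that torch.npu only exists once torch_npu has been imported, and it falls back to CUDA and then to the CPU so that it also runs on machines without an NPU. The function name detect_devices is an assumption, and this is not the patched get_devices() itself.

```python
import torch

try:
    import torch_npu  # noqa: F401  (importing it registers the torch.npu namespace)
except ImportError:
    torch_npu = None


def detect_devices():
    """Return (backend, devices), preferring NPU, then CUDA, then CPU."""
    npu = getattr(torch, "npu", None)
    if npu is not None and npu.is_available():
        backend, count = "npu", npu.device_count()
    elif torch.cuda.is_available():
        backend, count = "cuda", torch.cuda.device_count()
    else:
        backend, count = "cpu", 0
    devices = [torch.device(backend, index) for index in range(count)]
    return backend, devices or [torch.device("cpu")]


backend, devices = detect_devices()
print(backend, devices)
```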

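7.4 Sub-model dependencies

The original configuration of the basic pipeline references remote HuggingFace models (pyannote/segmentation@2022.07 and speechbrain/spkrec-ecapa-voxceleb). This port uses the local sub-models of speaker-diarization-community-1 instead, which avoids remote downloads and authentication problems.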

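8. Notes

Depends on the community-1 sub-models: the basic pipeline needs the segmentation, embedding, and plda sub-models of speaker-diarization-community-1
Audio format: 16kHz mono WAV is recommended
torchcodec warning: torchcodec failing to load in the NPU environment is expected
Fbank computation: routing the FFT to the CPU means embedding inference involves CPU↔NPU data transfers
Model directory layout: the models/speaker-diarization-community-1/{segmentation,embedding,plda}/ subdirectories are required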
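Before loading the pipeline, the directory layout can be checked with a few lines of pathlib. The sketch below is an optional pre-flight check and is not part of inference.py; the function name missing_submodels is hypothetical. The demonstration uses a temporary directory in which plda is deliberately left out.

```python
import tempfile
from pathlib import Path

REQUIRED_SUBMODELS = ("segmentation", "embedding", "plda")


def missing_submodels(models_root):
    """List the required sub-model directories that are absent under models_root."""
    base = Path(models_root) / "speaker-diarization-community-1"
    return [name for name in REQUIRED_SUBMODELS if not (base / name).is_dir()]


# Build an incomplete layout (no plda/) to show what the check reports.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("segmentation", "embedding"):
        (Path(tmp) / "speaker-diarization-community-1" / name).mkdir(parents=True)
    missing = missing_submodels(tmp)

print("missing sub-models:", missing)
```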