m0_74196153/speaker-diarization-community-1


Pyannote Speaker Diarization Community-1 — NPU Deployment

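1. Introduction

This document records the quick deployment and verification results of the pyannote/speaker-diarization-community-1 speaker diarization model on a Huawei Ascend 910B4 NPU.

speaker-diarization-community-1 is a complete speaker diarization pipeline released by the pyannote.audio community. It consists of three sub-models: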

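- Segmentation (PyanNet): splits the audio into speaker segments
- Embedding (WeSpeaker ResNet34-LM): extracts 256-dimensional speaker embeddings
- PLDA + VBx Clustering: clusters the embedding vectors to tell speakers apart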

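Running the model on the NPU relies on the following patches:

- torchcodec → soundfile replacement (the NPU environment has no CUDA NVRTC)
- FFT computation routed to the CPU (the Ascend aclnnAbs operator does not support complex input)
- CUDA device detection replaced with NPU device detection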

2. Verification Environment

| Component | Version |
| --- | --- |
| `pyannote.audio` | `4.0.4` |
| `torch` | `2.9.0+cpu` |
| `torch-npu` | `2.9.0.post1` |
| `CANN` | `8.5.1` |
| `soundfile` | `0.13.1` |
| `Python` | `3.11.14` |

- NPU: Ascend 910B4, 1 logical card
- Model path: models/speaker-diarization-community-1/

Install the dependencies:

pip install pyannote.audio==4.0.4 torch_npu soundfile

3. Inference Scripts

3.1 Quick Inference (Command Line)

python3 inference.py /path/to/audio.wav --model community-1 --output result.json

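Optional arguments:

| Argument | Description | Default |
| --- | --- | --- |
| `--model` | Model type: `community-1` / `basic` | `community-1` |
| `--device` | Device to run on: `auto` / `npu` / `cpu` / `cuda` | `auto` |
| `--output` | Path of the JSON result file | none (nothing is written) |
| `--num-speakers` | Force the number of speakers | automatic |
| `--min-speakers` | Minimum number of speakers | automatic |
| `--max-speakers` | Maximum number of speakers | automatic |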
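The source of inference.py is not reproduced in this document, so the following is only a sketch of how the options in the table could be declared with `argparse`. The function name `build_parser` and the help strings are illustrative assumptions; only the option names, choices, and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options of inference.py."""
    parser = argparse.ArgumentParser(description="Speaker diarization inference")
    parser.add_argument("audio", help="path to the input audio file")
    parser.add_argument("--model", choices=["community-1", "basic"],
                        default="community-1")
    parser.add_argument("--device", choices=["auto", "npu", "cpu", "cuda"],
                        default="auto")
    # None means that no JSON file is written.
    parser.add_argument("--output", default=None)
    # None means that the pipeline decides the speaker count on its own.
    parser.add_argument("--num-speakers", type=int, default=None)
    parser.add_argument("--min-speakers", type=int, default=None)
    parser.add_argument("--max-speakers", type=int, default=None)
    return parser


args = build_parser().parse_args(["meeting.wav", "--num-speakers", "2"])
print(args.model, args.device, args.num_speakers)
```

Leaving the three speaker-count options at `None` corresponds to the "automatic" default in the table.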

3.2 Python API

from inference import apply_patches, load_pipeline, run_diarization

# Apply the NPU patches
apply_patches()

# Load the pipeline (the NPU is detected automatically)
pipeline = load_pipeline(model="community-1", device="auto")

# Run speaker diarization
result = run_diarization(pipeline, "audio.wav")
print(f"Detected {result['num_speakers']} speakers")
for seg in result["diarization"]:
    print(f"  {seg['speaker']}: {seg['start']}s -> {seg['end']}s")
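The result dictionary above can be converted to other formats with little code. As an illustration, the sketch below writes the segments as RTTM lines, a common plain-text format for diarization output. It assumes that `start` and `end` are numeric values in seconds; the helper name `to_rttm` and the sample data are hypothetical.

```python
def to_rttm(result: dict, uri: str) -> str:
    """Render result["diarization"] as RTTM text, one SPEAKER line per segment."""
    lines = []
    for seg in result["diarization"]:
        start = float(seg["start"])
        duration = float(seg["end"]) - start
        lines.append(
            f"SPEAKER {uri} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)


# Hand-written sample in the same shape as the result used above.
example = {
    "num_speakers": 2,
    "diarization": [
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5},
        {"speaker": "SPEAKER_01", "start": 1.5, "end": 4.0},
    ],
}
rttm_text = to_rttm(example, "audio")
print(rttm_text)
```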

3.3 Full Test

python3 scripts/test_npu.py

4. Smoke Test

Basic functional check:

# Load the pipeline and check that the sub-models are on the NPU
python3 -c "
from inference import apply_patches, load_pipeline
apply_patches()
pipeline = load_pipeline(model='community-1', device='npu')
# Check that the sub-models are on the NPU
for name, inf in pipeline._inferences.items():
    for attr in ['model', 'model_']:
        if hasattr(inf, attr):
            dev = next(getattr(inf, attr).parameters()).device
            print(f'{name} on {dev}')
"

Results:

- Segmentation model on npu:0 ✓
- Embedding model on npu:0 (Fbank computation routed to the CPU) ✓
- Embedding forward pass returns shape (1, 256) ✓

5. Accuracy Evaluation

Accuracy verification

The DiarizationErrorRate metric is used to check that the NPU output is consistent with the CPU output:

python3 scripts/evaluate.py

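| Audio duration | NPU vs CPU DER | NPU inference time | CPU inference time | Speedup |
| --- | --- | --- | --- | --- |
| 10s | 0.00% | 0.052s | 0.733s | 14.1x |
| 30s | 0.00% | 0.088s | 7.214s | 82.0x |
| 60s | 0.00% | 0.156s | - | ~100x |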

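The DER between the NPU and CPU outputs is 0.00% for every tested duration, i.e. the two outputs are identical under this metric and the NPU port loses no accuracy.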
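To make the metric concrete, here is a simplified, frame-based sketch of a diarization error rate. It is not the DiarizationErrorRate implementation used by scripts/evaluate.py: it assumes non-overlapping speech, a fixed 10 ms frame step, and a brute-force search over speaker mappings that is only practical for a handful of speakers. All function names are illustrative.

```python
from itertools import permutations


def frame_labels(segments, duration, step=0.01):
    """Label each frame with its speaker, or None where nobody speaks."""
    n_frames = int(round(duration / step))
    labels = [None] * n_frames
    for seg in segments:
        first = int(round(seg["start"] / step))
        last = min(n_frames, int(round(seg["end"] / step)))
        for i in range(first, last):
            labels[i] = seg["speaker"]
    return labels


def simple_der(reference, hypothesis, duration, step=0.01):
    """(missed + false alarm + confusion) frames / reference speech frames."""
    ref = frame_labels(reference, duration, step)
    hyp = frame_labels(hypothesis, duration, step)
    ref_speech = sum(label is not None for label in ref)
    if ref_speech == 0:
        return 0.0  # simplification: the rate is undefined without reference speech
    ref_names = sorted({r for r in ref if r is not None})
    hyp_names = sorted({h for h in hyp if h is not None})
    # Each hypothesis speaker maps to one reference speaker or to nothing (None).
    targets = ref_names + [None] * len(hyp_names)
    best_errors = None
    for assignment in permutations(targets, len(hyp_names)):
        mapping = dict(zip(hyp_names, assignment))
        errors = 0
        for r, h in zip(ref, hyp):
            if r is None and h is None:
                continue  # both agree on non-speech
            if r is None or h is None or mapping[h] != r:
                errors += 1  # false alarm, miss, or speaker confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / ref_speech


cpu_out = [{"speaker": "A", "start": 0.0, "end": 5.0},
           {"speaker": "B", "start": 5.0, "end": 10.0}]
npu_out = [{"speaker": "spk0", "start": 0.0, "end": 5.0},
           {"speaker": "spk1", "start": 5.0, "end": 10.0}]
der_same = simple_der(cpu_out, npu_out, duration=10.0)
print(f"DER = {der_same:.2%}")
```

Because the speaker mapping is optimized before errors are counted, two outputs with different label names but identical boundaries still yield a DER of zero, which is the situation reported in the table.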

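6. Performance Reference

6.1 NPU Inference Performance (Ascend 910B4)

Test conditions: synthetic audio, a single pipeline run per measurement, averaged over 5 runs.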
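A measurement loop of this kind can be sketched as follows. The benchmark script itself is not shown in this document, so the warm-up run, the use of the sample standard deviation, and the name `benchmark` are assumptions. On an accelerator, the device would also need to be synchronized before each clock read so that asynchronous work is included in the timing.

```python
import statistics
import time


def benchmark(fn, runs=5, warmup=1):
    """Call fn() repeatedly and return (mean, standard deviation) in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - t0)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload; in practice fn would run the diarization pipeline once.
mean_s, std_s = benchmark(lambda: sum(range(100_000)))
print(f"mean={mean_s:.4f}s std={std_s:.4f}s")
```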

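| Metric | 10s audio | 30s audio | 60s audio |
| --- | --- | --- | --- |
| Mean inference time | 0.052s | 0.088s | 0.156s |
| RTF (real-time factor) | 0.0052 | 0.0029 | 0.0026 |
| Throughput (x real time) | 192x | 341x | 386x |
| Inference time standard deviation | 0.001s | 0.027s | 0.008s |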

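6.2 Performance Analysis

- Short audio (10s): RTF 0.0052, i.e. each second of audio needs only 5.2ms of inference
- Long audio (60s): RTF 0.0026, a throughput of nearly 386x real time
- Speedup (vs CPU): the NPU is 14x to roughly 100x faster than the CPU, and the advantage grows with audio length
- The standard deviation of the inference time is low in absolute terms (0.001s to 0.027s), so performance is stable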
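The derived figures follow directly from the measured times: RTF is inference time divided by audio duration, throughput is its reciprocal, and speedup is CPU time divided by NPU time. The short sketch below recomputes them from the rounded times listed in this document; results recomputed from rounded inputs can differ from the published figures in the last digit.

```python
def rtf(inference_s: float, audio_s: float) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return inference_s / audio_s


def speedup(cpu_s: float, npu_s: float) -> float:
    """How many times faster the NPU run is than the CPU run."""
    return cpu_s / npu_s


# Audio duration in seconds -> mean NPU inference time in seconds (from section 6.1).
npu_times = {10: 0.052, 30: 0.088, 60: 0.156}
for audio_s, infer_s in npu_times.items():
    factor = rtf(infer_s, audio_s)
    print(f"{audio_s}s audio: RTF={factor:.4f}, throughput={1 / factor:.0f}x")

# CPU times are only reported for the 10s and 30s cases.
print(f"10s speedup: {speedup(0.733, 0.052):.1f}x")
print(f"30s speedup: {speedup(7.214, 0.088):.1f}x")
```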

7. Adaptation Notes

Running speaker-diarization-community-1 on the NPU requires the following adaptations:

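7.1 torchcodec Replacement

torchcodec's AudioDecoder depends on CUDA NVRTC (libnvrtc.so), which is not available in the NPU environment. A replacement decoder implemented with soundfile is injected into the pyannote.audio.core.io module.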
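The decoding step of such a replacement can be sketched as below. This is not the patch shipped with the project, and it does not show how the decoder is injected into pyannote.audio.core.io; it only illustrates reading a file with soundfile and producing a channels-first float32 tensor. The function name `load_waveform` is hypothetical.

```python
def load_waveform(path):
    """Decode an audio file into a (channels, samples) float32 tensor."""
    # Imported lazily so that this module still imports where the libraries are absent.
    import soundfile as sf
    import torch

    # always_2d=True returns shape (samples, channels), also for mono files.
    data, sample_rate = sf.read(path, dtype="float32", always_2d=True)
    # Transpose to (channels, samples) and copy into contiguous memory.
    waveform = torch.from_numpy(data.T.copy())
    return waveform, sample_rate
```

pyannote.audio pipelines also accept audio that is already in memory, in the form `{"waveform": waveform, "sample_rate": sample_rate}`, so a tensor decoded this way can be passed to the pipeline directly.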

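7.2 FFT Routing

compute_fbank in WeSpeaker ResNet uses torch.fft.rfft, which produces complex64 tensors. The aclnnAbs operator on the Ascend NPU does not support complex input. Following the existing workaround for MPS, the FFT computation is routed to the CPU.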
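The routing idea can be sketched as follows: complex-valued work runs on the CPU, and only the real-valued result is moved back to the model device. This is a minimal illustration rather than the actual patch, and both function names are hypothetical.

```python
def fft_device_type(device_type: str) -> str:
    """Return the device type the FFT should run on for a given model device type."""
    # Backends treated here as lacking complex support fall back to the CPU.
    return "cpu" if device_type in {"mps", "npu"} else device_type


def magnitude_spectrum(frames):
    """Compute |rfft(frames)| and return it on the device that frames lives on."""
    import torch  # imported lazily; only this function needs it

    target = frames.device
    if fft_device_type(target.type) == "cpu":
        fft_device = torch.device("cpu")
    else:
        fft_device = target
    spectrum = torch.fft.rfft(frames.to(fft_device))  # complex tensor on fft_device
    return spectrum.abs().to(target)  # real tensor, back on the model device


print(fft_device_type("npu"), fft_device_type("cuda"))
```

The absolute value is taken before the transfer back, so no complex tensor ever reaches the NPU.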

7.3 Device Detection

pyannote.audio.pipelines.utils.getter.get_devices() hard-codes torch.cuda.device_count(), so it has to be patched to use torch.npu.device_count() instead.
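The selection logic of a patched function can be sketched without any hardware. The version below assumes that devices are handed out round-robin with a CPU fallback when no NPU is found; that behavior is an assumption about the function being replaced, not something this document specifies. It returns device names as strings to stay dependency-free, and in a real patch `num_npus` would come from torch.npu.device_count().

```python
from itertools import cycle, islice


def npu_device_names(num_npus, needs=None):
    """List device names for `needs` workers, or one per NPU if needs is None."""
    if num_npus == 0:
        # No NPU found: fall back to the CPU.
        return ["cpu"] if needs is None else ["cpu"] * needs
    names = [f"npu:{index}" for index in range(num_npus)]
    if needs is None:
        return names
    # Reuse devices round-robin when more are needed than exist.
    return list(islice(cycle(names), needs))


print(npu_device_names(1, needs=2))
```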

7.4 Safe Deserialization

Since PyTorch 2.6, torch.load defaults to weights_only=True. pyannote's Specifications and Problem classes therefore have to be registered via torch.serialization.add_safe_globals.
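A registration step along these lines might look as follows. The import path pyannote.audio.core.task is an assumption about where the two classes live, and the wrapper function is hypothetical; it returns False instead of raising when the libraries are not installed.

```python
def register_pyannote_safe_globals() -> bool:
    """Allow-list the pyannote classes needed when loading with weights_only=True."""
    try:
        import torch
        # Assumed location of the classes named in this section.
        from pyannote.audio.core.task import Problem, Specifications

        torch.serialization.add_safe_globals([Specifications, Problem])
    except (ImportError, AttributeError):
        # Libraries missing, or a torch version without add_safe_globals.
        return False
    return True


registered = register_pyannote_safe_globals()
print("safe globals registered:", registered)
```

The call has to happen before the checkpoint is loaded, which is why it belongs with the other patches rather than in the inference step.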

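8. Notes

- Model weights: download the complete model weights from GitCode or HuggingFace (including segmentation/pytorch_model.bin, embedding/pytorch_model.bin, and plda/plda.npz)
- Audio format: 16kHz mono WAV is recommended, matching the model's training data
- torchcodec warning: torchcodec failing to load in the NPU environment is expected and does not affect inference
- Fbank computation: routing the FFT to the CPU causes CPU↔NPU data transfers during embedding inference, but the impact on overall performance is small (measured RTF stays within 0.003 to 0.005)
- Multi-card deployment: verification so far covers a single NPU card; for multi-card setups, the visible devices can be controlled with the ASCEND_RT_VISIBLE_DEVICES environment variable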