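MLX-compatible weights for WeSpeaker ResNet34-LM, converted from the pyannote speaker embedding model, with BatchNorm fused into Conv2d.

WeSpeaker ResNet34-LM is a speaker embedding model (about 6.6M parameters) that produces 256-dimensional, L2-normalized speaker embeddings from audio. It was trained on the VoxCeleb dataset and is suited to speaker verification and speaker diarization.

Architecture:
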
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16kHz)
│
├─ Conv2d(1→32, k=3, p=1) + ReLU
├─ Layer1: 3× BasicBlock(32→32)
├─ Layer2: 4× BasicBlock(32→64, stride=2)
├─ Layer3: 6× BasicBlock(64→128, stride=2)
├─ Layer4: 3× BasicBlock(128→256, stride=2)
│
├─ Statistics Pooling: mean + std → [B, 5120]
├─ Linear(5120→256) → L2 normalize
│
Output: [B, 256] speaker embedding

BatchNorm is fused into Conv2d during conversion, so the MLX model contains no BN layers.

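The 5120-wide pooling output in the diagram can be checked with a little shape arithmetic. The sketch below is only an illustration in Python, not part of the model code: `conv_out` and `pooled_dim` are hypothetical helper names, and it assumes that every stage uses 3×3 convolutions with padding 1, that the stride is applied once per stage, and that statistics pooling reduces over time only, so channels and frequency bins are flattened together.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length along one axis for a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pooled_dim(n_mels=80, stage_strides=(1, 2, 2, 2), out_channels=256):
    """Width of the statistics-pooling output (mean and std concatenated)."""
    freq = conv_out(n_mels)            # stem conv: k=3, p=1, stride 1
    for stride in stage_strides:       # Layer1..Layer4, strided once per stage
        freq = conv_out(freq, stride=stride)
    # Assumption: pooling reduces over time only, so channels x frequency
    # bins are flattened; concatenating mean and std doubles the width.
    return 2 * out_channels * freq


print(pooled_dim())  # 5120
```

Under these assumptions the 80 mel bins shrink to 10 after three stride-2 stages, and 2 × 256 × 10 gives the 5120 inputs of the final Linear layer.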
import SpeechVAD
// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized
// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)
// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}

Part of speech-swift.

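Because the embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The following is a minimal Python sketch of that comparison, independent of the Swift API above. The helper names and the `threshold` value are illustrative assumptions, not values from this project, and a real threshold would need tuning on held-out data.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.5):
    """Hypothetical decision rule; 0.5 is a placeholder, not a tuned value."""
    return cosine_similarity(a, b) >= threshold


# Toy 2-D vectors standing in for 256-D embeddings.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
dot = sum(x * y for x, y in zip(a, b))
# For unit-length vectors the dot product equals the cosine similarity.
print(abs(dot - cosine_similarity(a, b)) < 1e-9)
```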
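python3 scripts/convert_wespeaker.py --upload

The script converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom deserializer, so pyannote.audio is not required as a dependency. The main transformations are:
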
- BatchNorm fusion: w_fused = w × γ/√(σ²+ε), b_fused = β − μ×γ/√(σ²+ε)
- Weight transposition: [O, I, H, W] → [O, H, W, I], to match MLX's channels-last layout
- Key renaming: strip the `resnet.` prefix and rename `seg_1` → `embedding`
- Dropped keys: `num_batches_tracked`

| PyTorch key | MLX key | Shape |
|---|---|---|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |

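The conversion steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `scripts/convert_wespeaker.py`. The function names are hypothetical, nested lists stand in for real tensors, the convolutions are assumed to have no bias of their own (as the fusion formula implies), and the `shortcut.0` → `shortcut` rename is inferred from the key table.

```python
import math


def fuse_bn(weight, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into a bias-free conv weight laid out as [O][I][H][W]."""
    fused_w, fused_b = [], []
    for o, kernel in enumerate(weight):
        scale = gamma[o] / math.sqrt(var[o] + eps)   # gamma / sqrt(var + eps)
        fused_w.append([[[v * scale for v in row] for row in plane]
                        for plane in kernel])
        fused_b.append(beta[o] - mean[o] * scale)    # beta - mean * scale
    return fused_w, fused_b


def to_channels_last(weight):
    """Transpose [O][I][H][W] to [O][H][W][I]."""
    return [
        [[[kernel[i][h][w] for i in range(len(kernel))]
          for w in range(len(kernel[0][0]))]
         for h in range(len(kernel[0]))]
        for kernel in weight
    ]


def rename_key(key):
    """Map a PyTorch key to its MLX name; None means the key is dropped."""
    if key.endswith("num_batches_tracked"):
        return None
    if key.startswith("resnet."):
        key = key[len("resnet."):]
    key = key.replace("seg_1", "embedding")
    # Assumption from the table: the shortcut conv loses its ".0" index.
    return key.replace("shortcut.0.", "shortcut.")


# One output channel, one input channel, 1x1 kernel.
w, b = fuse_bn([[[[2.0]]]], gamma=[3.0], beta=[1.0], mean=[0.5], var=[4.0], eps=0.0)
print(w, b)
print(rename_key("resnet.layer2.0.shortcut.0.weight"))
```

In the toy example, the input x = 1 gives BN(conv(x)) = 3 × (2 − 0.5)/2 + 1 = 3.25, and the fused layer gives 3.0 × 1 + 0.25 = 3.25, so the two forms agree.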
The original WeSpeaker model is released under the MIT License.