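smart-turn-v2 is a Wav2Vec2-based speech endpoint detection model optimized for speech processing tasks. It extracts and processes features from audio signals and is suited to front-end processing for tasks such as speech recognition and speaker identification.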

Project layout:

    smart-turn-v2-ascend/
    ├── inference.py            # Inference test script
    ├── log.txt                 # Test log
    ├── README.md               # This document
    ├── test_audio.npy          # Test audio sample
    ├── inference_result.json   # Inference results
    └── precision_result.json   # Precision test results

Enter the container and load the Ascend environment:

    docker exec -it test-modelagent bash
    source /usr/local/Ascend/ascend-toolkit/set_env.sh

The model files are located in `/data/ysws/agentsp/5-16/smart-turn-v2/pipecat-ai/smart-turn-v2/`.

Install the dependencies:

    pip install transformers torch_npu numpy scipy -i https://pypi.huaweicloud.com/repository/pypi/simple/

Run the inference script to extract audio features:

    cd /data/ysws/agentsp/5-16/smart-turn-v2-ascend/
    python3 inference.py                    # default mode: all
    python3 inference.py --mode inference   # inference test only

Run the precision comparison test:

    cd /data/ysws/agentsp/5-16/smart-turn-v2-ascend/
    python3 inference.py --mode precision_test

| Parameter | Description | Default |
|---|---|---|
| `--mode` | Test mode: `all`, `inference`, or `precision_test` | `all` |

| Metric | Measured | Threshold | Status |
|---|---|---|---|
| Max relative error | 0.6073% | < 1.00% | PASS |
| CPU inference time | 2.314 s | - | - |
| NPU inference time | 0.016 s | - | - |
| Speedup | 148.94x | > 1x | PASS |
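
The error and speedup figures in the table can be derived from the raw outputs and timings. The sketch below is one plausible way to do that. It assumes the relative error is the maximum absolute difference normalized by the largest reference magnitude; the exact formula used by `inference.py` is not shown in this README, and the helper names `compare_outputs` and `speedup` are hypothetical.

```python
import numpy as np


def compare_outputs(reference, candidate, threshold_pct=1.0):
    """Compare two output arrays (e.g. CPU vs NPU hidden states).

    Assumption: relative error = max|ref - cand| / max|ref|, in percent.
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    max_abs_err = float(np.max(np.abs(reference - candidate)))
    # Small epsilon guards against division by zero for an all-zero reference.
    max_rel_err_pct = 100.0 * max_abs_err / (float(np.max(np.abs(reference))) + 1e-12)
    return {
        "max_abs_err": max_abs_err,
        "max_rel_err_pct": max_rel_err_pct,
        "passed": max_rel_err_pct < threshold_pct,
    }


def speedup(cpu_seconds, npu_seconds):
    """Ratio of CPU time to NPU time; values above 1 mean the NPU is faster."""
    return cpu_seconds / npu_seconds


# Toy data standing in for real model outputs.
ref = np.array([1.0, -2.0, 4.0])
cand = ref + np.array([0.0, 0.01, -0.02])
result = compare_outputs(ref, cand)
print(result)
```

Note that dividing the rounded times from the table (2.314 s / 0.016 s) gives roughly 144.6x rather than 148.94x, presumably because the reported speedup was computed from unrounded timings.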

Input: 3 s of audio (48000 samples at a 16 kHz sampling rate)

Output:

    smart-turn-v2 NPU Test
    Model: pipecat-ai/smart-turn-v2 (wav2vec2 endpointing)
    Output: /data/ysws/agentsp/5-16/smart-turn-v2-ascend
    ============================================================
    Inference Test (NPU)
    ============================================================
    Device: npu:0
    Loading model and feature extractor...
    Model loaded successfully
    Audio length: 48000 samples (3s)
    Input shape: torch.Size([1, 48000])
    Inference time: 5.196s
    Last hidden state shape: torch.Size([1, 149, 768])
    Saved test audio: /data/ysws/agentsp/5-16/smart-turn-v2-ascend/test_audio.npy (3s)
    ============================================================
    Precision Test (CPU vs NPU)
    ============================================================
    NPU Device: npu:0
    Loading model...
    Audio length: 48000 samples (3s)
    Running on CPU...
    Running on NPU...
    CPU inference time: 2.314s
    NPU inference time: 0.016s
    Speedup: 148.94x
    Max absolute error: 5.409449e-03
    Max relative error: 0.6073% (threshold: 1.0%)
    Status: PASS
    ============================================================
    Precision Test Result: PASS
    ============================================================
    ============================================================
    Test Complete!
    ============================================================

Example: inference on a single clip.

    import numpy as np
    import torch
    import torch_npu  # registers the "npu" device with PyTorch
    from transformers import AutoModel, Wav2Vec2FeatureExtractor

    MODEL_DIR = "/data/ysws/agentsp/5-16/smart-turn-v2/pipecat-ai/smart-turn-v2"

    # Load the feature extractor and the model, then move the model to the NPU.
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_DIR)
    model = AutoModel.from_pretrained(MODEL_DIR, trust_remote_code=True)
    model = model.to("npu:0").eval()

    # One second of low-amplitude random noise at 16 kHz as placeholder audio.
    audio_data = np.random.randn(16000).astype(np.float32) * 0.01
    inputs = feature_extractor(audio_data, sampling_rate=16000, return_tensors="pt")
    inputs = {k: v.to("npu:0") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        hidden_states = outputs.last_hidden_state

    print(f"Hidden states shape: {hidden_states.shape}")

Example: inference on a batch, reusing the model and feature extractor loaded above.

    # Four one-second clips of equal length, so no padding is required.
    audio_batch = [np.random.randn(16000).astype(np.float32) * 0.01 for _ in range(4)]
    inputs = feature_extractor(audio_batch, sampling_rate=16000, return_tensors="pt")
    inputs = {k: v.to("npu:0") for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

| Component | Description |
|---|---|
| feature_extractor | CNN feature extractor |
| feature_projection | Feature projection layer |
| encoder | 12-layer Transformer encoder |
| masked_spec_embed | Masked feature embedding |
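
The CNN feature extractor downsamples the raw waveform before it reaches the encoder, which is why the 48000-sample input in the test log yields 149 frames. The sketch below reproduces that count from the `conv_stride` values in config.json. The kernel sizes are not listed in this README; the values used here are the standard Wav2Vec2 defaults and are an assumption, as is the helper name `num_frames`.

```python
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]   # from config.json
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]  # assumed: standard Wav2Vec2 defaults


def num_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Output length of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


print(num_frames(48000))  # 149, matching the logged shape [1, 149, 768]
print(num_frames(16000))  # 49
```

The strides multiply to 320, so under these assumptions each output frame advances by 320 samples, or 20 ms at 16 kHz.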

Key parameters extracted from config.json:

    {
      "hidden_size": 768,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "vocab_size": 32,
      "conv_dim": [512, 512, 512, 512, 512, 512, 512],
      "conv_stride": [5, 2, 2, 2, 2, 2, 2],
      "sampling_rate": 16000
    }

A: Check that the NPU driver is installed correctly.

A: The numerical difference between the CPU and NPU outputs of the Wav2Vec2 model is very small (< 0.7%), well below the 1% threshold.

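A: The NPU is far faster than the CPU, about 148x in the precision test (2.314 s on CPU versus 0.016 s on NPU), which makes it suitable for real-time speech processing.

A: In theory there is no limit, but longer audio uses more memory. Processing long audio in segments is recommended.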
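
A minimal sketch of the segmenting approach suggested above: split a long waveform into fixed-length chunks and run each chunk through the model separately. The 3 s chunk length and the helper name `split_audio` are illustrative choices, not values taken from this project.

```python
def split_audio(samples, sampling_rate=16000, chunk_seconds=3.0):
    """Yield consecutive chunks of at most chunk_seconds each.

    Works on anything that supports len() and slicing (lists, NumPy arrays).
    The last chunk may be shorter than the others.
    """
    chunk_len = int(sampling_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds must be positive")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# Ten seconds of silence as stand-in audio.
audio = [0.0] * (16000 * 10)
chunks = list(split_audio(audio))
print([len(c) for c in chunks])  # [48000, 48000, 48000, 16000]
```

Each chunk can then be passed to the feature extractor and model in turn, which keeps peak memory bounded by the chunk length rather than the full recording.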

This project is licensed under the Apache-2.0 license.