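mms-tts-eng (Massively Multilingual Speech - English Text-to-Speech) is a VITS model developed by Meta AI for English speech synthesis. It converts text into natural, fluent English speech waveforms and outputs audio at a 16 kHz sampling rate.
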
mms-tts-eng-ascend/
├── inference.py # Inference test script
├── output_1.wav # Test audio output 1
├── output_2.wav # Test audio output 2
├── output_3.wav # Test audio output 3
├── log.txt # Test log
├── README.md # This document

docker exec -it test-modelagent bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh

The model files are located under /data/ysws/agentsp/5-15/mms-tts-eng/.

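Run the inference script to synthesize speech from text:
cd /data/ysws/agentsp/5-15/mms-tts-eng-ascend/
python3 inference.py --mode inference --device npu:0

Run the precision comparison test to verify that the NPU results are consistent with the CPU results: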
cd /data/ysws/agentsp/5-15/mms-tts-eng-ascend/
python3 inference.py --mode precision_test

| Parameter | Description | Default |
|---|---|---|
| --mode | Test mode: inference or precision_test | inference |
| --device | Device to run on | npu:0 (auto-detected) |

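
The source of inference.py is not reproduced here, so the following is only a minimal sketch of how the two documented options could be declared with `argparse`. The function name `build_parser` and the help strings are assumptions for illustration; only the option names, the allowed modes, and the defaults come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two documented options."""
    parser = argparse.ArgumentParser(description="mms-tts-eng NPU test (sketch)")
    parser.add_argument(
        "--mode",
        choices=["inference", "precision_test"],
        default="inference",
        help="test mode",
    )
    parser.add_argument(
        "--device",
        default="npu:0",
        help="device to run on",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(["--mode", "precision_test"])
print(args.mode, args.device)
```

Restricting `--mode` with `choices` makes the parser reject any value other than the two documented modes before the model is loaded.
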
| Metric | Measured | Threshold | Status |
|---|---|---|---|
| Hidden states max relative error | 0.0052% | < 1.00% | PASS |
| Hidden states cosine similarity | 1.000000 | > 0.99 | PASS |
| Attention max relative error | 0.0656% | < 1.00% | PASS |
| Attention cosine similarity | 1.000000 | > 0.99 | PASS |

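
The two metrics in this table can be computed with a few lines of NumPy. The sketch below is an illustration under assumptions, not the code in inference.py: in particular, the exact definition of "max relative error" used by the script is not shown in this document, so the version here (largest absolute difference divided by the largest reference magnitude) is one plausible choice. The arrays are synthetic stand-ins with the same shape as the logged hidden states.

```python
import numpy as np


def max_relative_error(ref: np.ndarray, test: np.ndarray) -> float:
    """Largest absolute difference scaled by the largest reference magnitude.

    Assumed definition; the script's own formula may differ.
    """
    return float(np.max(np.abs(ref - test)) / np.max(np.abs(ref)))


def cosine_similarity(ref: np.ndarray, test: np.ndarray) -> float:
    """Cosine of the angle between the two flattened tensors."""
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def passes(ref: np.ndarray, test: np.ndarray,
           rel_threshold: float = 0.01, cos_threshold: float = 0.99) -> bool:
    """Apply the thresholds from the table: error < 1% and cosine > 0.99."""
    return (max_relative_error(ref, test) < rel_threshold
            and cosine_similarity(ref, test) > cos_threshold)


# Synthetic stand-ins for the CPU and NPU hidden states (shape taken from the log).
rng = np.random.default_rng(0)
cpu = rng.standard_normal((1, 95, 192)).astype(np.float32)
npu = cpu + 1e-5 * rng.standard_normal(cpu.shape).astype(np.float32)

print(passes(cpu, npu))
```

Checking both metrics together is useful because cosine similarity alone is insensitive to a uniform scale error, while the relative error catches it.
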
| Operation | Time |
|---|---|
| CPU inference time (1 sentence) | 7.735s |
| NPU inference time (1 sentence) | 8.298s |

| Input sentence | Output shape | Sampling rate | Inference time |
|---|---|---|---|
| "Hello world, this is a test..." | [1, 67328] | 16kHz | 8.179s |
| "The VITS model can generate..." | [1, 58112] | 16kHz | 0.099s |
| "Massively Multilingual Speech..." | [1, 80896] | 16kHz | 0.096s |
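
To put these numbers in perspective, the output length in samples can be converted to seconds of audio and compared with the inference time. The sketch below does that arithmetic for the three rows of the table; the helper names `audio_seconds` and `real_time_factor` are introduced here for illustration only. For example, 67328 samples at 16 kHz correspond to about 4.2 s of audio.

```python
SAMPLING_RATE = 16000  # Hz, as reported for this model

# (number of output samples, inference time in seconds) from the table above
runs = [(67328, 8.179), (58112, 0.099), (80896, 0.096)]


def audio_seconds(num_samples: int, sampling_rate: int = SAMPLING_RATE) -> float:
    """Duration of the generated waveform in seconds."""
    return num_samples / sampling_rate


def real_time_factor(inference_seconds: float, num_samples: int,
                     sampling_rate: int = SAMPLING_RATE) -> float:
    """Inference time divided by audio duration; below 1 means faster than real time."""
    return inference_seconds / audio_seconds(num_samples, sampling_rate)


for num_samples, seconds in runs:
    print(f"{audio_seconds(num_samples):.3f} s of audio, "
          f"RTF {real_time_factor(seconds, num_samples):.3f}")
```

The first sentence is slower than real time, while the second and third are far faster than real time.
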
2026-05-15 14:39:57,738 - INFO - ============================================================
2026-05-15 14:39:57,738 - INFO - mms-tts-eng NPU 推理测试
2026-05-15 14:39:57,738 - INFO - ============================================================
2026-05-15 14:39:57,738 - INFO - Model dir: /data/ysws/agentsp/5-15/mms-tts-eng
2026-05-15 14:39:57,739 - INFO - Output dir: /data/ysws/agentsp/5-15/mms-tts-eng-ascend
2026-05-15 14:39:57,739 - INFO - NPU available: True
2026-05-15 14:39:57,739 - INFO - NPU device count: 8
2026-05-15 14:39:59,391 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:39:59,393 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:39:59,393 - INFO - ============================================================
2026-05-15 14:39:59,393 - INFO - Inference Test on npu:0
2026-05-15 14:39:59,393 - INFO - ============================================================
2026-05-15 14:40:04,574 - INFO - Device: npu:0
2026-05-15 14:40:04,574 - INFO - Loading tokenizer...
2026-05-15 14:40:06,121 - INFO - Model loaded successfully
2026-05-15 14:40:06,121 - INFO - Processing 3 sentences...
2026-05-15 14:40:06,123 - INFO - Sentence 1: "Hello world, this is a test of text to speech."
2026-05-15 14:40:06,123 - INFO - Input IDs shape: torch.Size([1, 89])
2026-05-15 14:40:29,155 - INFO - Waveform shape: torch.Size([1, 65024])
2026-05-15 14:40:29,155 - INFO - Inference time: 7.992s
2026-05-15 14:40:29,156 - INFO - Waveform min/max: -0.5980 / 0.8434
2026-05-15 14:40:29,162 - INFO - Saved audio to: /data/ysws/agentsp/5-15/mms-tts-eng-ascend/output_1.wav
2026-05-15 14:40:29,165 - INFO - Sentence 2: "The VITS model can generate natural sounding speech."
2026-05-15 14:40:29,165 - INFO - Input IDs shape: torch.Size([1, 103])
2026-05-15 14:40:29,327 - INFO - Waveform shape: torch.Size([1, 61952])
2026-05-15 14:40:29,327 - INFO - Inference time: 0.108s
2026-05-15 14:40:29,328 - INFO - Waveform min/max: -0.7546 / 0.8913
2026-05-15 14:40:29,329 - INFO - Saved audio to: /data/ysws/agentsp/5-15/mms-tts-eng-ascend/output_2.wav
2026-05-15 14:40:29,331 - INFO - Sentence 3: "Massively Multilingual Speech enables TTS across many languages."
2026-05-15 14:40:29,331 - INFO - Input IDs shape: torch.Size([1, 127])
2026-05-15 14:40:29,483 - INFO - Waveform shape: torch.Size([1, 89088])
2026-05-15 14:40:29,483 - INFO - Inference time: 0.106s
2026-05-15 14:40:29,483 - INFO - Waveform min/max: -0.7774 / 0.8253
2026-05-15 14:40:29,485 - INFO - Saved audio to: /data/ysws/agentsp/5-15/mms-tts-eng-ascend/output_3.wav
2026-05-15 14:40:29,489 - INFO - ============================================================
2026-05-15 14:40:29,490 - INFO - INFERENCE RESULT
2026-05-15 14:40:29,490 - INFO - ============================================================
2026-05-15 14:40:29,490 - INFO - Output waveform shape: torch.Size([1, 89088])
2026-05-15 14:40:29,490 - INFO - Inference time: 0.106s
2026-05-15 14:40:29,490 - INFO - ============================================================
2026-05-15 14:40:29,490 - INFO - Test Complete!
2026-05-15 14:40:29,490 - INFO - ============================================================

2026-05-15 14:33:18,944 - INFO - ============================================================
2026-05-15 14:33:18,945 - INFO - mms-tts-eng NPU 推理测试
2026-05-15 14:33:18,945 - INFO - ============================================================
2026-05-15 14:33:18,945 - INFO - Model dir: /data/ysws/agentsp/5-15/mms-tts-eng
2026-05-15 14:33:18,945 - INFO - Output dir: /data/ysws/agentsp/5-15/mms-tts-eng-ascend
2026-05-15 14:33:18,945 - INFO - NPU available: True
2026-05-15 14:33:18,946 - INFO - NPU device count: 8
2026-05-15 14:33:20,525 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:33:20,526 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:33:20,526 - INFO - ============================================================
2026-05-15 14:33:20,526 - INFO - Precision Test: CPU vs NPU (threshold: 1.0%)
2026-05-15 14:33:20,526 - INFO - ============================================================
2026-05-15 14:33:25,933 - INFO - Loading tokenizer...
2026-05-15 14:33:25,942 - INFO - Loading model for CPU...
2026-05-15 14:33:26,989 - INFO - Loading model for NPU...
2026-05-15 14:33:28,446 - INFO - Input text: "This is a precision test for the VITS TTS model."
2026-05-15 14:33:28,446 - INFO - Input IDs shape: torch.Size([1, 95])
2026-05-15 14:33:28,449 - INFO - Running inference on CPU...
2026-05-15 14:33:36,185 - INFO - Running inference on NPU...
2026-05-15 14:33:58,772 - INFO - Waveform CPU shape: torch.Size([1, 54016])
2026-05-15 14:33:58,773 - INFO - Waveform NPU shape: torch.Size([1, 54016])
2026-05-15 14:33:58,773 - INFO - Hidden states CPU shape: (1, 95, 192)
2026-05-15 14:33:58,773 - INFO - Hidden states NPU shape: (1, 95, 192)
2026-05-15 14:33:58,773 - INFO - CPU inference time: 7.735s
2026-05-15 14:33:58,773 - INFO - NPU inference time: 8.298s
2026-05-15 14:33:58,774 - INFO - === Text Encoder Hidden States Precision ===
2026-05-15 14:33:58,775 - INFO - Max relative error: 5.192067e-05 (0.0052%)
2026-05-15 14:33:58,775 - INFO - Cosine similarity: 1.000000 (0.0000% angular error)
2026-05-15 14:33:58,775 - INFO - === Attention Weights Precision ===
2026-05-15 14:33:58,775 - INFO - Max relative error: 6.563108e-04 (0.0656%)
2026-05-15 14:33:58,775 - INFO - Cosine similarity: 1.000000
2026-05-15 14:33:58,775 - INFO - PASS: True (threshold: 1.0%, hidden states cosine similarity: 1.000000)
2026-05-15 14:33:58,783 - INFO - ============================================================
2026-05-15 14:33:58,783 - INFO - PRECISION TEST RESULT
2026-05-15 14:33:58,783 - INFO - ============================================================
2026-05-15 14:33:58,783 - INFO - Spectrogram cosine similarity: 1.000000
2026-05-15 14:33:58,783 - INFO - CPU time: 7.735s
2026-05-15 14:33:58,783 - INFO - NPU time: 8.298s
2026-05-15 14:33:58,783 - INFO - PASS: True
2026-05-15 14:33:58,783 - INFO - ============================================================
2026-05-15 14:33:58,784 - INFO - Test Complete!
2026-05-15 14:33:58,784 - INFO - ============================================================

import torch
import torch_npu  # Ascend adapter for PyTorch; needed so that the "npu" device is available
from transformers import VitsModel, AutoTokenizer
MODEL_DIR = "/data/ysws/agentsp/5-15/mms-tts-eng"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = VitsModel.from_pretrained(MODEL_DIR)
model = model.to("npu:0")
model.eval()
text = "This is a test sentence for text to speech synthesis."
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to("npu:0") for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
waveform = outputs.waveform[0].cpu().numpy()
print(f"Waveform shape: {waveform.shape}") # [num_samples]
print(f"Sample rate: {model.config.sampling_rate}") # 16000

import scipy.io.wavfile
waveform = outputs.waveform[0].cpu().numpy()
scipy.io.wavfile.write("output.wav", rate=model.config.sampling_rate, data=waveform)

import numpy as np
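# model_cpu / model_npu: the same checkpoint loaded on the CPU and on the NPU.
# inputs / inputs_npu: the same tokenized text placed on each device.
# hidden_states is only populated when output_hidden_states=True is passed.
with torch.no_grad():
    outputs_cpu = model_cpu(**inputs, output_hidden_states=True)
    outputs_npu = model_npu(**inputs_npu, output_hidden_states=True)
hidden_states_cpu = outputs_cpu.hidden_states[-1].cpu().numpy()
hidden_states_npu = outputs_npu.hidden_states[-1].cpu().numpy()
cosine_sim = np.dot(hidden_states_cpu.flatten(), hidden_states_npu.flatten()) / (
    np.linalg.norm(hidden_states_cpu.flatten()) * np.linalg.norm(hidden_states_npu.flatten())
)
print(f"Cosine similarity: {cosine_sim:.6f}") # > 0.99 counts as passing

| Component | Description |
|---|---|
| text_encoder | Transformer encoder that processes the text input |
| duration_predictor | Stochastic duration predictor |
| flow | Flow-based acoustic feature generation |
| decoder | HiFi-GAN-style vocoder |
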
Key parameters extracted from config.json:
{
"hidden_size": 192,
"intermediate_size": 768,
"num_attention_heads": 2,
"num_hidden_layers": 6,
"sampling_rate": 16000,
"vocab_size": 38,
"noise_scale": 0.667,
"noise_scale_duration": 0.8,
"speaking_rate": 1.0
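}

A: The VITS model uses a stochastic duration predictor, so each inference run produces a different waveform (this is by design). The precision test therefore compares the text encoder's hidden states, which are the deterministic part of the model, to verify consistency between CPU and NPU. A hidden-state cosine similarity above 0.99 indicates that the model is working correctly on the NPU.
A: Check that the sampling rate is set correctly (it should be 16000 Hz), and make sure the audio data is saved in .wav format.
A: The first inference includes compilation overhead. The VITS model is computationally heavy, and in the precision test above the NPU inference time is roughly on par with the CPU.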
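
Because the first call carries the compilation overhead mentioned above, a timing measurement is more representative when warm-up calls are excluded. The following is a minimal, framework-independent sketch of that idea; `time_after_warmup` and its `sync` parameter are hypothetical names introduced here, not part of inference.py. On an NPU, `sync` would presumably be a device synchronization call such as `torch.npu.synchronize` (an assumption about the installed torch_npu), so that asynchronous work is finished before the clock is read.

```python
import time
from typing import Callable, Optional


def time_after_warmup(
    fn: Callable[[], object],
    warmup: int = 1,
    repeats: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return mean wall-clock seconds per call, excluding the warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: absorbs one-time compilation and caching costs
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for queued device work before reading the clock
    return (time.perf_counter() - start) / repeats


# Stand-in workload so the sketch runs without a model or an NPU.
calls = []
mean_seconds = time_after_warmup(lambda: calls.append(1), warmup=1, repeats=3)
print(len(calls), mean_seconds >= 0.0)
```

In a real measurement, `fn` would wrap the model call from the usage example, for instance `lambda: model(**inputs)` inside `torch.no_grad()`.
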

This project is licensed under CC-BY-NC-4.0.