
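mms-tts-eng Ascend NPU Deployment Guide

Overview

mms-tts-eng (Massively Multilingual Speech, English Text-to-Speech) is a VITS model developed by Meta AI for English speech synthesis. It converts text into natural-sounding English speech waveforms and outputs audio at a 16 kHz sampling rate.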

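Features

Inference acceleration on Ascend NPU
VITS end-to-end speech synthesis architecture
CPU vs. NPU precision comparison test (cosine similarity > 0.99)
16 kHz WAV audio output
Compatible with HuggingFace transformers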

Requirements

Hardware: Huawei Ascend 910 series NPU (the logs in this document come from Ascend910B3)
CANN: 8.0.RC1 or later
PyTorch: 2.8.0+ with torch_npu
Docker: container named test-modelagent
transformers: 4.33+

Directory Structure

mms-tts-eng-ascend/
├── inference.py          # Inference test script
├── output_1.wav          # Test audio output 1
├── output_2.wav          # Test audio output 2
├── output_3.wav          # Test audio output 3
├── log.txt               # Test log
└── README.md             # This document

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

3. Prepare the model files

The model files are located in /data/ysws/agentsp/5-15/mms-tts-eng/:

model.safetensors - model weights
config.json - model configuration
tokenizer.json / vocab.json - tokenizer files
tokenizer_config.json - tokenizer configuration
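
Before launching the script, it can help to confirm that the files listed above are actually present. The following is a minimal sketch and not part of inference.py; the helper name `missing_model_files` is hypothetical, and treating tokenizer.json and vocab.json as alternatives (either one is enough) is an assumption drawn from the slash in the list.

```python
from pathlib import Path

# File names taken from the list above. tokenizer.json and vocab.json are
# treated as alternatives (assumption): one of them being present is enough.
REQUIRED_FILES = ["model.safetensors", "config.json", "tokenizer_config.json"]
TOKENIZER_ALTERNATIVES = ["tokenizer.json", "vocab.json"]


def missing_model_files(model_dir):
    """Return the names of expected model files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in TOKENIZER_ALTERNATIVES):
        missing.append(" or ".join(TOKENIZER_ALTERNATIVES))
    return missing


if __name__ == "__main__":
    problems = missing_model_files("/data/ysws/agentsp/5-15/mms-tts-eng")
    print("All model files found" if not problems else f"Missing: {problems}")
```

An empty list means every expected file was found; anything else names what still has to be copied into the model directory.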

Usage

Option 1: Standard inference mode

Run the inference script to synthesize speech from text:

cd /data/ysws/agentsp/5-15/mms-tts-eng-ascend/

python3 inference.py --mode inference --device npu:0

Option 2: Precision test mode (CPU vs. NPU)

Run the precision comparison test to verify that the NPU results match the CPU results:

cd /data/ysws/agentsp/5-15/mms-tts-eng-ascend/

python3 inference.py --mode precision_test

Command-Line Arguments

Argument	Description	Default
`--mode`	Test mode: inference or precision_test	`inference`
`--device`	Device to run on	`npu:0` (auto-detected)
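
The source of inference.py is not shown here, so the following is only a sketch of how the two flags in the table could be parsed and how an auto-detected default device might be chosen. The function names `default_device`, `build_parser`, and `parse_args` are hypothetical, and the fallback to "cpu" when no NPU is found is an assumption rather than documented behavior.

```python
import argparse
import importlib.util


def default_device():
    """Pick "npu:0" when torch_npu is installed and reports an NPU, else "cpu"."""
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the "npu" backend)
        return "npu:0" if torch.npu.is_available() else "cpu"
    except Exception:
        # Any import or driver problem falls back to CPU in this sketch.
        return "cpu"


def build_parser():
    """Build a parser with the two flags from the table above."""
    parser = argparse.ArgumentParser(description="mms-tts-eng inference test")
    parser.add_argument("--mode", choices=["inference", "precision_test"],
                        default="inference", help="test mode")
    parser.add_argument("--device", default=None,
                        help="device to run on; auto-detected when omitted")
    return parser


def parse_args(argv=None):
    """Parse argv and fill in the device when the user did not pass one."""
    args = build_parser().parse_args(argv)
    if args.device is None:
        args.device = default_device()
    return args


if __name__ == "__main__":
    print(parse_args([]))
```

Restricting `--mode` with `choices` makes a misspelled mode fail immediately with a usage message instead of silently running the wrong test.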

Test Validation

Precision Test Results

Metric	Measured	Threshold	Status
Hidden states relative error	0.0052%	< 1.00%	PASS
Hidden states cosine similarity	1.000000	> 0.99	PASS
Attention relative error	0.0656%	< 1.00%	PASS
Attention cosine similarity	1.000000	> 0.99	PASS
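
The two kinds of metric in the table can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the exact formula inference.py uses for "relative error" is not shown, so the definition below (largest absolute difference divided by the largest reference magnitude) is an assumption, and the sample values are made up.

```python
import math


def max_relative_error(reference, candidate):
    """Largest absolute difference, normalised by the largest reference magnitude."""
    scale = max(abs(x) for x in reference)
    return max(abs(x - y) for x, y in zip(reference, candidate)) / scale


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes(reference, candidate, max_rel_err=0.01, min_cosine=0.99):
    """Apply the thresholds from the table: error < 1% and cosine > 0.99."""
    return (max_relative_error(reference, candidate) < max_rel_err
            and cosine_similarity(reference, candidate) > min_cosine)


if __name__ == "__main__":
    cpu = [0.50, -1.25, 2.00, 0.75]        # made-up values for illustration
    npu = [0.5001, -1.2499, 2.0002, 0.75]
    print(f"relative error: {max_relative_error(cpu, npu):.4%}")
    print(f"cosine similarity: {cosine_similarity(cpu, npu):.6f}")
    print("PASS" if passes(cpu, npu) else "FAIL")
```

Checking both metrics is useful because cosine similarity ignores a uniform scaling error, while the relative error catches it.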

Performance Data

Operation	Time
CPU inference time (1 sentence)	7.735s
NPU inference time (1 sentence, first call)	8.298s

Sample Inference Results

Input sentence	Output shape	Sampling rate	Inference time
"Hello world, this is a test..."	[1, 67328]	16kHz	8.179s
"The VITS model can generate..."	[1, 58112]	16kHz	0.099s
"Massively Multilingual Speech..."	[1, 80896]	16kHz	0.096s

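The output shapes in the sample results are sample counts, so dividing by the sampling rate gives the audio length; for the first row, 67328 / 16000 = 4.208 s. The sketch below, with hypothetical helper names, does that conversion and also derives a real-time factor (inference time divided by audio duration), which is not a figure reported by the test script.

```python
def duration_seconds(num_samples, sampling_rate=16000):
    """Length of the generated audio in seconds."""
    return num_samples / sampling_rate


def real_time_factor(inference_seconds, num_samples, sampling_rate=16000):
    """Inference time per second of audio; below 1.0 is faster than real time."""
    return inference_seconds / duration_seconds(num_samples, sampling_rate)


if __name__ == "__main__":
    # (label, sample count, inference time) copied from the sample results table.
    rows = [
        ("Hello world...", 67328, 8.179),
        ("The VITS model...", 58112, 0.099),
        ("Massively Multilingual...", 80896, 0.096),
    ]
    for label, samples, seconds in rows:
        print(f"{label}: {duration_seconds(samples):.3f}s of audio, "
              f"RTF {real_time_factor(seconds, samples):.3f}")
```
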
Test Logs

Inference mode log (log_inference.txt)

2026-05-15 14:39:57,738 - INFO - ============================================================
2026-05-15 14:39:57,738 - INFO - mms-tts-eng NPU 推理测试
2026-05-15 14:39:57,738 - INFO - ============================================================
2026-05-15 14:39:57,738 - INFO - Model dir: /data/ysws/agentsp/5-15/mms-tts-eng
2026-05-15 14:39:57,739 - INFO - Output dir: /data/ysws/agentsp/5-15/mms-tts-eng-ascend
2026-05-15 14:39:57,739 - INFO - NPU available: True
2026-05-15 14:39:57,739 - INFO - NPU device count: 8
2026-05-15 14:39:59,391 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:39:59,393 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:39:59,393 - INFO - ============================================================
2026-05-15 14:39:59,393 - INFO - Inference Test on npu:0
2026-05-15 14:39:59,393 - INFO - ============================================================
2026-05-15 14:40:04,574 - INFO - Device: npu:0
2026-05-15 14:40:04,574 - INFO - Loading tokenizer...
2026-05-15 14:40:06,121 - INFO - Model loaded successfully
2026-05-15 14:40:06,121 - INFO - Processing 3 sentences...
2026-05-15 14:40:06,123 - INFO - Sentence 1: "Hello world, this is a test of text to speech."
2026-05-15 14:40:06,123 - INFO - Input IDs shape: torch.Size([1, 89])
2026-05-15 14:40:29,155 - INFO - Waveform shape: torch.Size([1, 65024])
2026-05-15 14:40:29,155 - INFO - Inference time: 7.992s
2026-05-15 14:40:29,156 - INFO - Waveform min/max: -0.5980 / 0.8434
2026-05-15 14:40:29,162 - INFO - Saved audio to: /data/ysws/agentsp/5-15/mms-tts-eng-ascend/output_1.wav
2026-05-15 14:40:29,165 - INFO - Sentence 2: "The VITS model can generate natural sounding speech."
2026-05-15 14:40:29,165 - INFO - Input IDs shape: torch.Size([1, 103])
2026-05-15 14:40:29,327 - INFO - Waveform shape: torch.Size([1, 61952])
2026-05-15 14:40:29,327 - INFO - Inference time: 0.108s
2026-05-15 14:40:29,328 - INFO - Waveform min/max: -0.7546 / 0.8913
2026-05-15 14:40:29,329 - INFO - Saved audio to: /data/ysws/agentsp/5-15/mms-tts-eng-ascend/output_2.wav
2026-05-15 14:40:29,331 - INFO - Sentence 3: "Massively Multilingual Speech enables TTS across many languages."
2026-05-15 14:40:29,331 - INFO - Input IDs shape: torch.Size([1, 127])
2026-05-15 14:40:29,483 - INFO - Waveform shape: torch.Size([1, 89088])
2026-05-15 14:40:29,483 - INFO - Inference time: 0.106s
2026-05-15 14:40:29,483 - INFO - Waveform min/max: -0.7774 / 0.8253
2026-05-15 14:40:29,485 - INFO - Saved audio to: /data/ysws/agentsp/5-15/mms-tts-eng-ascend/output_3.wav
2026-05-15 14:40:29,489 - INFO - ============================================================
2026-05-15 14:40:29,490 - INFO - INFERENCE RESULT
2026-05-15 14:40:29,490 - INFO - ============================================================
2026-05-15 14:40:29,490 - INFO - Output waveform shape: torch.Size([1, 89088])
2026-05-15 14:40:29,490 - INFO - Inference time: 0.106s
2026-05-15 14:40:29,490 - INFO - ============================================================
2026-05-15 14:40:29,490 - INFO - Test Complete!
2026-05-15 14:40:29,490 - INFO - ============================================================

Precision test mode log (log_precision.txt)

2026-05-15 14:33:18,944 - INFO - ============================================================
2026-05-15 14:33:18,945 - INFO - mms-tts-eng NPU 推理测试
2026-05-15 14:33:18,945 - INFO - ============================================================
2026-05-15 14:33:18,945 - INFO - Model dir: /data/ysws/agentsp/5-15/mms-tts-eng
2026-05-15 14:33:18,945 - INFO - Output dir: /data/ysws/agentsp/5-15/mms-tts-eng-ascend
2026-05-15 14:33:18,945 - INFO - NPU available: True
2026-05-15 14:33:18,946 - INFO - NPU device count: 8
2026-05-15 14:33:20,525 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:33:20,526 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:33:20,526 - INFO - ============================================================
2026-05-15 14:33:20,526 - INFO - Precision Test: CPU vs NPU (threshold: 1.0%)
2026-05-15 14:33:20,526 - INFO - ============================================================
2026-05-15 14:33:25,933 - INFO - Loading tokenizer...
2026-05-15 14:33:25,942 - INFO - Loading model for CPU...
2026-05-15 14:33:26,989 - INFO - Loading model for NPU...
2026-05-15 14:33:28,446 - INFO - Input text: "This is a precision test for the VITS TTS model."
2026-05-15 14:33:28,446 - INFO - Input IDs shape: torch.Size([1, 95])
2026-05-15 14:33:28,449 - INFO - Running inference on CPU...
2026-05-15 14:33:36,185 - INFO - Running inference on NPU...
2026-05-15 14:33:58,772 - INFO - Waveform CPU shape: torch.Size([1, 54016])
2026-05-15 14:33:58,773 - INFO - Waveform NPU shape: torch.Size([1, 54016])
2026-05-15 14:33:58,773 - INFO - Hidden states CPU shape: (1, 95, 192)
2026-05-15 14:33:58,773 - INFO - Hidden states NPU shape: (1, 95, 192)
2026-05-15 14:33:58,773 - INFO - CPU inference time: 7.735s
2026-05-15 14:33:58,773 - INFO - NPU inference time: 8.298s
2026-05-15 14:33:58,774 - INFO - === Text Encoder Hidden States Precision ===
2026-05-15 14:33:58,775 - INFO - Max relative error: 5.192067e-05 (0.0052%)
2026-05-15 14:33:58,775 - INFO - Cosine similarity: 1.000000 (0.0000% angular error)
2026-05-15 14:33:58,775 - INFO - === Attention Weights Precision ===
2026-05-15 14:33:58,775 - INFO - Max relative error: 6.563108e-04 (0.0656%)
2026-05-15 14:33:58,775 - INFO - Cosine similarity: 1.000000
2026-05-15 14:33:58,775 - INFO - PASS: True (threshold: 1.0%, hidden states cosine similarity: 1.000000)
2026-05-15 14:33:58,783 - INFO - ============================================================
2026-05-15 14:33:58,783 - INFO - PRECISION TEST RESULT
2026-05-15 14:33:58,783 - INFO - ============================================================
2026-05-15 14:33:58,783 - INFO - Spectrogram cosine similarity: 1.000000
2026-05-15 14:33:58,783 - INFO - CPU time: 7.735s
2026-05-15 14:33:58,783 - INFO - NPU time: 8.298s
2026-05-15 14:33:58,783 - INFO - PASS: True
2026-05-15 14:33:58,783 - INFO - ============================================================
2026-05-15 14:33:58,784 - INFO - Test Complete!
2026-05-15 14:33:58,784 - INFO - ============================================================

Python API Usage Examples

Basic inference

import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import VitsModel, AutoTokenizer

MODEL_DIR = "/data/ysws/agentsp/5-15/mms-tts-eng"

# Load the tokenizer and model, then move the model to the first NPU.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = VitsModel.from_pretrained(MODEL_DIR)
model = model.to("npu:0")
model.eval()

text = "This is a test sentence for text to speech synthesis."
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to("npu:0") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# outputs.waveform has shape [batch, num_samples]; take the first item.
waveform = outputs.waveform[0].cpu().numpy()
print(f"Waveform shape: {waveform.shape}")  # (num_samples,)
print(f"Sample rate: {model.config.sampling_rate}")  # 16000

Saving the audio file

import scipy.io.wavfile

# Reuses `outputs` and `model` from the basic inference example.
waveform = outputs.waveform[0].cpu().numpy()
scipy.io.wavfile.write("output.wav", rate=model.config.sampling_rate, data=waveform)
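
Passing a float32 array to scipy.io.wavfile.write produces a 32-bit float WAV file, which some players may not open. If that turns out to be a problem, one alternative is to convert the samples to 16-bit PCM first. The sketch below uses only the standard library; `write_pcm16_wav` is a hypothetical helper, and the sine tone merely stands in for a real waveform.

```python
import math
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, float(value)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


if __name__ == "__main__":
    # One second of a 440 Hz tone as placeholder data.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
    write_pcm16_wav("tone_16k.wav", tone)
```

With the model output this would be called as `write_pcm16_wav("output.wav", waveform.tolist(), model.config.sampling_rate)`.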

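Precision verification

import numpy as np
import torch

# Assumes model_cpu and model_npu are the same VitsModel loaded on the CPU and on
# "npu:0", `inputs` is the tokenizer output on the CPU, and `inputs_npu` holds the
# same tensors moved to "npu:0".
with torch.no_grad():
    outputs_cpu = model_cpu(**inputs, output_hidden_states=True)
    outputs_npu = model_npu(**inputs_npu, output_hidden_states=True)

# Compare the text encoder's last hidden state, which is deterministic.
hidden_states_cpu = outputs_cpu.hidden_states[-1].cpu().numpy().flatten()
hidden_states_npu = outputs_npu.hidden_states[-1].cpu().numpy().flatten()

cosine_sim = np.dot(hidden_states_cpu, hidden_states_npu) / (
    np.linalg.norm(hidden_states_cpu) * np.linalg.norm(hidden_states_npu)
)
print(f"Cosine similarity: {cosine_sim:.6f}")  # > 0.99 counts as a pass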

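Model Architecture

Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
Text encoder: 6-layer Transformer, hidden size 192, 2 attention heads
Decoder: HiFi-GAN-style transposed convolutions
Sampling rate: 16000 Hz
Output: waveform (float32)

Component	Description
text_encoder	Transformer encoder that processes the text input
duration_predictor	Stochastic duration predictor
flow	Flow-based acoustic feature generation
decoder	HiFi-GAN-style vocoder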

Inference Parameters

Key parameters extracted from config.json:

{
  "hidden_size": 192,
  "intermediate_size": 768,
  "num_attention_heads": 2,
  "num_hidden_layers": 6,
  "sampling_rate": 16000,
  "vocab_size": 38,
  "noise_scale": 0.667,
  "noise_scale_duration": 0.8,
  "speaking_rate": 1.0
}
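
To relate these values to the architecture summary, the sketch below parses the same excerpt and derives a few quantities from it. The helper `summarize_config` and the derived key names are hypothetical; in practice the JSON would be read from config.json in the model directory rather than from an inline string.

```python
import json

# Inline copy of the excerpt above so the example runs without the model files.
CONFIG_TEXT = """
{
  "hidden_size": 192,
  "intermediate_size": 768,
  "num_attention_heads": 2,
  "num_hidden_layers": 6,
  "sampling_rate": 16000,
  "vocab_size": 38,
  "noise_scale": 0.667,
  "noise_scale_duration": 0.8,
  "speaking_rate": 1.0
}
"""


def summarize_config(config):
    """Derive per-head width, feed-forward expansion, and samples per millisecond."""
    return {
        "head_dim": config["hidden_size"] // config["num_attention_heads"],
        "ffn_expansion": config["intermediate_size"] / config["hidden_size"],
        "samples_per_ms": config["sampling_rate"] / 1000,
    }


if __name__ == "__main__":
    print(summarize_config(json.loads(CONFIG_TEXT)))
```
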

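FAQ

Q: Why does the waveform precision test fail?

A: The VITS model uses a stochastic duration predictor, so every inference produces a different waveform (this is by design). The precision test therefore compares the text encoder's hidden states, which are deterministic, to check that CPU and NPU agree. A hidden-state cosine similarity above 0.99 indicates that the model works correctly on the NPU.

Q: Why is the generated audio noisy or distorted?

A: Check that the sampling rate is set correctly (it should be 16000 Hz), and make sure the audio data is saved in .wav format.

Q: How can I speed up inference?

A: The first inference on the NPU includes a one-time compilation overhead: in the inference log above, the first sentence takes about 8 s and the following sentences about 0.1 s each. Run one warm-up inference before timing or serving requests. The NPU time in the precision test (8.298s) comes from a single first call, which is why it is roughly on par with the CPU time (7.735s).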
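
Because the first call carries the compilation overhead, timing is more meaningful when a warm-up call is excluded. The sketch below shows one way to do that with a generic callable; `time_with_warmup` is a hypothetical helper, the workload is a stand-in, and passing torch.npu.synchronize as `sync` for the real model is an assumption about how device work would be flushed before reading the clock.

```python
import time


def time_with_warmup(fn, warmup=1, runs=3, sync=None):
    """Call fn `warmup` times untimed, then return per-call seconds for `runs` calls."""
    def call():
        fn()
        if sync is not None:
            sync()  # wait for queued device work before reading the clock

    for _ in range(warmup):
        call()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; with the real model this would be something like
    # fn=lambda: model(**inputs) and sync=torch.npu.synchronize (assumption).
    timings = time_with_warmup(lambda: sum(i * i for i in range(100_000)))
    print([f"{t:.4f}s" for t in timings])
```

Reporting the timed runs separately from the warm-up keeps the one-time cost from dominating the average.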

References

Original model: https://huggingface.co/facebook/mms-tts-eng
VITS paper: https://arxiv.org/abs/2106.06103
MMS project: https://arxiv.org/abs/2305.13516
HuggingFace Transformers: https://huggingface.co/transformers

License

This project is released under the CC-BY-NC-4.0 license.