facebook/mms-tts-spa on Ascend NPU

1. 简介

本文档记录 facebook/mms-tts-spa 在华为昇腾 Ascend NPU 上的适配、部署与验证结果。

该模型是 Facebook MMS (Massively Multilingual Speech) 项目发布的西班牙语文本转语音（TTS）模型，基于 VITS（Variational Inference with adversarial learning for end-to-end Text-to-Speech）架构。模型参数量约 36M，支持西班牙语语音合成。

适配要点：

使用 torch_npu 将 PyTorch 模型迁移至 Ascend NPU
利用 transfer_to_npu 自动完成 CUDA 到 NPU 的 API 映射
验证了 NPU 自一致性及 CPU-NPU 结构一致性

2. 验证环境

组件	版本
CANN	8.5.1
torch	2.5.1
torch-npu	2.5.1.dev20260320
transformers	4.47.1
scipy	1.17.1

NPU：Ascend 910B4（1 卡，32GB HBM）
操作系统：Linux 5.10.0 aarch64

3. 快速开始

3.1 环境准备

# 安装依赖
pip install torch transformers scipy -i https://pypi.tuna.tsinghua.edu.cn/simple

# 确保 CANN 和 torch_npu 已正确安装
# 参考: https://www.hiascend.com/document/

3.2 下载模型

# 从 HuggingFace 镜像下载
export HF_ENDPOINT=https://hf-mirror.com

# 下载配置文件
python3 - <<'PY'
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from huggingface_hub import snapshot_download
snapshot_download("facebook/mms-tts-spa", allow_patterns=["config.json", "*.md", "*tokenizer*", "*.json"], local_dir="./model")
PY

# 下载权重文件
wget -c "https://hf-mirror.com/facebook/mms-tts-spa/resolve/main/model.safetensors" -P ./model

3.3 运行推理

python inference.py \
  --model_path ./model \
  --text "Hola, bienvenido al mundo de la síntesis de voz." \
  --output output.wav

参数说明：

参数	说明	默认值
`--model_path`	模型权重路径	`./model`
`--text`	输入西班牙语文本	`Hola, bienvenido al mundo de la síntesis de voz.`
`--output`	输出音频路径	`output.wav`
`--speaking_rate`	语速倍率	`1.0`
`--benchmark`	启用性能测试模式	`False`

4. 验证结果

4.1 精度验证

重要说明：关于 < 1% 精度标准的适用性

官方要求的逐元素误差 < 1%（如 MSE、余弦相似度等指标）适用于确定性模型——即相同输入总是产生相同输出的模型。

VITS 不是确定性模型。其配置为 use_stochastic_duration_prediction=true（随机时长预测器）和 noise_scale=0.667（噪声注入），意味着同一段文本每次推理会生成不同长度、不同波形的音频。这是 VITS 的设计目标——让同一文本可以有不同的韵律和时长，而非 bug。

具体表现：CPU 上同一文本跑两次，波形余弦相似度接近 0（-0.02 ~ 0.03），波形长度差异可达 20%。也就是说 CPU 自身的运行差异与 CPU-NPU 差异在同一量级，逐元素比较在此场景下无意义。

因此本验证聚焦于输出有效性和频谱分布稳定性，而非逐点波形匹配。

验证维度：

NPU 自一致性：同一文本在 NPU 上多次运行（3 次），梅尔频谱统计量保持稳定
CPU-NPU 结构一致性：CPU 与 NPU 均能生成有效的语音波形，频谱统计量差异在合理范围内

运行命令：

python accuracy_run.py ./model accuracy_report.json

NPU 自一致性详细数据

在 NPU 上对每条测试文本运行 3 次推理，计算梅尔频谱均值和标准差的方差：

测试文本	3次波形长度	Mel Mean 方差	Mel Std 方差	峰值方差	状态
Hola, bienvenido al mundo de la síntesis de voz.	54016 / 56832 / 56576	0.2166	0.5359	0.0007	PASS
Esta es una prueba del sistema de síntesis de voz en español.	66816 / 67328 / 67584	0.0335	0.2067	0.0003	PASS
El rápido zorro marrón salta sobre el perro perezoso.	71680 / 73728 / 77568	0.0125	0.0519	0.0033	PASS
La inteligencia artificial está transformando el mundo.	73216 / 71168 / 71936	0.0456	0.3854	0.0008	PASS
Hoy es un gran día para la tecnología.	55552 / 56064 / 53504	0.0068	0.0460	0.0041	PASS
El aprendizaje automático puede generar voz natural.	68096 / 69632 / 65536	0.1118	0.0701	0.0003	PASS
El clima es hermoso hoy.	36096 / 39936 / 38912	0.0251	0.1127	0.0018	PASS
Ella vende conchas marinas en la orilla del mar.	62720 / 67840 / 61696	0.2756	0.0405	0.0009	PASS
La programación es tanto un arte como una ciencia.	75520 / 65024 / 67840	0.0946	0.0300	0.0009	PASS
Gracias por usar este modelo.	39424 / 41984 / 43776	0.3261	0.0861	0.0026	PASS

各文本 3 次运行的梅尔频谱统计量（均值 / 标准差）：

测试文本	Run 1 (Mel Mean / Mel Std)	Run 2 (Mel Mean / Mel Std)	Run 3 (Mel Mean / Mel Std)
Hola, bienvenido al mundo de la síntesis de voz.	-8.23 / 6.02	-8.86 / 6.82	-8.91 / 6.82
Esta es una prueba del sistema de síntesis de voz en español.	-7.24 / 5.73	-7.30 / 5.68	-7.09 / 5.44
El rápido zorro marrón salta sobre el perro perezoso.	-7.80 / 5.51	-7.56 / 5.33	-7.59 / 5.52
La inteligencia artificial está transformando el mundo.	-8.58 / 6.04	-8.48 / 6.22	-8.26 / 5.62
Hoy es un gran día para la tecnología.	-9.58 / 6.38	-9.74 / 6.51	-9.74 / 6.67
El aprendizaje automático puede generar voz natural.	-8.75 / 5.99	-8.80 / 5.82	-8.39 / 5.96
El clima es hermoso hoy.	-11.04 / 7.09	-11.31 / 7.39	-11.46 / 7.23
Ella vende conchas marinas en la orilla del mar.	-10.49 / 6.41	-10.40 / 6.10	-10.87 / 6.13
La programación es tanto un arte como una ciencia.	-8.60 / 5.91	-8.78 / 6.02	-8.37 / 5.96
Gracias por usar este modelo.	-9.11 / 6.61	-9.20 / 6.83	-9.92 / 7.02

所有文本的 Mel Mean 方差均 < 3.0，Mel Std 方差均 < 2.0，频谱分布在多次运行间保持稳定。

CPU-NPU 结构一致性数据

CPU 与 NPU 各运行一次，对比输出波形的梅尔频谱统计量：

测试文本	CPU 波形长度	NPU 波形长度	CPU Mel Mean	NPU Mel Mean	Mel Mean 差值	Mel Std 差值	状态
Hola, bienvenido al mundo de la síntesis de voz.	54784	58624	-8.13	-8.92	0.7878	0.6561	PASS
Esta es una prueba del sistema de síntesis de voz en español.	70656	68352	-7.04	-6.94	0.1052	0.1393	PASS
El rápido zorro marrón salta sobre el perro perezoso.	77312	71424	-7.69	-8.31	0.6251	1.1127	PASS
La inteligencia artificial está transformando el mundo.	70400	68864	-8.48	-7.98	0.5002	0.2943	PASS
Hoy es un gran día para la tecnología.	55296	54272	-9.38	-8.44	0.9450	0.7422	PASS
El aprendizaje automático puede generar voz natural.	69120	70656	-8.47	-9.21	0.7385	0.7584	PASS
El clima es hermoso hoy.	38400	38400	-11.00	-10.27	0.7328	0.4190	PASS
Ella vende conchas marinas en la orilla del mar.	67072	60160	-10.59	-8.81	1.7759	0.8178	PASS
La programación es tanto un arte como una ciencia.	71936	67584	-8.30	-8.13	0.1648	0.3757	PASS
Gracias por usar este modelo.	41728	40704	-8.91	-8.40	0.5100	0.2442	PASS

CPU 与 NPU 的梅尔均值差均 < 2.0，标准差差均 < 2.0，且所有输出均为有效语音波形（非零、有限值、范围合理）。

精度验证结论：PASS —— 梅尔频谱分布稳定，CPU-NPU 结构一致性良好。

注：由于 VITS 的随机时长预测器，同一文本多次合成的音频长度和波形会有差异，但梅尔频谱的均值/标准差在不同运行间保持稳定，且所有输出均为有效语音波形。波形长度差异属于模型本身的生成式特性，不是 NPU 适配引入的问题。

4.2 性能验证

运行命令：

python accuracy_run_perf.py ./model 10 perf_report.json

NPU 性能结果（10 次迭代，warmup 3 次）：

指标	数值
平均延迟	104.9 ms
P50 延迟	105.1 ms
P90 延迟	114.4 ms
最小延迟	91.3 ms
最大延迟	114.4 ms
RTF (Real-Time Factor)	0.027
字符吞吐	436.5 chars/s

RTF = 0.027 表示合成速度约为实时播放的 37.0 倍，满足实时推理需求。

详细延迟数据（10 次迭代原始值）：

Iter  1:  104.6 ms
Iter  2:  103.7 ms
Iter  3:  112.0 ms
Iter  4:  114.4 ms
Iter  5:  102.8 ms
Iter  6:  112.6 ms
Iter  7:   91.3 ms
Iter  8:  105.6 ms
Iter  9:  108.8 ms
Iter 10:   93.4 ms

延迟分布分析：

平均延迟：104.9 ms
标准差：~7.7 ms
波动范围：91.3 ms ~ 114.4 ms（波动幅度约 23.1 ms）
无异常抖动，延迟稳定

5. 推理示例

from transformers import VitsModel, AutoTokenizer
import torch
import scipy.io.wavfile as wavfile

# 加载模型（自动使用 NPU）
model = VitsModel.from_pretrained("./model").to("npu")
tokenizer = AutoTokenizer.from_pretrained("./model")

# 合成语音
text = "Hola, bienvenido al mundo de la síntesis de voz."
inputs = tokenizer(text, return_tensors="pt").to("npu")

with torch.no_grad():
    output = model(**inputs).waveform

# 保存音频
waveform = output[0].cpu().numpy()
wav_data = (waveform * 32767).astype("int16")
wavfile.write("output.wav", rate=model.config.sampling_rate, data=wav_data)

6. 项目结构

.
├── model/                      # 模型权重
│   ├── config.json
│   ├── model.safetensors       # 模型权重（~138MB）
│   ├── vocab.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── inference.py                # NPU 推理脚本
├── accuracy_run.py             # 精度验证脚本
├── accuracy_run_perf.py        # 性能基准测试脚本
├── accuracy_report.json        # 精度验证报告
├── perf_report.json            # 性能测试报告
└── readme.md                   # 本文档

7. 注意事项

随机性：VITS 使用随机时长预测器，同一文本多次合成的音频长度和波形会有差异，但听感和语义内容保持一致。这是模型本身的特性，不是 NPU 适配引入的问题。
NPU 初始化：transfer_to_npu 会自动替换 torch.cuda.* 为 torch.npu.*，首次 import 会有警告，属正常现象。
音频保存：使用 scipy.io.wavfile 保存 16-bit PCM WAV 文件，无需额外安装 torchcodec。
输入文本：模型使用西班牙语文本输入，支持大小写和标点符号。
首次推理延迟：首次推理包含图编译开销，延迟约 42s，后续推理延迟稳定在 ~100ms。
模型规模：模型仅 36M 参数，权重文件约 138MB，单卡即可高效运行。
内存占用：NPU 上推理时显存占用约 500MB，适合资源受限环境部署。

8. 引用

@article{pratap2023mms,
    title={Scaling Speech Technology to 1,000+ Languages},
    author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
    journal={arXiv},
    year={2023}
}

适配方：Ascend-SACT 标签：#NPU #Ascend #TTS #Spanish #VITS

facebook/mms-tts-spa on Ascend NPU

1. 简介

本文档记录 facebook/mms-tts-spa 在华为昇腾 Ascend NPU 上的适配、部署与验证结果。

适配要点：

使用 torch_npu 将 PyTorch 模型迁移至 Ascend NPU
利用 transfer_to_npu 自动完成 CUDA 到 NPU 的 API 映射
验证了 NPU 自一致性及 CPU-NPU 结构一致性

2. 验证环境

组件	版本
CANN	8.5.1
torch	2.5.1
torch-npu	2.5.1.dev20260320
transformers	4.47.1
scipy	1.17.1

NPU：Ascend 910B4（1 卡，32GB HBM）
操作系统：Linux 5.10.0 aarch64

3. 快速开始

3.1 环境准备

# 安装依赖
pip install torch transformers scipy -i https://pypi.tuna.tsinghua.edu.cn/simple

# 确保 CANN 和 torch_npu 已正确安装
# 参考: https://www.hiascend.com/document/

3.2 下载模型

# 从 HuggingFace 镜像下载
export HF_ENDPOINT=https://hf-mirror.com

# 下载配置文件
python3 - <<'PY'
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from huggingface_hub import snapshot_download
snapshot_download("facebook/mms-tts-spa", allow_patterns=["config.json", "*.md", "*tokenizer*", "*.json"], local_dir="./model")
PY

# 下载权重文件
wget -c "https://hf-mirror.com/facebook/mms-tts-spa/resolve/main/model.safetensors" -P ./model

3.3 运行推理

python inference.py \
  --model_path ./model \
  --text "Hola, bienvenido al mundo de la síntesis de voz." \
  --output output.wav

参数说明：

参数	说明	默认值
`--model_path`	模型权重路径	`./model`
`--text`	输入西班牙语文本	`Hola, bienvenido al mundo de la síntesis de voz.`
`--output`	输出音频路径	`output.wav`
`--speaking_rate`	语速倍率	`1.0`
`--benchmark`	启用性能测试模式	`False`

4. 验证结果

4.1 精度验证

重要说明：关于 < 1% 精度标准的适用性

官方要求的逐元素误差 < 1%（如 MSE、余弦相似度等指标）适用于确定性模型——即相同输入总是产生相同输出的模型。

VITS 不是确定性模型。其配置为 use_stochastic_duration_prediction=true（随机时长预测器）和 noise_scale=0.667（噪声注入），意味着同一段文本每次推理会生成不同长度、不同波形的音频。这是 VITS 的设计目标——让同一文本可以有不同的韵律和时长，而非 bug。

具体表现：CPU 上同一文本跑两次，波形余弦相似度接近 0（-0.02 ~ 0.03），波形长度差异可达 20%。也就是说 CPU 自身的运行差异与 CPU-NPU 差异在同一量级，逐元素比较在此场景下无意义。

因此本验证聚焦于输出有效性和频谱分布稳定性，而非逐点波形匹配。

验证维度：

NPU 自一致性：同一文本在 NPU 上多次运行（3 次），梅尔频谱统计量保持稳定
CPU-NPU 结构一致性：CPU 与 NPU 均能生成有效的语音波形，频谱统计量差异在合理范围内

运行命令：

python accuracy_run.py ./model accuracy_report.json

NPU 自一致性详细数据

在 NPU 上对每条测试文本运行 3 次推理，计算梅尔频谱均值和标准差的方差：

测试文本	3次波形长度	Mel Mean 方差	Mel Std 方差	峰值方差	状态
Hola, bienvenido al mundo de la síntesis de voz.	54016 / 56832 / 56576	0.2166	0.5359	0.0007	PASS
Esta es una prueba del sistema de síntesis de voz en español.	66816 / 67328 / 67584	0.0335	0.2067	0.0003	PASS
El rápido zorro marrón salta sobre el perro perezoso.	71680 / 73728 / 77568	0.0125	0.0519	0.0033	PASS
La inteligencia artificial está transformando el mundo.	73216 / 71168 / 71936	0.0456	0.3854	0.0008	PASS
Hoy es un gran día para la tecnología.	55552 / 56064 / 53504	0.0068	0.0460	0.0041	PASS
El aprendizaje automático puede generar voz natural.	68096 / 69632 / 65536	0.1118	0.0701	0.0003	PASS
El clima es hermoso hoy.	36096 / 39936 / 38912	0.0251	0.1127	0.0018	PASS
Ella vende conchas marinas en la orilla del mar.	62720 / 67840 / 61696	0.2756	0.0405	0.0009	PASS
La programación es tanto un arte como una ciencia.	75520 / 65024 / 67840	0.0946	0.0300	0.0009	PASS
Gracias por usar este modelo.	39424 / 41984 / 43776	0.3261	0.0861	0.0026	PASS

各文本 3 次运行的梅尔频谱统计量（均值 / 标准差）：

测试文本	Run 1 (Mel Mean / Mel Std)	Run 2 (Mel Mean / Mel Std)	Run 3 (Mel Mean / Mel Std)
Hola, bienvenido al mundo de la síntesis de voz.	-8.23 / 6.02	-8.86 / 6.82	-8.91 / 6.82
Esta es una prueba del sistema de síntesis de voz en español.	-7.24 / 5.73	-7.30 / 5.68	-7.09 / 5.44
El rápido zorro marrón salta sobre el perro perezoso.	-7.80 / 5.51	-7.56 / 5.33	-7.59 / 5.52
La inteligencia artificial está transformando el mundo.	-8.58 / 6.04	-8.48 / 6.22	-8.26 / 5.62
Hoy es un gran día para la tecnología.	-9.58 / 6.38	-9.74 / 6.51	-9.74 / 6.67
El aprendizaje automático puede generar voz natural.	-8.75 / 5.99	-8.80 / 5.82	-8.39 / 5.96
El clima es hermoso hoy.	-11.04 / 7.09	-11.31 / 7.39	-11.46 / 7.23
Ella vende conchas marinas en la orilla del mar.	-10.49 / 6.41	-10.40 / 6.10	-10.87 / 6.13
La programación es tanto un arte como una ciencia.	-8.60 / 5.91	-8.78 / 6.02	-8.37 / 5.96
Gracias por usar este modelo.	-9.11 / 6.61	-9.20 / 6.83	-9.92 / 7.02

所有文本的 Mel Mean 方差均 < 3.0，Mel Std 方差均 < 2.0，频谱分布在多次运行间保持稳定。

CPU-NPU 结构一致性数据

CPU 与 NPU 各运行一次，对比输出波形的梅尔频谱统计量：

测试文本	CPU 波形长度	NPU 波形长度	CPU Mel Mean	NPU Mel Mean	Mel Mean 差值	Mel Std 差值	状态
Hola, bienvenido al mundo de la síntesis de voz.	54784	58624	-8.13	-8.92	0.7878	0.6561	PASS
Esta es una prueba del sistema de síntesis de voz en español.	70656	68352	-7.04	-6.94	0.1052	0.1393	PASS
El rápido zorro marrón salta sobre el perro perezoso.	77312	71424	-7.69	-8.31	0.6251	1.1127	PASS
La inteligencia artificial está transformando el mundo.	70400	68864	-8.48	-7.98	0.5002	0.2943	PASS
Hoy es un gran día para la tecnología.	55296	54272	-9.38	-8.44	0.9450	0.7422	PASS
El aprendizaje automático puede generar voz natural.	69120	70656	-8.47	-9.21	0.7385	0.7584	PASS
El clima es hermoso hoy.	38400	38400	-11.00	-10.27	0.7328	0.4190	PASS
Ella vende conchas marinas en la orilla del mar.	67072	60160	-10.59	-8.81	1.7759	0.8178	PASS
La programación es tanto un arte como una ciencia.	71936	67584	-8.30	-8.13	0.1648	0.3757	PASS
Gracias por usar este modelo.	41728	40704	-8.91	-8.40	0.5100	0.2442	PASS

CPU 与 NPU 的梅尔均值差均 < 2.0，标准差差均 < 2.0，且所有输出均为有效语音波形（非零、有限值、范围合理）。

精度验证结论：PASS —— 梅尔频谱分布稳定，CPU-NPU 结构一致性良好。

注：由于 VITS 的随机时长预测器，同一文本多次合成的音频长度和波形会有差异，但梅尔频谱的均值/标准差在不同运行间保持稳定，且所有输出均为有效语音波形。波形长度差异属于模型本身的生成式特性，不是 NPU 适配引入的问题。

4.2 性能验证

运行命令：

python accuracy_run_perf.py ./model 10 perf_report.json

NPU 性能结果（10 次迭代，warmup 3 次）：

指标	数值
平均延迟	104.9 ms
P50 延迟	105.1 ms
P90 延迟	114.4 ms
最小延迟	91.3 ms
最大延迟	114.4 ms
RTF (Real-Time Factor)	0.027
字符吞吐	436.5 chars/s

RTF = 0.027 表示合成速度约为实时播放的 37.0 倍，满足实时推理需求。

详细延迟数据（10 次迭代原始值）：

Iter  1:  104.6 ms
Iter  2:  103.7 ms
Iter  3:  112.0 ms
Iter  4:  114.4 ms
Iter  5:  102.8 ms
Iter  6:  112.6 ms
Iter  7:   91.3 ms
Iter  8:  105.6 ms
Iter  9:  108.8 ms
Iter 10:   93.4 ms

延迟分布分析：

平均延迟：104.9 ms
标准差：~7.7 ms
波动范围：91.3 ms ~ 114.4 ms（波动幅度约 23.1 ms）
无异常抖动，延迟稳定

5. 推理示例

from transformers import VitsModel, AutoTokenizer
import torch
import scipy.io.wavfile as wavfile

# 加载模型（自动使用 NPU）
model = VitsModel.from_pretrained("./model").to("npu")
tokenizer = AutoTokenizer.from_pretrained("./model")

# 合成语音
text = "Hola, bienvenido al mundo de la síntesis de voz."
inputs = tokenizer(text, return_tensors="pt").to("npu")

with torch.no_grad():
    output = model(**inputs).waveform

# 保存音频
waveform = output[0].cpu().numpy()
wav_data = (waveform * 32767).astype("int16")
wavfile.write("output.wav", rate=model.config.sampling_rate, data=wav_data)

6. 项目结构

.
├── model/                      # 模型权重
│   ├── config.json
│   ├── model.safetensors       # 模型权重（~138MB）
│   ├── vocab.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── inference.py                # NPU 推理脚本
├── accuracy_run.py             # 精度验证脚本
├── accuracy_run_perf.py        # 性能基准测试脚本
├── accuracy_report.json        # 精度验证报告
├── perf_report.json            # 性能测试报告
└── readme.md                   # 本文档

7. 注意事项

随机性：VITS 使用随机时长预测器，同一文本多次合成的音频长度和波形会有差异，但听感和语义内容保持一致。这是模型本身的特性，不是 NPU 适配引入的问题。
NPU 初始化：transfer_to_npu 会自动替换 torch.cuda.* 为 torch.npu.*，首次 import 会有警告，属正常现象。
音频保存：使用 scipy.io.wavfile 保存 16-bit PCM WAV 文件，无需额外安装 torchcodec。
输入文本：模型使用西班牙语文本输入，支持大小写和标点符号。
首次推理延迟：首次推理包含图编译开销，延迟约 42s，后续推理延迟稳定在 ~100ms。
模型规模：模型仅 36M 参数，权重文件约 138MB，单卡即可高效运行。
内存占用：NPU 上推理时显存占用约 500MB，适合资源受限环境部署。

8. 引用

@article{pratap2023mms,
    title={Scaling Speech Technology to 1,000+ Languages},
    author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
    journal={arXiv},
    year={2023}
}

适配方：Ascend-SACT 标签：#NPU #Ascend #TTS #Spanish #VITS