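An adaptation of Parler-TTS Mini v0.1 for Huawei Ascend NPUs, supporting text-to-speech (TTS) inference on the Ascend 910.

Parler-TTS Mini v0.1 is a lightweight text-to-speech model trained on 10.5K hours of audio data. It generates high-quality, natural-sounding speech, and the speech characteristics (such as gender, background noise, speaking rate, pitch, and reverberation) can be controlled through a text description.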

Environment used for testing:

| Component | Version |
|---|---|
| PyTorch | 2.9.0 |
| torch_npu | 2.9.0.post1 |
| CANN | 8.5.1 |
| transformers | 4.46.1 |
| parler_tts | 0.2.2 |
| NPU | Ascend 910 x2 |

Download the model:
pip install modelscope
modelscope download --model AI-ModelScope/parler_tts_mini_v0.1

Install dependencies:
pip install git+https://github.com/huggingface/parler-tts.git
pip install soundfile scipy numpy
pip install transformers==4.46.1

Run inference:
# NPU inference
python inference.py --device npu
# CPU inference
python inference.py --device cpu
# NPU vs. CPU accuracy comparison
python inference.py --device compare

Inference example in Python:
import torch
import torch_npu
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, GenerationConfig
MODEL_DIR = ".cache/modelscope/hub/models/AI-ModelScope/parler_tts_mini_v0___1"
model = ParlerTTSForConditionalGeneration.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model.generation_config = GenerationConfig.from_pretrained(MODEL_DIR)
model.generation_config.do_sample = False
model = model.to("npu:0")
model.eval()
description = "A female speaker, very clear audio quality."
prompt = "Hello, this is a test of NPU text to speech."
inputs = tokenizer(description, return_tensors="pt")
prompt_inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generation = model.generate(
        input_ids=inputs.input_ids.to("npu:0"),
        attention_mask=inputs.attention_mask.to("npu:0"),
        prompt_input_ids=prompt_inputs.input_ids.to("npu:0"),
    )
audio = generation.cpu().numpy().squeeze()
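import soundfile as sf
sf.write("output.wav", audio, 44100)

Accuracy is compared with a multi-level evaluation strategy:

| Test | Encoder (T5) relative MAE | Audio relative MAE | Encoder cosine similarity | Audio cosine similarity | NPU time | CPU time | Speedup |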
|---|---|---|---|---|---|---|---|
| Test 1 | 0.000011% | 2.87% | 1.00000000 | 0.947596 | 11.43s | 202.87s | 17.75x |
| Test 2 | 0.000009% | 0.22% | 1.00000000 | 0.986712 | 10.86s | 203.13s | 18.70x |
| Test 3 | 0.000010% | 4.46% | 1.00000000 | 0.328113 | 10.86s | 203.75s | 18.76x |
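
The encoder outputs agree almost exactly (relative MAE around 0.00001%, cosine similarity 1.0), which indicates that the model's computation on the NPU matches the CPU. The differences in the audio waveforms come mainly from floating-point differences that accumulate during autoregressive decoding. This is normal behavior for a TTS model and does not affect speech quality or intelligibility.
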
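
The two metrics in the table can be illustrated with a small dependency-free sketch. The exact formulas used to produce `accuracy_report.json` are not stated here, so the normalization of the relative MAE (mean absolute error divided by the mean absolute reference value) and the truncation of both signals to a common length are assumptions, and the helper names are hypothetical.

```python
import math


def truncate_to_common_length(a, b):
    """Trim both sequences to the shorter length so they can be compared."""
    n = min(len(a), len(b))
    return a[:n], b[:n]


def relative_mae(candidate, reference):
    """Mean absolute error divided by the mean absolute reference value.

    Assumed definition: the report does not state its formula.
    """
    candidate, reference = truncate_to_common_length(candidate, reference)
    abs_error = sum(abs(c - r) for c, r in zip(candidate, reference))
    abs_reference = sum(abs(r) for r in reference)
    if abs_reference == 0:
        raise ValueError("reference is all zeros; relative MAE is undefined")
    return abs_error / abs_reference


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    a, b = truncate_to_common_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy waveforms standing in for NPU and CPU outputs (illustrative values only).
    npu_audio = [0.10, -0.20, 0.30, 0.05]
    cpu_audio = [0.10, -0.21, 0.29]
    print("relative MAE:", relative_mae(npu_audio, cpu_audio))
    print("cosine similarity:", cosine_similarity(npu_audio, cpu_audio))
```

Because autoregressive decoding can produce waveforms of different lengths on the two devices, a sample-wise comparison like this is sensitive to small timing shifts, which is one reason a low audio cosine similarity does not by itself imply degraded speech.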

| Metric | NPU (Ascend 910) | CPU | Speedup |
|---|---|---|---|
| Inference latency (single run) | ~11s | ~203s | ~18x |
| RTF (real-time factor) | ~0.04 | ~0.67 | ~17x |
| Throughput | ~0.09 utterances/s | ~0.005 utterances/s | ~18x |
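
The speedup and throughput figures follow directly from the measured latencies. The sketch below recomputes them from the per-test timings reported above; the function names are hypothetical, and throughput is assumed to be one utterance per sequential run. RTF is left out because the audio duration it was computed from is not given.

```python
def speedup(cpu_seconds, npu_seconds):
    """How many times faster the NPU run is than the CPU run."""
    return cpu_seconds / npu_seconds


def throughput(latency_seconds):
    """Utterances per second, assuming one utterance per sequential run."""
    return 1.0 / latency_seconds


if __name__ == "__main__":
    # (NPU seconds, CPU seconds) for Test 1 to Test 3, taken from the accuracy table.
    timings = [(11.43, 202.87), (10.86, 203.13), (10.86, 203.75)]
    for index, (npu_s, cpu_s) in enumerate(timings, start=1):
        print(
            f"Test {index}: speedup {speedup(cpu_s, npu_s):.2f}x, "
            f"NPU throughput {throughput(npu_s):.3f} utt/s, "
            f"CPU throughput {throughput(cpu_s):.4f} utt/s"
        )
```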
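
| File | Description |
|---|---|
| inference.py | NPU inference script |
| README.md | Deployment documentation (this document) |
| accuracy_report.json | Accuracy evaluation report (JSON) |
| output_npu_test*.wav | NPU inference audio output |
| output_cpu_test*.wav | CPU inference audio output |
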
tags:
- text-to-speech
- ascend-npu
- NPU
- Hardware/NPU
- ascend
- huawei
- pytorch
- parler-tts
- tts
- audio-generation
hardware:
- Huawei Ascend 910 NPU
- CANN 8.5.1

@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

License: Apache 2.0