Qwen/Qwen3-TTS-12Hz-0.6B-Base on Ascend NPU

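1. Introduction

This document records the adaptation and validation results for Qwen/Qwen3-TTS-12Hz-0.6B-Base on the Ascend NPU (Ascend910B).

Qwen3-TTS is a multilingual, controllable, robust text-to-speech model with streaming generation support. This repository adapts the Base model (0.6B) to the Ascend NPU and supports voice cloning from a reference audio clip.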

2. Validation Environment

Component	Version
`transformers`	`>=4.57.0`
`torch`	`>=2.0`
`torch-npu`	`2.9.0.post1`
`modelscope`	latest
`librosa`	latest
`scipy`	latest
`soundfile`	latest
`qwen-tts`	`0.1.1`

NPU: Ascend910B
CANN: 8.5.1

3. Running Inference

Install the dependencies:

pip install -r requirements.txt

Run inference on the NPU:

python inference.py

4. NPU Inference Output

Input text (assets/test.txt):

Hello, this is a test of the text to speech system on Ascend NPU.

NPU inference result:

Model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
Input text: Hello, this is a test of the text to speech system on Ascend NPU.
Output audio: outputs/test.wav
Duration: 4.72s
Sample rate: 24000
Audio max amplitude: 0.500000

The output audio is valid:

Non-silent: True
No NaN: True
No Inf: True
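
The three validity checks above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual validation code: the function name `check_audio` and the `silence_threshold` value are assumptions, and a real pipeline would more likely operate on a NumPy array read from outputs/test.wav.

```python
import math
from typing import Dict, Iterable, Union


def check_audio(
    samples: Iterable[float], silence_threshold: float = 1e-4
) -> Dict[str, Union[bool, float]]:
    """Run basic validity checks on a mono waveform.

    silence_threshold is a hypothetical cut-off chosen for illustration;
    the source does not state which threshold it uses.
    """
    values = [float(s) for s in samples]
    has_nan = any(math.isnan(v) for v in values)
    has_inf = any(math.isinf(v) for v in values)
    # Peak amplitude is computed over finite samples only, so a stray
    # NaN or Inf cannot mask an otherwise silent signal.
    max_amplitude = max((abs(v) for v in values if math.isfinite(v)), default=0.0)
    return {
        "non_silent": max_amplitude > silence_threshold,
        "no_nan": not has_nan,
        "no_inf": not has_inf,
        "max_amplitude": max_amplitude,
    }


if __name__ == "__main__":
    print(check_audio([0.0, 0.25, -0.5]))
```

An empty or all-zero waveform is reported as silent, and NaN and Inf are tracked separately so each failure mode stays visible in the report.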

5. Performance Reference

Test conditions: 10 consecutive inference runs (including 3 warmup runs) with a fixed input text length.

Metric	Value
`avg_latency_ms`	`10237.2774`
`min_latency_ms`	`8597.2829`
`max_latency_ms`	`10950.9211`
`p50_latency_ms`	`10587.2877`
`p90_latency_ms`	`10852.4912`
`p95_latency_ms`	`10901.7062`
`audio_duration_sec`	`4.3200`
`real_time_factor`	`2.3697`
`num_runs`	`10`
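
The reported `real_time_factor` is consistent with average latency divided by audio duration (10.2373 s / 4.32 s ≈ 2.37), so a value above 1 means generation is slower than real time. The sketch below shows one way such a table could be produced. It is an assumption-laden illustration rather than the project's benchmark script: the helper names, the linear-interpolation percentile, and the choice to run warmup iterations separately from the timed runs are all hypothetical.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def benchmark(run_once: Callable[[], object], num_runs: int = 10, warmup: int = 3) -> List[float]:
    """Time run_once and return per-run latencies in milliseconds.

    Assumption: warmup runs are executed first and not timed. On an
    accelerator, run_once should block until the device has finished.
    """
    for _ in range(warmup):
        run_once()
    latencies_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: Sequence[float], audio_duration_sec: float) -> Dict[str, float]:
    """Aggregate latencies into the metrics listed in the table above."""
    avg = statistics.fmean(latencies_ms)
    return {
        "avg_latency_ms": avg,
        "min_latency_ms": min(latencies_ms),
        "max_latency_ms": max(latencies_ms),
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p90_latency_ms": percentile(latencies_ms, 90),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "audio_duration_sec": audio_duration_sec,
        # Seconds of compute per second of generated audio.
        "real_time_factor": (avg / 1000.0) / audio_duration_sec,
        "num_runs": len(latencies_ms),
    }


if __name__ == "__main__":
    # A trivial stand-in workload; replace with the real inference call.
    print(summarize(benchmark(lambda: sum(range(10000))), audio_duration_sec=4.32))
```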

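6. CPU-NPU Accuracy Validation

This model is an autoregressive speech model whose final audio is produced by sampling and is therefore stochastic, so the Mel spectrogram comparison is for reference only. We use the deterministic intermediate layer (text embeddings) as the authoritative accuracy criterion.

Metric	Value	Threshold	Result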
`hidden_states_relative_error`	`0.0000%`	`< 1.0%`	PASS
`hidden_states_cosine_sim`	`1.000000`	`> 0.99`	PASS
`mel_relative_error`	`26.7211%`	`< 5.0%`	Reference only (caused by random sampling)
`mel_cosine_sim`	`0.945885`	`> 0.95`	Reference only (caused by random sampling)
`npu_is_silent`	`False`	must be `False`	PASS
`npu_has_nan`	`False`	must be `False`	PASS
`npu_has_inf`	`False`	must be `False`	PASS
Final Result	—	—	PASS
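
The two hidden-state metrics in the table can be illustrated with the sketch below. The source does not define `hidden_states_relative_error`, so the L2-norm ratio used here is an assumption, as are the function names. The thresholds in `hidden_states_pass` are the ones listed in the table. Inputs are flattened sequences of floats; real code would typically work on tensors copied from CPU and NPU.

```python
import math
from typing import Sequence


def relative_error(candidate: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed definition: ||candidate - reference||_2 / ||reference||_2."""
    if len(candidate) != len(reference):
        raise ValueError("inputs must have the same length")
    diff_norm = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    if ref_norm == 0.0:
        # Identical all-zero inputs count as an exact match.
        return 0.0 if diff_norm == 0.0 else float("inf")
    return diff_norm / ref_norm


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hidden_states_pass(npu: Sequence[float], cpu: Sequence[float]) -> bool:
    """Apply the table's thresholds: relative error < 1.0% and cosine > 0.99."""
    return relative_error(npu, cpu) * 100.0 < 1.0 and cosine_similarity(npu, cpu) > 0.99


if __name__ == "__main__":
    cpu_out = [0.12, -0.50, 0.33]
    npu_out = [0.12, -0.50, 0.33]
    print(relative_error(npu_out, cpu_out), cosine_similarity(npu_out, cpu_out))
```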

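7. Notes

Voice cloning requires reference audio: the Base model does not support zero-shot text-only speech synthesis; a reference audio clip must be supplied via ref_audio for voice cloning. This project uses x_vector_only_mode=True, which extracts only the speaker embedding, so ref_text is not required.
Randomness: the model generates audio codebook tokens by autoregressive sampling. Even with do_sample=False, the code predictor inside the talker may still sample, so the final CPU and NPU waveforms can differ. The deterministic intermediate layer (text embeddings) is therefore used as the accuracy alignment criterion.
Dependencies: the qwen-tts package is the official inference wrapper. It registers the qwen3_tts model type with transformers, so the model can be loaded directly with Qwen3TTSModel.from_pretrained.
flash-attn: flash-attn is not installed in the current environment, so the model falls back to the manual PyTorch attention implementation. This does not affect functional correctness.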
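
The flash-attn fallback described in the last note can be made explicit in a launcher script. The sketch below is hypothetical: `pick_attn_implementation` is not part of qwen-tts or this repository, and whether inference.py selects the backend this way is an assumption. The returned strings are the values transformers uses for its `attn_implementation` argument, with `"eager"` being the plain PyTorch attention path.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return "flash_attention_2" if module_name is importable, else "eager".

    module_name is a parameter only so the check is easy to exercise;
    flash-attn installs the top-level module flash_attn.
    """
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "eager"


if __name__ == "__main__":
    # In an environment without flash-attn this prints "eager".
    print(pick_attn_implementation())
```

Checking with `find_spec` avoids importing the package just to test for its presence, which keeps the check cheap and free of side effects.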