kani-tts-400m-0.3-pt NPU Adaptation

Introduction

This repository provides an inference adaptation of the kani-tts-400m-0.3-pt model for Ascend NPUs.

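Model details:

Architecture: LFM2 (Liquid Foundation Model 2) ForCausalLM
Parameters: 400M
Sample rate: 22.05 kHz (22050 Hz)
Languages: multilingual (English, Japanese, German, Arabic, Chinese, Spanish, Korean, Kyrgyz)
Original weights: https://huggingface.co/nineninesix/kani-tts-400m-0.3-pt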

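Adaptation approach:

The LFM2 model is loaded onto the NPU, where it generates audio tokens autoregressively.
The NanoCodec audio decoder runs on the CPU (the NeMo framework does not yet support NPU).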

Requirements

Component	Version
Python	3.11
torch	2.4.0
torch-npu	2.4.0
transformers	>= 4.57.0
soundfile	>= 0.12.0

The NPU environment requires an Ascend 910B4 or later, with driver version 24.1.rc1 or newer.
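A quick way to confirm the Python-side versions from the table is to query the installed package metadata. The sketch below is an illustrative helper, not part of this repository: the name `check_environment` and the `REQUIRED` list are assumptions, and it only reports versions; it does not check the NPU driver.

```python
import sys
from importlib import metadata

# Packages listed in the requirements table (distribution names as on PyPI).
REQUIRED = ["torch", "torch-npu", "transformers", "soundfile"]


def check_environment(packages=REQUIRED):
    """Return {package: installed version or None}, plus the Python version."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Compare the printed versions against the table by hand; a missing package shows up as NOT INSTALLED rather than raising an error.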

Model Download

pip install huggingface-hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download nineninesix/kani-tts-400m-0.3-pt --local-dir /path/to/kani-tts-400m-0.3-pt

Inference Scripts

Single inference

python inference.py \
  --model-path /path/to/kani-tts-400m-0.3-pt \
  --text "Hello, this is a test of the text to speech system." \
  --output output.wav

Python API

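import sys

import soundfile as sf

# Make the adaptation package importable (adjust to where kani_npu is located)
sys.path.insert(0, "/opt/atomgit/output")
from kani_npu import KaniTTSNPU

# Load the model and synthesize speech from text
engine = KaniTTSNPU("/path/to/kani-tts-400m-0.3-pt")
audio = engine("Hello, this is a test of the text to speech system.")

# Write the waveform at the 22050 Hz sample rate
sf.write("output.wav", audio, 22050)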

Inference Results

Text	Audio duration	Generation time	Real-time factor (RTF)
Hello, this is a test of the text to speech system...	6.64s	65.3s	9.83
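The RTF column is the generation time divided by the duration of the produced audio, so a value above 1 means synthesis is slower than real time. A minimal sketch of that calculation, using the numbers from the table (the function name is illustrative):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Values from the results table: 65.3 s to generate 6.64 s of audio.
rtf = real_time_factor(65.3, 6.64)
print(f"RTF = {rtf:.2f}")
```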

Accuracy Evaluation

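NPU inference accuracy is verified by comparing per-token logits against a CPU float32 reference. The main metrics are cosine similarity, token match rate, and weighted relative error.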
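As an illustration of the first two metrics, the sketch below computes cosine similarity and token match rate from two sets of per-token logits using plain Python lists. It is not the code in eval_accuracy.py: the function names are hypothetical, a token match is assumed to mean that both runs select the same argmax token at a step, and the weighted relative error is omitted because its weighting is not defined here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_match_rate(ref_logits, test_logits):
    """Fraction of steps where both runs pick the same argmax token (assumed definition)."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return matches / len(ref_logits)


# Toy data: 2 decoding steps over a 3-token vocabulary.
cpu_ref = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
npu_out = [[1.98, 0.52, -1.01], [0.12, 2.97, 0.2]]

flat_ref = [v for row in cpu_ref for v in row]
flat_npu = [v for row in npu_out for v in row]
print(f"cosine similarity: {cosine_similarity(flat_ref, flat_npu):.6f}")
print(f"token match rate: {token_match_rate(cpu_ref, npu_out):.1%}")
```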

Run the accuracy test:

python eval_accuracy.py

The test texts cover 8 languages. Evaluation results:

Text	Cosine similarity	Token match rate	Weighted relative error	Result
Hello, this is a test of the text to speech system.	0.999912	100.0%	0.000%	PASS
今天天气真好。	0.999904	100.0%	0.476%	PASS
Hola, esto es una prueba.	0.999700	100.0%	2.817%	PASS

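Notes

The NPU has about 31 GB of memory, which easily holds the 400M-parameter model in bfloat16 (about 0.8 GB).
NanoCodec decoding runs on the CPU and adds roughly 1-2 seconds per inference.
Keep max_new_tokens at or below 3000; overly long generations tend to degrade in quality.
For multilingual input, use the punctuation of the corresponding language for best results.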
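The 0.8 GB figure in the first note follows from the parameter count and the 2-byte width of bfloat16. A back-of-the-envelope sketch (weights only; activations and caches are not included, and the helper name is illustrative):

```python
# Bytes occupied by one parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: int, dtype: str = "bfloat16") -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(weight_memory_gb(400_000_000))             # bfloat16 weights
print(weight_memory_gb(400_000_000, "float32"))  # float32 reference
```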