kani-tts-450m-0.1-ft NPU Adaptation

Introduction

This repository provides an inference adaptation of the kani-tts-450m-0.1-ft model for Ascend NPU.

Model details:

Architecture: LFM2 (Liquid Foundation Model 2) ForCausalLM
Parameters: 450M
Sample rate: 22.05 kHz (22050 Hz)
Language: English (fine-tuned on the Expresso conversational dataset)
Original weights: https://huggingface.co/nineninesix/kani-tts-450m-0.1-ft
Base model: kani-tts-450m-0.2-pt

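Adaptation approach:

The LFM2 model is loaded on the NPU and generates audio tokens autoregressively
The NanoCodec audio decoder runs on the CPU (the NeMo framework does not yet support NPU)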
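
The split above (language model on the NPU, codec on the CPU) can be expressed as a small device-selection helper. This is a minimal sketch, not the repository's actual code: `pick_devices` is a hypothetical name, and the check only confirms that the `torch_npu` package is installed, not that an NPU is physically present.

```python
import importlib.util


def pick_devices() -> dict:
    """Return device strings for the two pipeline stages (hypothetical helper)."""
    # torch_npu is the plugin that registers the "npu" device with PyTorch.
    # Finding the package does not prove a device is attached.
    has_npu_plugin = importlib.util.find_spec("torch_npu") is not None
    return {
        "lm": "npu:0" if has_npu_plugin else "cpu",  # LFM2 audio-token generation
        "codec": "cpu",  # NanoCodec decoding always stays on the CPU
    }


devices = pick_devices()
print(devices)
```

Keeping the codec device fixed makes the CPU fallback explicit: only the generation stage changes when the NPU plugin is missing.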

Requirements

Component	Version
Python	3.11
torch	2.4.0
torch-npu	2.4.0
transformers	>= 4.57.0
soundfile	>= 0.12.0

The NPU environment requires an Ascend 910B4 or later, with driver version 24.1.rc1 or later.
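
Before running inference, it can help to confirm that the installed packages meet the versions in the table above. The following is a minimal sketch using only the standard library. `REQUIRED`, `parse_version`, and `check_environment` are illustrative names, and torch and torch-npu are treated here as minimum versions, although the table pins them exactly.

```python
import sys
from importlib import metadata

# Versions copied from the requirements table; treated as minimums in this sketch.
REQUIRED = {
    "torch": (2, 4, 0),
    "torch-npu": (2, 4, 0),
    "transformers": (4, 57, 0),
    "soundfile": (0, 12, 0),
}


def parse_version(text: str) -> tuple:
    """Turn '2.4.0+cpu' or '2.4.0.post1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def check_environment() -> dict:
    """Map each required package to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in REQUIRED.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

The check does not cover the NPU driver version, which has to be verified with the Ascend tooling.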

Model Download

pip install huggingface-hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download nineninesix/kani-tts-450m-0.1-ft --local-dir /path/to/kani-tts-450m-0.1-ft
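
The same download can be done from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch under the assumption that the mirror endpoint above is the one you want. `download_model` is a hypothetical wrapper name.

```python
import os

REPO_ID = "nineninesix/kani-tts-450m-0.1-ft"


def download_model(local_dir: str, endpoint: str = "https://hf-mirror.com") -> str:
    """Download the model snapshot and return the local directory path."""
    # huggingface_hub reads HF_ENDPOINT when it is first imported,
    # so the variable is set before the import below.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_model("/path/to/kani-tts-450m-0.1-ft")` mirrors the CLI command. An `HF_ENDPOINT` already set in the environment takes precedence over the default argument.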

Inference Script

Single Inference

python inference.py \
  --model-path /path/to/kani-tts-450m-0.1-ft \
  --text "Hello, this is a test of the text to speech system." \
  --output output.wav

Python API

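import sys

import soundfile as sf

# Directory containing the kani_npu module; adjust for your environment
sys.path.insert(0, "/opt/atomgit/output")
from kani_npu import KaniTTSNPU

# Load the model and synthesize speech
engine = KaniTTSNPU("/path/to/kani-tts-450m-0.1-ft")
audio = engine("Hello, this is a test of the text to speech system.")

# Save the waveform at 22050 Hz
sf.write("output.wav", audio, 22050)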

Inference Results

Text	Audio duration	Generation time	Real-time factor (RTF)
Hello, this is a test of the text to speech system...	7.92s	50.2s	6.33
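
For reference, the real-time factor is the generation time divided by the duration of the audio produced, so a value above 1 means generation is slower than real time. A minimal sketch of the calculation follows (`real_time_factor` is an illustrative name). Recomputing from the rounded figures in the row above may differ from the reported value in the last digit.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures from the results row: 50.2 s of compute for 7.92 s of audio
rtf = real_time_factor(50.2, 7.92)
print(f"RTF: {rtf:.2f}")
```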

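Accuracy Evaluation

NPU inference accuracy is verified by comparing per-token logits against a CPU float32 reference. The main metrics are cosine similarity, token match rate, and weighted relative error.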

Run the accuracy test:

python eval_accuracy.py

The test texts cover English conversational scenarios. Results:

Text	Cosine similarity	Token match rate	Weighted relative error	Result
Hello, this is a test of the text to speech system.	0.999924	100.0%	0.000%	PASS
I do believe Marsellus Wallace, my husband, your boss...	0.999361	100.0%	0.400%	PASS
What do you call a lawyer with an IQ of 60? Your honor.	0.999576	100.0%	0.000%	PASS
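
To illustrate two of the metrics, the sketch below computes cosine similarity and token match rate (argmax agreement) over per-step logit vectors in plain Python. It is not the code in eval_accuracy.py. The function names, the toy data, and the choice to average cosine similarity across steps are assumptions. The weighted relative error is left out because this document does not define its weighting.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest element (the greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(reference, candidate):
    """Compare two lists of per-step logit vectors of identical shape."""
    cosines = [cosine_similarity(r, c) for r, c in zip(reference, candidate)]
    matches = [argmax(r) == argmax(c) for r, c in zip(reference, candidate)]
    return {
        "mean_cosine": sum(cosines) / len(cosines),  # assumed aggregation
        "token_match_rate": sum(matches) / len(matches),
    }


# Toy data: 2 steps, vocabulary of 3; the second set mimics small rounding noise
cpu_logits = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
npu_logits = [[1.99, 0.51, -1.0], [0.1, 2.98, 0.21]]
result = compare_logits(cpu_logits, npu_logits)
print(result)
```

A token match rate of 1.0 means the greedy token choice is identical at every step, even though the logits differ slightly.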

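Notes

The NPU has about 31 GB of memory, which easily holds the 450M-parameter model in bfloat16 (about 0.9 GB)
NanoCodec decoding runs on the CPU and adds about 1-2 seconds per inference
Keep max_new_tokens at or below 3000 to avoid quality degradation from overly long generations
This model is fine-tuned on the Expresso conversational dataset and is suited to English conversational speech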
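
The 0.9 GB figure in the first note follows from the parameter count times the bytes per parameter (2 bytes for bfloat16). A short sketch of that arithmetic, with `weight_memory_gb` as an illustrative name. It counts weights only and ignores activations, caches, and framework overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_memory_gb(450e6, 2)  # bfloat16: 2 bytes per parameter
fp32_gb = weight_memory_gb(450e6, 4)  # float32: 4 bytes per parameter
print(f"bfloat16: {bf16_gb:.1f} GB, float32: {fp32_gb:.1f} GB")
```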