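Kokoro is a series of small but powerful TTS models.
This model is the result of a short training run that added 100 Chinese speakers from a professional dataset. The Chinese data was provided to us free of charge by the professional dataset company LongMaoData. Thank you for making this model possible.
Separately, some crowdsourced synthetic English data also entered the training mix: [1]
- 1 hour of Maple, an American female.
- 1 hour of Sol, another American female.
- 1 hour of Vale, an older British female.

Because this model drops many voices, it is not a strict upgrade over its predecessor, but it is released early to gather feedback on the new voices and tokenization. Aside from the Chinese dataset and the 3 hours of English, the rest of the data was left out of this training run. The goal is to push the model series forward and eventually restore some of the voices that were left behind.
Current guidance from the U.S. Copyright Office indicates that synthetic data generally does not qualify for copyright protection. Because this synthetic data is crowdsourced, the model trainer is not bound by any Terms of Service. This Apache-licensed model also aligns with OpenAI's stated mission of broadly distributing the benefits of AI. If you would like to help further that mission, consider contributing permissively licensed audio data.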
Table 1. Hardware
| Device model | NPU configuration |
|---|---|
| Atlas 800I A2 | 8 × 64 GB |
| Atlas 800T A2 | 8 × 64 GB |
Table 2. Software version compatibility
| Component | Version | Environment setup guide |
|---|---|---|
| CANN | 8.3.RC2 | - |
| Python | 3.11.13 | - |
| torch | 2.7.1+cpu | - |
| torch_npu | 2.7.1 | - |
| transformers | 4.57.1 | - |
| vllm | 0.11.0+empty | - |
| vllm_ascend | 0.11.0rc2 | - |
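
The versions in Table 2 can be compared against what is actually installed in the container. The following is a minimal sketch, not part of the original setup: it assumes the pip distribution names match the component names in the table, which may not hold in every environment, and the helper names `installed_version` and `compare_versions` are hypothetical. CANN and Python itself are not pip packages, so they are left out.

```python
from importlib import metadata

# Expected versions copied from Table 2. The distribution names are an
# assumption and may differ from what pip reports in a given environment.
EXPECTED = {
    "torch": "2.7.1+cpu",
    "torch_npu": "2.7.1",
    "transformers": "4.57.1",
    "vllm": "0.11.0+empty",
    "vllm_ascend": "0.11.0rc2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to (expected, found, matches)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        report[name] = (want, found, found == want)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in compare_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:14} expected {want:14} found {found!s:14} {status}")
```

A mismatch reported here is only a hint to look closer; a different build tag does not necessarily mean the environment is unusable.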
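Method 1: Deploy from the vllm-ascend image
Click the download link and, on the page that opens, select version v0.11.0rc2.
1. Run the following command to pull the image:
docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-openeuler
2. Run the following command to check that the image was downloaded:
docker images | grep v0.11.0rc2-openeuler

Method 2: Run inference directly from the prebuilt image package (steps 3.1.3 and 3.1.5 are not required)
1. Run the following commands to download and load the image:
gitcode download Ascend-SACT/Kokoro
cd Kokoro
unzip kokoro_img_v1.tar.gz.zip
docker load -i kokoro_img_v1.tar.gz
2. Run the following command to check that the image was created:
docker images | grep kokoro_image

Kokoro-82M-v1.1-zh weights and configuration files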
| Model | Weights |
|---|---|
| Kokoro-82M-v1.1-zh | Hugging Face download link |
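
The scripts in this guide load the checkpoint, the configuration file, and a voice file from the downloaded model directory. The following is a minimal sketch for confirming that those files are present after the download; the relative paths are taken from the scripts in this guide, while the helper name `missing_files` and the command-line usage are assumptions added for illustration.

```python
import sys
from pathlib import Path

# Relative paths taken from the scripts in this guide.
REQUIRED_FILES = (
    "kokoro-v1_1-zh.pth",
    "config.json",
    "voices/zf_001.pt",
)


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required relative paths that are not files under model_dir."""
    root = Path(model_dir)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: pass the download directory as the first argument.
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_files(model_dir)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all required files found in", model_dir)
```

Running the check before starting inference surfaces an incomplete download as a clear message rather than as a load error inside the model code.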
docker run -itd -u root \
--net=host \
--privileged=true \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root:/root \
-p 8001:8001 \
--shm-size 1024g \
--name vllm-ascend-wlh2 \
quay.io/ascend/vllm-ascend:v0.11.0rc2-openeuler

Enter the container:
docker exec -it -u root vllm-ascend-wlh2 bash

Install the dependencies:
pip install -q kokoro
pip install -q soundfile
pip install -q "misaki[zh]>=0.8.2"

Download the en_core_web_sm-3.8.0 English model package from https://github.com/explosion/spacy-models/releases?q=en_core_web_sm&expanded=true and install it:
pip install en_core_web_sm-3.8.0-py3-none-any.whl

Change MODLE_DIR to the actual path in your environment:
# This file is hardcoded to transparently reproduce HEARME_zh.wav
# Therefore it may NOT generalize gracefully to other texts
# Refer to Usage in README.md for more general usage patterns
# pip install kokoro>=0.8.1 "misaki[zh]>=0.8.1"
from kokoro import KModel, KPipeline
from pathlib import Path
import numpy as np
import soundfile as sf
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
import tqdm
import os
MODLE_DIR = '/home/w00634320/Kokoro-82M-v1.1-zh'
MODLE_PATH = os.path.join(MODLE_DIR, 'kokoro-v1_1-zh.pth')
CONFIG_PATH = os.path.join(MODLE_DIR, 'config.json')
REPO_ID = "hexgrad/Kokoro-82M-v1.1-zh"
VOICE_PATH = os.path.join(MODLE_DIR, "voices/zf_001.pt")
SAMPLE_RATE = 24000
EN_VOICE_PATH = os.path.join(MODLE_DIR, "voices/af_maple.pt")
# How much silence to insert between paragraphs: 5000 is about 0.2 seconds
N_ZEROS = 5000
# Whether to join sentences in paragraphs 1 and 3
JOIN_SENTENCES = True
VOICE = 'zf_001' if True else 'zm_010'
device = 'npu'
texts = [(
"Kokoro 是一系列体积虽小但功能强大的 TTS 模型。",
), (
"该模型是经过短期训练的结果,从专业数据集中添加了100名中文使用者。",
"中文数据由专业数据集公司「龙猫数据」免费且无偿地提供给我们。感谢你们让这个模型成为可能。",
), (
"另外,一些众包合成英语数据也进入了训练组合:",
"1小时的 Maple,美国女性。",
"1小时的 Sol,另一位美国女性。",
"和1小时的 Vale,一位年长的英国女性。",
), (
"由于该模型删除了许多声音,因此它并不是对其前身的严格升级,但它提前发布以收集有关新声音和标记化的反馈。",
"除了中文数据集和3小时的英语之外,其余数据都留在本次训练中。",
"目标是推动模型系列的发展,并最终恢复一些被遗留的声音。",
), (
"美国版权局目前的指导表明,合成数据通常不符合版权保护的资格。",
"由于这些合成数据是众包的,因此模型训练师不受任何服务条款的约束。",
"该 Apache 许可模式也符合 OpenAI 所宣称的广泛传播 AI 优势的使命。",
"如果您愿意帮助进一步完成这一使命,请考虑为此贡献许可的音频数据。",
)]
if JOIN_SENTENCES:
    for i in (1, 3):
        texts[i] = [''.join(texts[i])]
# HACK: Mitigate rushing caused by lack of training data beyond ~100 tokens
# Simple piecewise linear fn that decreases speed as len_ps increases
def speed_callable(len_ps):
    speed = 0.8
    if len_ps <= 83:
        speed = 1
    elif len_ps < 183:
        speed = 1 - (len_ps - 83) / 500
    return speed * 1.1
en_pipeline = KPipeline(lang_code='a', repo_id=REPO_ID, model=False)
def en_callable(text):
    if text == 'Kokoro':
        return 'kˈOkəɹO'
    elif text == 'Sol':
        return 'sˈOl'
    return next(en_pipeline(text)).phonemes
model = KModel(repo_id=REPO_ID, config=CONFIG_PATH, model=MODLE_PATH).to(device).eval()
zh_pipeline = KPipeline(lang_code='z', repo_id=REPO_ID, model=model, en_callable=en_callable)
path = Path(__file__).parent
wavs = []
for paragraph in tqdm.tqdm(texts):
    for i, sentence in enumerate(paragraph):
        generator = zh_pipeline(sentence, voice=VOICE_PATH, speed=speed_callable)
        f = path / f'zh{len(wavs):02}.wav'
        result = next(generator)
        wav = result.audio
        sf.write(f, wav, SAMPLE_RATE)
        # Prepend silence to the first sentence of every paragraph after the first
        if i == 0 and wavs and N_ZEROS > 0:
            wav = np.concatenate([np.zeros(N_ZEROS), wav])
        wavs.append(wav)
sf.write(path / f'HEARME_{VOICE}.wav', np.concatenate(wavs), SAMPLE_RATE)

def transform(self, input_data):
    forward_transform = torch.stft(
        input_data,
        self.filter_length, self.hop_length, self.win_length, window=self.window.to(input_data.device),
        return_complex=True)
    # Compute the magnitude from the real and imaginary parts
    real_part = forward_transform.real
    imag_part = forward_transform.imag
    return torch.sqrt(real_part**2 + imag_part**2), torch.angle(forward_transform)

source /usr/local/Ascend/ascend-toolkit/set_env.sh
python test_infer.py

The test script is as follows:
# This file is hardcoded to transparently reproduce HEARME_zh.wav
# Therefore it may NOT generalize gracefully to other texts
# Refer to Usage in README.md for more general usage patterns
# pip install kokoro>=0.8.1 "misaki[zh]>=0.8.1"
from kokoro import KModel, KPipeline
from pathlib import Path
import numpy as np
import soundfile as sf
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
import tqdm
import os
import time
import threading
MODLE_DIR = '/home/xxxx/Kokoro-82M-v1.1-zh'
MODLE_PATH = os.path.join(MODLE_DIR, 'kokoro-v1_1-zh.pth')
CONFIG_PATH = os.path.join(MODLE_DIR, 'config.json')
REPO_ID = "hexgrad/Kokoro-82M-v1.1-zh"
VOICE_PATH = os.path.join(MODLE_DIR, "voices/zf_001.pt")
SAMPLE_RATE = 24000
# How much silence to insert between paragraphs: 5000 is about 0.2 seconds
N_ZEROS = 5000
# Whether to join sentences in paragraphs 1 and 3
JOIN_SENTENCES = True
VOICE = 'zf_001' if True else 'zm_010'
device = 'npu'
texts = [
"春天,是一年四季中最富有生机的季节。天气渐渐变暖,阳光变得柔和,大地披上了翠绿的新装。树木抽出嫩芽,花朵竞相开放,五彩斑斓,像是给世界增添了无数的欢笑。田野里,农民伯伯开始忙碌地播种,期待着秋天的丰收。公园中,孩子们奔跑嬉戏,老人在树下散步聊天,一切都显得那么和谐美好。",
"夏日的阳光像一把金色的琴弦,弹奏出热烈而欢快的旋律。蝉鸣是夏日最忠实的乐手,在树梢上不知疲倦地唱着。午后,阳光炙烤着大地,连空气都仿佛在微微颤动。但夏天也有它的温柔时刻,比如傍晚的凉风轻拂过脸颊,带来一丝清凉,或是夜晚微星光亮中,一家人围坐在院子里吃着西瓜,谈笑风生。",
"秋风拂过,带来了凉爽与成熟的味道。金黄的稻田在阳光下闪闪发光,仿佛大地披上了一件华丽的外衣。树叶从枝头缓缓飘落,像一只只翩翩起舞的蝴蝶,点缀着大地的画卷。秋天是收获的季节,果园里果实累累,苹果、柿子、葡萄,香甜的气息弥漫在空气中。人们忙着采摘,脸上洋溢着丰收的喜悦。",
"冬天悄然而至,寒风裹挟着雪花,为大地披上一层银装。清晨推开窗,白茫茫的世界仿佛被按下暂停键,宁静而纯净。树枝披上冰晶,屋檐垂下冰凌,阳光洒在雪地上,反射出耀眼的光芒。孩子们在雪地里奔跑、打雪仗、堆雪人,欢声笑语驱散了寒冷。"
]
# HACK: Mitigate rushing caused by lack of training data beyond ~100 tokens
# Simple piecewise linear fn that decreases speed as len_ps increases
def speed_callable(len_ps):
    speed = 0.8
    if len_ps <= 83:
        speed = 1
    elif len_ps < 183:
        speed = 1 - (len_ps - 83) / 500
    return speed * 1.1
model = KModel(repo_id=REPO_ID, config=CONFIG_PATH, model=MODLE_PATH).to(device).eval()
zh_pipeline = KPipeline(lang_code='z', repo_id=REPO_ID, model=model)
path = Path(__file__).parent
def test_warmup():
    # Run one short synthesis first so that one-time initialization is not timed
    sentence = "中华人民共和国万岁"
    generator = zh_pipeline(sentence, voice=VOICE_PATH, speed=speed_callable)
    next(generator)

def test_inf(sentence, num):
    f = path / f'test_{num}.wav'
    start_time = time.time()
    generator = zh_pipeline(sentence, voice=VOICE_PATH, speed=speed_callable)
    result = next(generator)
    end_time = time.time()
    wav = result.audio
    # Real-time factor (RTF): synthesis time divided by generated audio duration
    speech_len = len(wav) / SAMPLE_RATE
    print('yield speech len {}, rtf {}'.format(speech_len, (end_time - start_time) / speech_len))
    sf.write(f, wav, SAMPLE_RATE)

test_warmup()
# Synthesize every text in its own thread, then wait for all threads to finish
ths = []
for i in range(len(texts)):
    t = threading.Thread(target=test_inf, name='LoopThread' + str(i), args=(texts[i], i))
    ths.append(t)
    t.start()
for t in ths:
    t.join()
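
The test script reports a real-time factor (RTF) for each sentence: the time spent synthesizing divided by the duration of the generated audio. The following is a minimal, self-contained sketch of that calculation. The helper names `real_time_factor` and `timed_rtf` are hypothetical, and `fake_synthesize` is only a stand-in for the TTS pipeline so that the example runs without an NPU.

```python
import time

SAMPLE_RATE = 24000  # matches SAMPLE_RATE in the test script


def real_time_factor(elapsed_s, num_samples, sample_rate=SAMPLE_RATE):
    """Synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0:
        raise ValueError("no audio samples were generated")
    return elapsed_s / (num_samples / sample_rate)


def timed_rtf(synthesize, text):
    """Time one call to synthesize(text) and return (audio, rtf)."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return audio, real_time_factor(elapsed, len(audio))


def fake_synthesize(text):
    # Hypothetical stand-in for the TTS pipeline: 0.1 s of silence per character.
    return [0.0] * (len(text) * SAMPLE_RATE // 10)


if __name__ == "__main__":
    audio, rtf = timed_rtf(fake_synthesize, "hello")
    print(f"audio {len(audio) / SAMPLE_RATE:.2f} s, rtf {rtf:.6f}")
```

An RTF below 1 means the audio is produced faster than it plays back. Because the test script runs all sentences in parallel threads, each reported RTF also includes any time a thread spent waiting on the others, so it is likely to differ from the RTF of a single isolated request.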