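CosyVoice2 is an advanced speech synthesis model from Tongyi Lab. It focuses on low-latency, high-quality streaming speech generation and supports multiple languages, zero-shot voice cloning, and fine-grained emotion control.
This document describes how to serve the CosyVoice2 model with Triton Inference Server, a full-featured open-source inference serving software. Triton supports models from multiple frameworks, concurrent execution, and dynamic batching, and it also provides custom Python backends and model ensembles.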
Multilingual support
Ultra-low latency
High accuracy
Strong stability
Natural experience
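Triton Inference Server supports multiple inference backends, including dedicated backends (such as ONNX and TensorFlow) and the flexible Python backend. On Ascend devices, deployment can be implemented through the Python backend.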
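To illustrate what a Python backend looks like, here is a minimal sketch of a `model.py`. The class name `TritonPythonModel` and the `initialize`/`execute`/`finalize` methods are what Triton's Python backend expects. The tensor names `TEXT` and `WAVEFORM` and the `_synthesize` placeholder are assumptions made for this example; they are not taken from this project's `model.py`.

```python
import json

try:
    # These modules are only needed once Triton loads the model;
    # triton_python_backend_utils exists only inside the Triton Python backend.
    import numpy as np
    import triton_python_backend_utils as pb_utils
except ImportError:
    np = pb_utils = None


class TritonPythonModel:
    """Minimal Python-backend model: one text input, one audio output."""

    def initialize(self, args):
        # args["model_config"] is the model configuration serialized as JSON.
        self.model_config = json.loads(args["model_config"])
        self.model_name = self.model_config.get("name", "")
        # A real backend would load the CosyVoice2 model onto the NPU here.

    def _synthesize(self, text):
        # Placeholder (assumption): returns one silent sample instead of
        # running speech synthesis.
        return np.zeros(1, dtype=np.float32)

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" is a hypothetical input name; it must match config.pbtxt.
            text_tensor = pb_utils.get_input_tensor_by_name(request, "TEXT")
            text = text_tensor.as_numpy().reshape(-1)[0].decode("utf-8")
            audio = self._synthesize(text)
            # "WAVEFORM" is a hypothetical output name.
            out = pb_utils.Tensor("WAVEFORM", audio)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        # One response per request, in the same order.
        return responses

    def finalize(self):
        # Called once when the model is unloaded; release resources here.
        pass
```

Triton calls `initialize` once when the model is loaded, `execute` for each batch of requests, and `finalize` when the model is unloaded.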
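The Triton Inference Server model repository is a directory-structured model management system. It defines where models are stored in the service (a repository can hold multiple models), how each model is configured (inputs/outputs, batching, concurrency, and so on), and how versions are managed (a model can have multiple versions).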
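The layout rules just described can be checked mechanically. The following sketch walks a repository directory and reports models that lack a `config.pbtxt`, a numeric version directory, or a `model.py` inside a version directory. The function names are hypothetical, and the `model.py` check assumes every model uses the Python backend with its default file name.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems for one model directory (empty if none)."""
    model_dir = Path(model_dir)
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Version directories are named with integers, e.g. "1", "2".
    versions = sorted(d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit())
    if not versions:
        problems.append("no numeric version directory")
    for version in versions:
        # Assumption: a Python-backend model keeps its code in <version>/model.py.
        if not (version / "model.py").is_file():
            problems.append(f"version {version.name}: missing model.py")
    return problems


def check_repository(repo_dir):
    """Map each model name in the repository to its list of problems."""
    return {
        d.name: check_model_dir(d)
        for d in sorted(Path(repo_dir).iterdir())
        if d.is_dir()
    }
```

For example, `check_repository("./model_repository")` returns a dictionary such as `{"cosyvoice": []}` when the layout is complete.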
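The overall layout of the model repository is as follows:
model_repository
└── model_name
    ├── 1                 # version directory; several versions can be managed at once
    │   └── model.py      # backend inference code
    └── config.pbtxt      # model configuration: model name, backend, inputs/outputs, etc.
A prebuilt image, cosyvoice2-triton.tar.gz, is provided and can be downloaded and deployed directly. For the full procedure, see "3. Detailed deployment steps" and the sections that follow.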
docker load < cosyvoice2-triton.tar.gz
docker run -itd --privileged --name=triton_server_cosyvoice2 --net=host --shm-size=500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-e LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH \
cosyvoice2-tritonserver:24.10-8.3.rc2 /bin/bash
The directory structure is as follows:
|-- model_repository
|   |-- cosyvoice
|       |-- 1
|       |   |-- CosyVoice2                      # ModelZoo repository directory
|       |   |   |-- 300I
|       |   |   |   |-- diff_CosyVoice_300I.patch
|       |   |   |   |-- modeling_qwen2.py
|       |   |   |-- 800I
|       |   |   |   |-- diff_CosyVoice_800I.patch
|       |   |   |   |-- modeling_qwen2.py
|       |   |   |-- CosyVoice                   # CosyVoice repository directory
|       |   |   |   |-- cosyvoice
|       |   |   |   |-- transformers
|       |   |   |   |-- infer.py
|       |   |   |-- modify_onnx.py
|       |   |   |-- requirements.txt
|       |   |-- model.py                        # backend inference code
|       |-- client.py                           # client request test script
|       |-- config.pbtxt                        # model configuration: model name, backend, inputs/outputs, etc.
# Import the required environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/opt/tritonserver/model_repository/cosyvoice/1/CosyVoice2/CosyVoice/third_party/Matcha-TTS
export PYTHONPATH=$PYTHONPATH:/opt/tritonserver/model_repository/cosyvoice/1/CosyVoice2/CosyVoice/transformers/src
export PYTHONPATH=$PYTHONPATH:/opt/tritonserver/model_repository/cosyvoice/1/CosyVoice2/CosyVoice
export PYTHONIOENCODING=utf-8
# Start the Triton service
tritonserver --model-repository=./model_repository/ --http-address=0.0.0.0 --http-port=8989
# The client script saves the generated speech to sft_result_triton.wav
# The first inference takes longer because it triggers compilation; later inferences do not recompile
python3 model_repository/cosyvoice/client.py

| Component | Version | Environment setup guide |
|---|---|---|
| Device model | Atlas 800T A2 910B | - |
| Firmware and driver | 25.2.0 | PyTorch framework inference environment setup |
| Triton image | 24.10 | Triton Inference Server (NVIDIA NGC) |
| CANN | 8.3.RC2 | Includes the kernels package and the toolkit package |
| Python | 3.10 | - |
| PyTorch | 2.3.1 | - |
| Ascend Extension PyTorch | 2.3.1.post6 | - |
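Once the environment has been set up, the Python-side versions in the table can be compared against what is actually installed. The sketch below uses only the standard library and reads package metadata without importing the packages. The distribution names `torch` and `torch_npu` are assumptions about how the packages are registered with pip, and the function names are hypothetical.

```python
import sys
from importlib import metadata

# Versions taken from the table above; the distribution names are assumptions.
EXPECTED = {"torch": "2.3.1", "torch_npu": "2.3.1.post6"}
EXPECTED_PYTHON = (3, 10)


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Return a list of human-readable mismatches (empty if everything matches)."""
    mismatches = []
    found_python = sys.version_info[:2]
    if found_python != EXPECTED_PYTHON:
        mismatches.append(
            f"python: expected 3.10, found {found_python[0]}.{found_python[1]}"
        )
    for name, expected in EXPECTED.items():
        found = installed_version(name)
        # Ignore a local build suffix such as "+cpu" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches.append(f"{name}: expected {expected}, found {found}")
    return mismatches
```

Calling `check_environment()` inside the container returns an empty list when all three versions match the table.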
docker pull nvcr.io/nvidia/tritonserver:24.10-py3
docker run -itd --privileged --name=triton_server_cosyvoice2_24.10 --net=host --shm-size=500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-e LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH \
nvcr.io/nvidia/tritonserver:24.10-py3 /bin/bash
# Enter the container
docker exec -it triton_server_cosyvoice2_24.10 /bin/bash
# Add the driver library path to ~/.bashrc
echo 'LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH' >> ~/.bashrc
# Download the CANN kernels and toolkit packages
wget "https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.3.RC1/Ascend-cann-kernels-910b_8.3.RC1_linux-aarch64.run?response-content-type=application/octet-stream" -O Ascend-cann-kernels-910b_8.3.RC1_linux-aarch64.run
wget "https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.3.RC1/Ascend-cann-toolkit_8.3.RC1_linux-aarch64.run?response-content-type=application/octet-stream" -O Ascend-cann-toolkit_8.3.RC1_linux-aarch64.run
# Make the installers executable, then install the toolkit and the kernels
chmod +x Ascend-cann-*
./Ascend-cann-toolkit_8.3.RC1_linux-aarch64.run --install
./Ascend-cann-kernels-910b_8.3.RC1_linux-aarch64.run --install
Pull the ModelZoo repository source
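git clone https://gitee.com/ascend/ModelZoo-PyTorch.git
# The ModelZoo-PyTorch repository directory is referred to as ${ModelZoo-PyTorch}
Pull the source of this repository
git clone https://atomgit.com/Ascend-SACT/CosyVoice2-Triton.git
# The CosyVoice2-Triton repository directory is referred to as ${CosyVoice2-Triton}
Place the CosyVoice model directory in the version folder
cd ${CosyVoice2-Triton}/model_repository/cosyvoice/1
cp -r ${ModelZoo-PyTorch}/ACL_PyTorch/built-in/audio/CosyVoice2 ./
The directory structure is as follows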
CosyVoice2-Triton/
|-- model_repository
|   |-- cosyvoice
|       |-- 1
|       |   |-- CosyVoice2
|       |   |-- model.py
|       |-- client.py
|       |-- config.pbtxt
For model migration, refer to "CosyVoice (TorchAir) inference guide". The main steps are as follows
Deploy CosyVoice and Transformers
cd ${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2
# Get the CosyVoice source
git clone https://github.com/FunAudioLLM/CosyVoice
cd CosyVoice
git reset --hard fd45708
git submodule update --init --recursive
# Apply the patch for the machine in use. This guide uses the 800T A2, which shares the 800I patch file
git apply ../800I/diff_CosyVoice_800I.patch
# Copy infer.py into CosyVoice
cp ../infer.py ./
# Get the Transformers source, cloned inside CosyVoice to match the directory structure and PYTHONPATH used in this guide
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.37.0
cd ..
# Replace modeling_qwen2.py in the transformers repository. The 800T A2 used here shares modeling_qwen2.py with the 800I
cp ../800I/modeling_qwen2.py ./transformers/src/transformers/models/qwen2
The file directory structure is as follows
|-- model_repository
|   |-- cosyvoice
|       |-- 1
|       |   |-- CosyVoice2
|       |   |   |-- 300I
|       |   |   |   |-- diff_CosyVoice_300I.patch
|       |   |   |   |-- modeling_qwen2.py
|       |   |   |-- 800I
|       |   |   |   |-- diff_CosyVoice_800I.patch
|       |   |   |   |-- modeling_qwen2.py
|       |   |   |-- CosyVoice
|       |   |   |   |-- cosyvoice
|       |   |   |   |-- transformers
|       |   |   |   |-- infer.py
|       |   |   |-- modify_onnx.py
|       |   |   |-- requirements.txt
|       |   |-- model.py
|       |-- client.py
|       |-- config.pbtxt
Install dependencies
apt-get update
apt-get install sox git-lfs
pip3 install tokenizers==0.15.1
pip3 install "ruamel.yaml<0.17"
# Build and install OpenFst manually; otherwise installing WeTextProcessing fails
# Download and extract the source package
wget https://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.8.3.tar.gz
tar -xzf openfst-1.8.3.tar.gz
# Enter the directory, then build and install
cd openfst-1.8.3
./configure --enable-far --enable-mpdt --enable-pdt
make -j$(nproc)
make install
cd ..
# Confirm that the shared library exists
ls /usr/local/lib/libfstmpdtscript.so.26
# Configure the shared library path
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
# Install WeTextProcessing
pip3 install WeTextProcessing==1.0.4.1
# Install the requirements
pip3 install -r ../requirements.txt
Install the msit tool
Refer to the msit installation guide and install from source
# 1. Clone the msit repository
git clone https://gitee.com/ascend/msit.git
cd msit/msit
# 2. Install the msit package
pip install .
# 3. Install benchmark and surgeon; ais_bench is deployed automatically
msit install benchmark surgeon
Obtain the weights
cd ${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2/CosyVoice
# 1. Clone the weights repository
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git
cd CosyVoice2-0.5B
# The CosyVoice2-0.5B directory is referred to as ${CosyVoice2-0.5B}
# 2. Check out the target commit
git checkout 9bd5b08fc085bd93d3f8edb16b67295606290350
# 3. Pull the LFS large files (such as the model weights)
git lfs pull
# 4. This example runs inference with the SFT pretrained voice, so the speaker weights must also be downloaded into the weights directory
wget https://www.modelscope.cn/models/iic/CosyVoice-300M-SFT/resolve/master/spk2info.pt
Model conversion
# Modify the ONNX graph structure
cd ${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2/
python3 modify_onnx.py ./CosyVoice/CosyVoice2-0.5B/
# Convert the models; set ${soc_version} to the chip version of the target device
source /usr/local/Ascend/ascend-toolkit/set_env.sh
atc --framework=5 --soc_version=${soc_version} --model ./${CosyVoice2-0.5B}/speech_token_md.onnx --output ./${CosyVoice2-0.5B}/speech --input_shape="feats:1,128,-1;feats_length:1" --precision_mode allow_fp32_to_fp16
atc --framework=5 --soc_version=${soc_version} --model ./${CosyVoice2-0.5B}/flow.decoder.estimator.fp32.onnx --output ./${CosyVoice2-0.5B}/flow --input_shape="x:2,80,-1;mask:2,1,-1;mu:2,80,-1;t:2;spks:2,80;cond:2,80,-1"
atc --framework=5 --soc_version=${soc_version} --model ./${CosyVoice2-0.5B}/flow.decoder.estimator.fp32.onnx --output ./${CosyVoice2-0.5B}/flow_static --input_shape="x:2,80,-1;mask:2,1,-1;mu:2,80,-1;t:2;spks:2,80;cond:2,80,-1" --dynamic_dims="100,100,100,100;200,200,200,200;300,300,300,300;400,400,400,400;500,500,500,500;600,600,600,600;700,700,700,700" --input_format=ND
Model inference verification
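For the `flow_static` model converted in the previous step, `--dynamic_dims` lists the discrete shape profiles the converted model accepts, with one value per `-1` in `--input_shape`. The sketch below rebuilds that argument string and shows one possible way to choose a profile for a given sequence length by rounding up to the next supported value. The helper names and the round-up policy are assumptions for illustration; how this project's inference code actually selects a profile is not described here.

```python
import bisect

# Sequence lengths taken from the --dynamic_dims argument of the flow_static conversion.
GEARS = [100, 200, 300, 400, 500, 600, 700]
# Four inputs (x, mask, mu, cond) have a dynamic last dimension.
NUM_DYNAMIC_DIMS = 4


def build_dynamic_dims(gears=GEARS, count=NUM_DYNAMIC_DIMS):
    """Build the --dynamic_dims value: one comma-separated group per profile."""
    return ";".join(",".join([str(g)] * count) for g in gears)


def pick_gear(seq_len, gears=GEARS):
    """Return the smallest profile >= seq_len, or None if seq_len is too long."""
    index = bisect.bisect_left(gears, seq_len)
    return gears[index] if index < len(gears) else None
```

Under this policy a sequence of length 250 would be padded to the 300 profile, while a sequence longer than 700 has no matching profile and would need another path, such as the dynamic-shape `flow` model.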
cd ${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2/CosyVoice
# 1. Specify the NPU ID to use; the default is 0
export ASCEND_RT_VISIBLE_DEVICES=0
# 2. Set the environment variables
export PYTHONPATH=third_party/Matcha-TTS:$PYTHONPATH
export PYTHONPATH=transformers/src:$PYTHONPATH
# 3. Run the inference script
python3 infer.py --model_path=${CosyVoice2-0.5B} --stream_out
Install the Triton dependencies
pip3 install tritonclient pyworld gevent geventhttpclient
Start the service
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2/CosyVoice/third_party/Matcha-TTS
export PYTHONPATH=$PYTHONPATH:${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2/CosyVoice/transformers/src
export PYTHONPATH=$PYTHONPATH:${CosyVoice2-Triton}/model_repository/cosyvoice/1/CosyVoice2/CosyVoice
export PYTHONIOENCODING=utf-8
cd ${CosyVoice2-Triton}
tritonserver --model-repository=./model_repository/ --http-address=0.0.0.0 --http-port=8989
Client test
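Before running the client, it can help to confirm that the server has finished loading the model. The sketch below polls the `/v2/health/ready` endpoint of Triton's HTTP API on the port used in the start command (8989), using only the standard library. The function names, retry count, and delay are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def is_server_ready(base_url="http://127.0.0.1:8989", timeout=2.0):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(base_url="http://127.0.0.1:8989", attempts=30, delay=2.0):
    """Poll the readiness endpoint; return True as soon as the server is ready."""
    for _ in range(attempts):
        if is_server_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

A script could call `wait_until_ready()` and start `client.py` only once it returns `True`.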
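The client script saves the generated speech to sft_result_triton.wav.
The first inference takes longer because it triggers compilation; later inferences do not recompile. Consider warming up the model when the service starts.
python3 model_repository/cosyvoice/client.py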
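The warm-up suggested above could be sketched as follows: send one or more throwaway requests to Triton's HTTP inference endpoint right after startup, so that compilation happens before real traffic arrives. The model name `cosyvoice` matches the directory name in the model repository. The input name `TEXT`, the shape `[1, 1]`, and the function names are assumptions; the real values are whatever this project's `config.pbtxt` declares.

```python
import json
import urllib.request

MODEL_NAME = "cosyvoice"  # directory name under model_repository
INPUT_NAME = "TEXT"       # hypothetical; use the input name declared in config.pbtxt


def infer_url(base_url="http://127.0.0.1:8989", model=MODEL_NAME):
    """Return the HTTP inference endpoint for a model."""
    return f"{base_url}/v2/models/{model}/infer"


def build_request(text, input_name=INPUT_NAME):
    """Build a JSON inference request body carrying one string input."""
    return {
        "inputs": [
            # Shape [1, 1] is an assumption (batch of one, one string element).
            {"name": input_name, "shape": [1, 1], "datatype": "BYTES", "data": [text]}
        ]
    }


def warm_up(text="warm-up", rounds=1, base_url="http://127.0.0.1:8989", timeout=600):
    """Send throwaway requests so the first real request avoids compilation."""
    body = json.dumps(build_request(text)).encode("utf-8")
    for _ in range(rounds):
        req = urllib.request.Request(
            infer_url(base_url),
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()  # discard the synthesized audio
```

The long default timeout reflects that the first request includes compilation time. An alternative is to run the warm-up inside the backend's `initialize` method, which keeps it out of client code.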