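VoxCPM2 is a tokenizer-free diffusion-autoregressive text-to-speech model with 2 billion parameters. It supports 30 languages, outputs 48 kHz audio, and was trained on more than 2 million hours of multilingual speech data.

Supported languages: Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan, Shaanxi, Shandong, Tianjin, Minnan (Southern Min)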

    pip install voxcpm

Requirements: Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0
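
Before installing, it can help to confirm that the environment meets the version floors listed above. The sketch below is a minimal, dependency-free check. `parse_version` and `meets_requirements` are hypothetical helper names and are not part of the `voxcpm` package; in practice the PyTorch and CUDA version strings would typically be read from `torch.__version__` and `torch.version.cuda`.

```python
import re
import sys

# Minimum versions taken from the requirements line above.
MINIMUMS = {"python": (3, 10, 0), "torch": (2, 5, 0), "cuda": (12, 0, 0)}


def parse_version(text):
    """Turn '2.5.1+cu121' into (2, 5, 1); a local suffix such as '+cu121' is ignored."""
    core = text.split("+", 1)[0]
    numbers = [int(n) for n in re.findall(r"\d+", core)][:3]
    # Pad short versions such as '12.4' to three components for fair comparison.
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_requirements(versions):
    """Return {name: bool} for each component present in `versions`."""
    return {
        name: parse_version(found) >= MINIMUMS[name]
        for name, found in versions.items()
    }


if __name__ == "__main__":
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    print(meets_requirements({"python": python_version}))
```

Passing a dictionary such as `{"python": "3.9.18", "torch": "2.5.0", "cuda": "11.8"}` flags the Python and CUDA entries as too old while accepting the PyTorch version.
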
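
    from voxcpm import VoxCPM
    import soundfile as sf

    # Load the pretrained model without the optional denoiser
    model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

    wav = model.generate(
        text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
        cfg_value=2.0,            # guidance scale
        inference_timesteps=10,   # number of diffusion sampling steps
    )
    # Save the waveform at the model's own sample rate
    sf.write("output.wav", wav, model.tts_model.sample_rate)

To design a voice, put the voice description in parentheses at the start of `text`, followed by the content to synthesize: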

    wav = model.generate(
        text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
        cfg_value=2.0,
        inference_timesteps=10,
    )
    sf.write("voice_design.wav", wav, model.tts_model.sample_rate)

To clone a voice, pass a reference recording through `reference_wav_path`. A parenthesized style description can still be placed at the start of `text`:

    # Basic cloning
    wav = model.generate(
        text="This is a cloned voice generated by VoxCPM2.",
        reference_wav_path="speaker.wav",
    )
    sf.write("clone.wav", wav, model.tts_model.sample_rate)

    # Cloning with style control
    wav = model.generate(
        text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
        reference_wav_path="speaker.wav",
        cfg_value=2.0,
        inference_timesteps=10,
    )
    sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)

For the highest fidelity, provide the reference audio together with its exact transcript. Pass the same audio clip to both `reference_wav_path` and `prompt_wav_path` for maximum similarity:

    wav = model.generate(
        text="This is an ultimate cloning demonstration using VoxCPM2.",
        prompt_wav_path="speaker_reference.wav",
        prompt_text="The transcript of the reference audio.",
        reference_wav_path="speaker_reference.wav",
    )
    sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)

For streaming output, `generate_streaming` yields audio chunks that can be collected and concatenated:

    import numpy as np

    chunks = []
    for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
        chunks.append(chunk)
    # Join the chunks into a single waveform before saving
    wav = np.concatenate(chunks)
    sf.write("streaming.wav", wav, model.tts_model.sample_rate)

| Property | Value |
|---|---|
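| Architecture | Tokenizer-free diffusion autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4, 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encoder-decoder, 16 kHz input → 48 kHz output) |
| Training data | 2M+ hours of multilingual speech |
| LM token rate | 6.25 Hz |
| Max sequence length | 8192 tokens |
| Data type | bfloat16 |
| VRAM usage | ~8 GB |
| Real-time factor (RTX 4090) | ~0.30 (standard mode) / ~0.13 (Nano-vLLM mode) |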
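
Two figures in the table lend themselves to quick back-of-the-envelope arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by audio duration). As a simplification it also assumes the entire 8192-token window is spent on audio tokens at 6.25 Hz; text and prompt tokens presumably share that window, so the resulting duration should be read as an upper bound. The function names are illustrative only.

```python
LM_TOKEN_RATE_HZ = 6.25      # audio tokens per second, from the table
MAX_SEQUENCE_TOKENS = 8192   # maximum sequence length, from the table


def max_audio_seconds(tokens=MAX_SEQUENCE_TOKENS, rate_hz=LM_TOKEN_RATE_HZ):
    """Upper bound on audio duration if every token were an audio token."""
    return tokens / rate_hz


def synthesis_seconds(audio_seconds, rtf):
    """Estimated wall-clock time, assuming RTF = synthesis time / audio duration."""
    return audio_seconds * rtf


# 8192 / 6.25 = 1310.72 seconds, a little under 22 minutes
print(f"{max_audio_seconds():.2f} s")
# Ten seconds of speech at the two RTF values quoted for an RTX 4090
print(f"{synthesis_seconds(10.0, 0.30):.2f} s")  # 3.00 s
print(f"{synthesis_seconds(10.0, 0.13):.2f} s")  # 1.30 s
```

An RTF below 1 means synthesis runs faster than playback, which is what makes streaming output practical on this hardware.
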
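
VoxCPM2 achieves state-of-the-art or highly competitive results on mainstream zero-shot and controllable TTS benchmarks. For the full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax multilingual test), see the GitHub repository.

VoxCPM2 supports both full-parameter SFT and LoRA fine-tuning, and as little as 5–10 minutes of audio is enough to get started: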

    # LoRA fine-tuning (recommended)
    python scripts/train_voxcpm_finetune.py \
        --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

    # Full fine-tuning
    python scripts/train_voxcpm_finetune.py \
        --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

See the fine-tuning guide for complete instructions.
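
Because 5–10 minutes of audio is the suggested starting point for fine-tuning, a quick duration tally can confirm that a dataset reaches that range before training begins. This is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed WAV files only. `total_wav_seconds`, `has_enough_audio`, and the flat directory layout are assumptions made for illustration and are unrelated to the repository's training scripts.

```python
import wave
from pathlib import Path


def total_wav_seconds(directory):
    """Sum the durations of all .wav files directly inside `directory`."""
    total = 0.0
    for path in sorted(Path(directory).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            # Duration of one file = number of frames / frames per second
            total += wav_file.getnframes() / wav_file.getframerate()
    return total


def has_enough_audio(seconds, minimum_minutes=5):
    """True when the total reaches the lower end of the 5-10 minute guideline."""
    return seconds / 60 >= minimum_minutes
```

Calling `has_enough_audio(total_wav_seconds("my_dataset"))`, where `my_dataset` is a placeholder directory name, gives a quick yes or no before launching either training command.
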

    @article{voxcpm2_2026,
      title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
      author  = {VoxCPM Team},
      journal = {GitHub},
      year    = {2026},
    }

    @article{voxcpm2025,
      title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
      author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
                 Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
                 Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
      journal = {arXiv preprint arXiv:2509.24650},
      year    = {2025},
    }

Released under the Apache-2.0 license and free for commercial use. For production deployments, we recommend thorough testing and safety evaluation for your specific use case.