👋 Join our community to discuss and get support: Feishu | Discord
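VoxCPM is a tokenizer-free text-to-speech (TTS) system. Its end-to-end diffusion autoregressive architecture generates continuous speech representations directly, with no discrete tokenization step, which yields highly natural and expressive speech.

VoxCPM2 is the latest major release: a 2B-parameter model trained on more than 2 million hours of multilingual speech data. It supports 30 languages, voice design, controllable voice cloning, and 48kHz studio-quality audio output. The model is built on the MiniCPM-4 architecture.

This branch is officially adapted for Huawei Ascend NPUs. The model runs natively on Ascend hardware, with automatic device detection and full inference support.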
- Supported devices: Atlas 800I A2/A3, Atlas A2/A3 training series
- Verified configuration: CANN 8.5.0 + PyTorch 2.9.0 + torch_npu 2.9.0
- Usage: pass `device="npu"` (or leave it as `"auto"` to detect the NPU automatically)

```python
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", device="npu")
wav = model.generate(text="Hello Ascend NPU", max_len=200)
```

See ASCEND.md for detailed adaptation notes and performance benchmarks.
Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan, Shaanxi, Shandong, Tianjin, and Southern Min (Minnan)
```sh
pip install voxcpm
```

Requirements: Python ≥ 3.10 (< 3.13), PyTorch ≥ 2.5.0, and either CUDA ≥ 12.0 or an Ascend NPU (CANN 8.5.0, torch_npu 2.9.0). See the Quick Start documentation for details.
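The Python version constraint above can be checked before installing. The following is a minimal sketch: `python_supported` is a hypothetical helper, not part of the `voxcpm` package, and it encodes only the documented range (≥ 3.10, < 3.13).

```python
import sys


def python_supported(version=None):
    """Return True if the interpreter is in the documented range (>= 3.10, < 3.13)."""
    # Compare only (major, minor); default to the running interpreter.
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 13)


if __name__ == "__main__":
    status = "supported" if python_supported() else "outside the documented range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```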
```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)
wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("saved: demo.wav")
```

If you prefer to download the model from ModelScope first, you can use:
```sh
pip install modelscope
```

```python
from modelscope import snapshot_download
snapshot_download("OpenBMB/VoxCPM2", local_dir='./pretrained_models/VoxCPM2')  # specify the local directory to save the model

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("./pretrained_models/VoxCPM2", load_denoiser=False)
wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
```

Create a voice from a natural-language description alone, with no reference audio. Format: put the description in parentheses at the start of `text`, for example `"(your voice description)The text to synthesize."`:
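```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

Provide a reference audio clip. The model clones the timbre, and you can still use control instructions to adjust speaking rate, emotion, or style.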
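```python
# Plain voice cloning from a reference clip
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with a style instruction in the leading parentheses
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

Provide the reference audio together with its exact transcript for continuation-based cloning that reproduces every vocal detail. For maximum cloning similarity, pass the same reference clip to both `reference_wav_path` and `prompt_wav_path`, as shown below: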
```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",  # optional, for better similarity
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

```python
import numpy as np

# Collect streamed chunks, then join them into a single waveform
chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```

```sh
# Voice design (no reference audio needed)
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --output out.wav

# Controllable voice cloning with style control
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --control "Young female voice, warm and gentle, slightly smiling" \
  --output out.wav

# Voice cloning (reference audio)
voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Ultimate cloning (prompt audio + transcript)
# --reference-audio is optional, for better similarity
voxcpm clone \
  --text "This is a voice cloning demo." \
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Batch processing
voxcpm batch --input examples/input.txt --output-dir outs

# Help
voxcpm --help
```

```sh
python app.py --port 8808  # then open in browser: http://localhost:8808
```

Use `--device` to select the runtime device:
```sh
python app.py --device auto
```

Supported values are `auto`, `cpu`, `mps`, `cuda`, and `cuda:N`. On Apple Silicon Macs, `auto` uses MPS when it is available.
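The accepted `--device` values follow a simple pattern. As a sketch only, the check below validates a device string against the list above. `is_valid_device` is a hypothetical helper and does not reflect how `app.py` actually parses the flag.

```python
import re

# Accepts exactly the values listed above: auto, cpu, mps, cuda, and cuda:N.
_DEVICE_PATTERN = re.compile(r"(auto|cpu|mps|cuda(:\d+)?)")


def is_valid_device(device: str) -> bool:
    """Return True if `device` is one of the documented --device values."""
    return _DEVICE_PATTERN.fullmatch(device) is not None
```

A check like this can reject typos such as `gpu` or `cuda:` before any model weights are loaded.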
For high-throughput serving, use Nano-vLLM-VoxCPM, a dedicated inference engine built on Nano-vLLM that supports concurrent requests and an async API.
```sh
pip install nano-vllm-voxcpm
```

```python
from nanovllm_voxcpm import VoxCPM
import numpy as np
import soundfile as sf

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)
server.stop()
```

On an NVIDIA RTX 4090 the RTF is as low as ~0.13, compared with ~0.3 for the standard PyTorch implementation. The engine also supports batched concurrent requests and a FastAPI HTTP server. See the Nano-vLLM-VoxCPM repository for deployment details.
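The real-time factor (RTF) quoted above is conventionally the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A minimal sketch of that arithmetic follows. `real_time_factor` is an illustrative helper, not part of either package, and the timings in the usage lines are made-up numbers chosen to match the quoted figures.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical timings: 10 s of 48 kHz audio generated in 1.3 s and in 3.0 s.
rtf_fast = real_time_factor(1.3, 480_000, 48_000)
rtf_slow = real_time_factor(3.0, 480_000, 48_000)
print(f"RTF {rtf_fast:.2f} vs {rtf_slow:.2f}")
```

In practice the synthesis time would come from wrapping the generation call in `time.perf_counter()` measurements.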
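For production multi-tenant deployments, use vLLM-Omni, the official vLLM project's omni-modal extension, which supports VoxCPM2 natively. It provides a PagedAttention KV cache, continuous batching, and a drop-in OpenAI-compatible `/v1/audio/speech` endpoint.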
```sh
# Install from source (latest main; vllm-omni is rapidly evolving)
uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni
uv pip install -e .
```

For installation on other platforms (ROCm, XPU, MUSA, NPU) and for Docker images, see the vLLM-Omni installation guide.
```sh
# Launch an OpenAI-compatible TTS server (--omni enables omni-modal serving)
vllm serve openbmb/VoxCPM2 --omni --port 8000

# Call it from any OpenAI client
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2 on vLLM-Omni!","voice":"default"}' \
  --output out.wav
```

Built on the upstream vLLM scheduler, it supports batched concurrent requests, streaming chunked delivery, and multi-GPU deployment out of the box. See the VoxCPM2 example for the complete deployment recipe.
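The same request can be issued from Python using only the standard library. This is a sketch under the assumptions of the curl call above: the server address, model name, and `voice` value are copied from it, while `build_speech_request` and `synthesize_to_file` are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="openbmb/VoxCPM2", voice="default"):
    """Build the same POST request as the curl call, without sending it."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path="out.wav", **kwargs):
    """Send the request and write the response body to `path` (needs a running server)."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as response:
        with open(path, "wb") as f:
            f.write(response.read())
    return path
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a live server.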
| | VoxCPM2 | VoxCPM1.5 | VoxCPM-0.5B |
|---|---|---|---|
| Status | 🟢 Latest | Stable | Legacy |
| Backbone parameters | 2B | 0.6B | 0.5B |
| Audio sample rate | 48kHz | 44.1kHz | 16kHz |
| LM token rate | 6.25Hz | 6.25Hz | 12.5Hz |
| Languages | 30 | 2 (Chinese, English) | 2 (Chinese, English) |
| Cloning modes | Isolated reference and continuation | Continuation only | Continuation only |
| Voice design | ✅ | - | - |
| Controllable voice cloning | ✅ | - | - |
| SFT / LoRA | ✅ | ✅ | ✅ |
| RTF (RTX 4090) | ~0.30 | ~0.15 | ~0.17 |
| RTF with Nano-vLLM (RTX 4090) | ~0.13 | ~0.08 | ~0.10 |
| VRAM usage | ~8 GB | ~6 GB | ~5 GB |
| Model weights | 🤗 HF / MS | 🤗 HF / MS | 🤗 HF / MS |
| Technical report | Coming soon | - | arXiv, ICLR 2026 |
| Demo page | Audio samples | - | Audio samples |
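The sample rates and LM token rates in the table imply how much audio each language-model step covers. The arithmetic below is a sketch that assumes the token rate counts LM steps per second of output audio. `samples_per_token` is an illustrative name.

```python
# (sample rate in Hz, LM token rate in Hz), copied from the table above.
MODELS = {
    "VoxCPM2": (48_000, 6.25),
    "VoxCPM1.5": (44_100, 6.25),
    "VoxCPM-0.5B": (16_000, 12.5),
}


def samples_per_token(sample_rate_hz: int, token_rate_hz: float) -> float:
    """Audio samples covered by one LM token, assuming evenly spaced tokens."""
    return sample_rate_hz / token_rate_hz


for name, (sample_rate, token_rate) in MODELS.items():
    print(f"{name}: {samples_per_token(sample_rate, token_rate):.0f} samples per token, "
          f"{1000 / token_rate:.0f} ms per token")
```

Under that assumption, each VoxCPM2 token spans 7,680 samples (160 ms) of 48kHz audio.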
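VoxCPM2 is built on a tokenizer-free, diffusion autoregressive paradigm. The model operates entirely in the latent space of AudioVAE V2 and follows a four-stage pipeline, LocEnc → TSLM → RALM → LocDiT, which delivers rich expressiveness and native 48kHz audio output.

For full architecture details, VoxCPM2-specific upgrades, and a model comparison table, see Architecture Design.

VoxCPM2 achieves state-of-the-art or comparable results on public zero-shot and controllable TTS benchmarks.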
| Model | Parameters | Open-source | Test-EN WER/%↓ | Test-EN SIM/%↑ | Test-ZH CER/%↓ | Test-ZH SIM/%↑ | Test-Hard CER/%↓ | Test-Hard SIM/%↑ |
|---|---|---|---|---|---|---|---|---|
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | ✅ | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
| Qwen3-Omni | 30B-A3B | ✅ | 1.39 | - | 1.07 | - | - | - |
| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| VoxCPM-0.5B | 0.6B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |
| VoxCPM1.5 | 0.8B | ✅ | 2.12 | 71.4 | 1.18 | 77.0 | 7.74 | 73.1 |
| MOSS-TTS | | ✅ | 1.85 | 73.4 | 1.20 | 78.8 | - | - |
| Qwen3-TTS | 1.7B | ✅ | 1.23 | 71.7 | 1.22 | 77.0 | 6.76 | 74.8 |
| FishAudio S2 | 4B | ✅ | 0.99 | - | 0.54 | - | 5.99 | - |
| LongCat-Audio-DiT | 3.5B | ✅ | 1.50 | 78.6 | 1.09 | 81.8 | 6.04 | 79.7 |
| VoxCPM2 | 2B | ✅ | 1.84 | 75.3 | 0.97 | 79.5 | 8.13 | 75.3 |
| Model | zh | en | hard-zh | hard-en | ja | ko | de | es | fr | it | ru |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 11.96 | 9.13 | 19.7 | - | - | - | - | - |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 10.55 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |
| Fish Audio S2 | 2.65 | 2.43 | 9.10 | 4.40 | 3.96 | 2.76 | 2.22 | 2.00 | 6.26 | 2.04 | 2.78 |
| VoxCPM2 | 3.65 | 5.00 | 8.55 | 8.48 | 5.96 | 5.69 | 4.77 | 3.80 | 9.85 | 4.25 | 5.21 |
| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 1.665 | 1.666 | – | 3.500 | 13.046 |
| Cantonese | 34.111 | 51.513 | – | 30.670 | 38.584 |
| Chinese | 2.252 | 16.026 | 0.928 | 0.730 | 1.136 |
| Czech | 3.875 | 2.108 | – | 2.840 | 24.132 |
| Dutch | 1.143 | 0.803 | – | 0.990 | 0.913 |
| English | 2.164 | 2.339 | 0.934 | 1.620 | 2.289 |
| Finnish | 4.666 | 2.964 | – | 3.330 | 2.632 |
| French | 4.099 | 5.216 | 2.858 | 3.050 | 4.534 |
| German | 1.906 | 0.572 | 1.235 | 0.550 | 0.679 |
| Greek | 2.016 | 0.991 | – | 5.740 | 2.844 |
| Hindi | 6.962 | 5.827 | – | 14.640 | 19.699 |
| Indonesian | 1.237 | 1.059 | – | 1.460 | 1.084 |
| Italian | 1.543 | 1.743 | 0.948 | 1.270 | 1.563 |
| Japanese | 3.519 | 10.646 | 3.823 | 2.760 | 4.628 |
| Korean | 1.747 | 1.865 | 1.755 | 1.180 | 1.962 |
| Polish | 1.415 | 0.766 | – | 1.260 | 1.141 |
| Portuguese | 1.877 | 1.331 | 1.526 | 1.140 | 1.938 |
| Romanian | 2.878 | 1.347 | – | 10.740 | 21.577 |
| Russian | 4.281 | 3.878 | 3.212 | 2.400 | 3.634 |
| Spanish | 1.029 | 1.084 | 1.126 | 0.910 | 1.438 |
| Thai | 2.701 | 73.936 | – | 4.230 | 2.961 |
| Turkish | 1.52 | 0.699 | – | 0.870 | 0.817 |
| Ukrainian | 1.082 | 0.997 | – | 2.300 | 6.316 |
| Vietnamese | 0.88 | 73.415 | – | 7.410 | 3.307 |
| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 73.6 | 70.6 | – | 75.0 | 79.1 |
| Cantonese | 77.8 | 67.0 | – | 80.5 | 83.5 |
| Chinese | 78.0 | 67.7 | 79.9 | 81.6 | 82.5 |
| Czech | 79.6 | 68.5 | – | 79.8 | 78.3 |
| Dutch | 73.8 | 68.0 | – | 73.0 | 80.8 |
| English | 75.6 | 61.3 | 77.5 | 79.7 | 85.4 |
| Finnish | 83.5 | 75.9 | – | 81.9 | 89.0 |
| French | 62.8 | 53.5 | 62.8 | 69.8 | 73.5 |
| German | 73.3 | 61.4 | 77.5 | 76.7 | 80.3 |
| Greek | 82.6 | 73.3 | – | 79.5 | 86.0 |
| Hindi | 81.8 | 73.0 | – | 82.1 | 85.6 |
| Indonesian | 72.9 | 66.0 | – | 76.3 | 80.0 |
| Italian | 69.9 | 57.9 | 81.7 | 74.7 | 78.0 |
| Japanese | 77.6 | 73.8 | 78.8 | 79.6 | 82.8 |
| Korean | 77.6 | 70.0 | 79.9 | 81.7 | 83.3 |
| Polish | 80.2 | 72.9 | – | 81.9 | 88.4 |
| Portuguese | 80.5 | 71.1 | 81.7 | 78.1 | 83.7 |
| Romanian | 80.9 | 69.9 | – | 73.3 | 79.7 |
| Russian | 76.1 | 67.6 | 79.2 | 79.0 | 81.1 |
| Spanish | 76.2 | 61.5 | 81.4 | 77.6 | 83.1 |
| Thai | 80.0 | 58.8 | – | 78.6 | 84.0 |
| Turkish | 77.9 | 59.6 | – | 83.5 | 87.1 |
| Ukrainian | 73.0 | 64.7 | – | 74.7 | 79.8 |
| Vietnamese | 74.3 | 36.9 | – | 74.0 | 80.6 |
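We additionally ran an internal multilingual intelligibility benchmark covering 30 languages × 500 samples. Intelligibility is scored from ASR transcripts obtained through the Gemini 3.1 Flash Lite API.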
| Language | Metric | VoxCPM2 | Fish S2-Pro |
|---|---|---|---|
| ar (Arabic) | CER | 1.23% | 0.30% |
| da (Danish) | WER | 2.70% | 3.52% |
| de (German) | WER | 0.96% | 0.64% |
| el (Greek) | WER | 3.17% | 4.61% |
| en (English) | WER | 0.42% | 1.03% |
| es (Spanish) | WER | 1.33% | 0.64% |
| fi (Finnish) | WER | 2.24% | 2.80% |
| fr (French) | WER | 2.16% | 2.34% |
| he (Hebrew) | CER | 2.98% | 15.27% |
| hi (Hindi) | CER | 0.79% | 0.91% |
| id (Indonesian) | WER | 1.36% | 1.68% |
| it (Italian) | WER | 1.65% | 1.08% |
| ja (Japanese) | CER | 2.40% | 1.82% |
| km (Khmer) | CER | 2.05% | 75.15% |
| ko (Korean) | CER | 0.95% | 0.29% |
| lo (Lao) | CER | 1.90% | 87.40% |
| ms (Malay) | WER | 1.75% | 1.41% |
| my (Burmese) | CER | 1.42% | 85.27% |
| nl (Dutch) | WER | 1.25% | 1.68% |
| no (Norwegian) | WER | 2.49% | 3.76% |
| pl (Polish) | WER | 1.90% | 1.65% |
| pt (Portuguese) | WER | 1.48% | 1.49% |
| ru (Russian) | WER | 0.90% | 0.86% |
| sv (Swedish) | WER | 2.22% | 2.63% |
| sw (Swahili) | CER | 1.07% | 2.02% |
| th (Thai) | CER | 0.94% | 1.92% |
| tl (Tagalog) | WER | 2.63% | 4.00% |
| tr (Turkish) | WER | 1.65% | 1.65% |
| vi (Vietnamese) | WER | 1.56% | 5.56% |
| zh (Chinese) | CER | 0.92% | 1.02% |
| Average (30 languages) | | 1.68% | - |
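The WER and CER columns are edit-distance metrics: the number of word-level or character-level edits needed to turn the ASR transcript into the reference text, divided by the reference length. The sketch below shows the standard definitions. It is not the evaluation script behind the table, and any text normalization applied there (casing, punctuation, segmentation) is not modelled here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length (spaces ignored)."""
    ref_chars = reference.replace(" ", "")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)


print(f"WER: {wer('hello from voxcpm two', 'hello from voxcpm too'):.2%}")
print(f"CER: {cer('你好世界', '你好世介'):.2%}")
```

Both helpers assume a non-empty reference. CER is the usual choice for scripts without whitespace-delimited words, which is consistent with the CER rows for Chinese, Japanese, Thai, and similar languages in the table.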
| Model | InstructTTSEval-ZH APS⬆ | InstructTTSEval-ZH DSD⬆ | InstructTTSEval-ZH RP⬆ | InstructTTSEval-EN APS⬆ | InstructTTSEval-EN DSD⬆ | InstructTTSEval-EN RP⬆ |
|---|---|---|---|---|---|---|
| Hume | – | – | – | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | – | – | – | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | – | – | – | 60.0 | 45.9 | 31.2 |
| PromptTTS | – | – | – | 64.3 | 47.2 | 31.4 |
| PromptStyle | – | – | – | 57.4 | 46.4 | 30.9 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | – | – | – |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| Qwen3TTS-12Hz-1.7B-VD | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| VoxCPM2 | 85.2 | 71.5 | 60.8 | 84.2 | 83.2 | 71.4 |
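VoxCPM supports both full fine-tuning (SFT) and LoRA fine-tuning. As little as 5–10 minutes of audio is enough to adapt the model to a specific speaker, language, or domain.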
```sh
# LoRA fine-tuning (parameter-efficient, recommended)
python scripts/train_voxcpm_finetune.py \
  --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
  --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

# WebUI for training & inference
python lora_ft_webui.py  # then open http://localhost:7860
```

Full guide → Fine-tuning Guide (data preparation, configuration, training, LoRA hot-swapping, FAQ)
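LoRA is parameter-efficient because each adapted weight matrix stays frozen and only a low-rank update is trained. The sketch below counts trainable parameters for a single linear layer under the standard LoRA formulation. The layer size and rank are illustrative values, not VoxCPM2's actual configuration, which is defined in the YAML files referenced above.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-r update B @ A, with A: r x d_in and B: d_out x r."""
    return rank * (d_in + d_out)


# Illustrative numbers only (hypothetical layer size and rank).
d_in, d_out, rank = 2048, 2048, 32
ratio = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"LoRA trains {ratio:.2%} of this layer's weights")
```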
| Project | Description |
|---|---|
| Nano-vLLM | High-throughput, fast GPU serving |
| vLLM-Omni | Official vLLM omni-modal serving with VoxCPM2 support; PagedAttention, OpenAI-compatible API |
| VoxCPM.cpp | GGML/GGUF: CPU, CUDA, and Vulkan inference |
| VoxCPM-ONNX | ONNX export for CPU inference |
| VoxCPMANE | Apple Neural Engine backend |
| voxcpm_rs | Rust reimplementation |
| ComfyUI-VoxCPM | Node-based ComfyUI workflow |
| ComfyUI_RH_VoxCPM | Full-featured VoxCPM2 ComfyUI workflow with multi-speaker generation, LoRA, and automatic speech recognition (ASR) |
| ComfyUI-VoxCPMTTS | ComfyUI text-to-speech (TTS) extension |
| TTS WebUI | Browser-based text-to-speech (TTS) extension |
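See the full Ecosystem page in the documentation. Community projects are not officially maintained by OpenBMB. Built something cool? Open an issue or PR to add it!

If you find VoxCPM helpful, please consider citing our work and starring the repository ⭐!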
```bib
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation
             and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

VoxCPM model weights and code are open-sourced under the Apache-2.0 license.