VoxCPM是一款全新的无分词器文本转语音(TTS)系统,重新定义了语音合成的真实感。它通过在连续空间中对语音进行建模,克服了离散分词的局限性,实现了两项旗舰功能:上下文感知语音生成和高保真零样本语音克隆。
与将语音转换为离散标记的主流方法不同,VoxCPM采用端到端扩散自回归架构,直接从文本生成连续语音表示。该系统基于MiniCPM-4骨干网络构建,通过层级语言建模和FSQ约束实现隐式语义-声学解耦,显著提升了表现力和生成稳定性。
pip install voxcpm默认情况下,首次运行脚本时模型会自动下载,您也可以提前手动下载模型。
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B",local_files_only=local_files_only)from modelscope import snapshot_download
snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
snapshot_download('iic/SenseVoiceSmall')import soundfile as sf
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
wav = model.generate(
text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
prompt_wav_path=None, # optional: path to a prompt speech for voice cloning
prompt_text=None, # optional: reference text
cfg_value=2.0, # LM guidance on LocDiT, higher for better adherence to the prompt, but maybe worse
inference_timesteps=10, # LocDiT inference timesteps, higher for better result, lower for fast speed
normalize=True, # enable external TN tool
denoise=True, # enable external Denoise tool
retry_badcase=True, # enable retrying mode for some bad cases (unstoppable)
retry_badcase_max_times=3, # maximum retrying times
retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow pace speech
)
sf.write("output.wav", wav, 16000)
print("saved: output.wav")安装完成后,入口点为 voxcpm(或使用 python -m voxcpm.cli)。
# 1) Direct synthesis (single text)
voxcpm --text "Hello VoxCPM" --output out.wav
# 2) Voice cloning (reference audio + transcript)
voxcpm --text "Hello" \
--prompt-audio path/to/voice.wav \
--prompt-text "reference transcript" \
--output out.wav \
--denoise
# 3) Batch processing (one text per line)
voxcpm --input examples/input.txt --output-dir outs
# (optional) Batch + cloning
voxcpm --input examples/input.txt --output-dir outs \
--prompt-audio path/to/voice.wav \
--prompt-text "reference transcript" \
--denoise
# 4) Inference parameters (quality/speed)
voxcpm --text "..." --output out.wav \
--cfg-value 2.0 --inference-timesteps 10 --normalize
# 5) Model loading
# Prefer local path
voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
# Or from Hugging Face (auto download/cache)
voxcpm --text "..." --output out.wav \
--hf-model-id openbmb/VoxCPM-0.5B --cache-dir ~/.cache/huggingface --local-files-only
# 6) Denoiser control
voxcpm --text "..." --output out.wav \
--no-denoiser --zipenhancer-path iic/speech_zipenhancer_ans_multiloss_16k_base
# 7) Help
voxcpm --help
python -m voxcpm.cli --help运行 python app.py 即可启动用户界面,通过该界面可进行声音克隆和声音创建操作。
欢迎来到 VoxCPM 厨房!按照这份食谱,你将能“烹饪”出完美的生成语音。让我们开始吧。
首先,选择你喜欢的文本输入方式:
这是赋予音频独特声音的“秘制酱料”。
准备“上菜”了!但对于想要调整风味的大厨,这里有两种关键“调料”。
创作愉快!🎉 从默认设置开始,然后根据你的项目需求进行调整。厨房由你掌控!
VoxCPM 在公开的零样本 TTS 基准测试中取得了具有竞争力的结果:
| 模型 | 参数规模 | 是否开源 | 英文测试集 | 中文测试集 | 困难测试集 | |||
|---|---|---|---|---|---|---|---|---|
| 词错误率/%⬇ | 相似度/%⬆ | 字错误率/%⬇ | 相似度/%⬆ | 字错误率/%⬇ | 相似度/%⬆ | |||
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | - | - |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | - | - |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | - | - |
| VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |
| 模型 | 中文 | 英文 | 困难中文 | 困难英文 | ||||
|---|---|---|---|---|---|---|---|---|
| 字错误率/%⬇ | 词错误率/%⬇ | 字错误率/%⬇ | 相似度/%⬆ | 语音质量评分⬆ | 词错误率/%⬇ | 相似度/%⬆ | 语音质量评分⬆ | |
| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
| SparkTTS | 5.15 | 11.0 | - | - | - | - | - | - |
| GPT-SoVits | 7.34 | 12.5 | - | - | - | - | - | - |
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 72.6 | 3.81 | 11.96 | 66.7 | 3.95 |
| OpenAudio-s1-mini | 4.00 | 5.54 | 18.1 | 58.2 | 3.77 | 12.4 | 55.7 | 3.89 |
| IndexTTS2 | 3.58 | 4.45 | 12.8 | 74.6 | 3.65 | - | - | - |
| HiggsAudio-v2 | 9.54 | 7.89 | 41.0 | 60.2 | 3.39 | 10.3 | 61.8 | 3.68 |
| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
| VoxCPM | 3.40 | 4.04 | 12.9 | 66.1 | 3.59 | 7.89 | 64.3 | 3.74 |
VoxCPM模型权重和代码基于Apache-2.0许可证开源。