Scenema Audio

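Zero-shot expressive voice cloning and speech generation.

Every existing text-to-speech system can turn text into sound, but none of them truly performs. Scenema Audio generates speech with intent, pacing control, breath control, and emotional arcs that shift within a single generation, all from one text prompt that describes not only what to say but how to say it.

Scenema Audio is built on an audio diffusion transformer extracted from LTX 2.3's 22-billion-parameter audiovisual model, which learned how people actually sound in real situations: angry, laughing, whispering, crying, exhausted, afraid.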

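Features

Emotional performance: anger, grief, joy, fear, exhaustion. Action tags let the emotional state shift within a single generation.
Child voices: six-year-olds, toddlers, teenagers. They sound natural rather than like pitch-shifted adult voices.
Scene-aware audio: describe the environment and the model generates speech with rain, thunder, crowds, or any other ambience.
Zero-shot voice cloning: provide 10-20 seconds of reference audio with some emotional variation. The model transfers that voice identity onto any emotional performance. No fine-tuning, no enrollment.
Long-form narration: generate audio of any length by automatically splitting the text and keeping the voice consistent across chunks.
Multilingual support: English, German, French, Spanish, Italian, Portuguese, Japanese, Chinese, Korean, Russian, Arabic, Hindi, Swahili.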

Model Checkpoints

File	Size	Description
`scenema-audio-transformer.safetensors`	9.8 GB	Audio diffusion transformer (bf16)
`scenema-audio-transformer-int8.safetensors`	4.9 GB	Audio diffusion transformer (INT8, same quality)
`scenema-audio-pipeline.safetensors`	6.7 GB	Audio VAE decoder + vocoder + text projection
`scenema-audio-vae-encoder.safetensors`	42.7 MB	Audio VAE encoder for reference voice encoding

Quick Start

git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio

export HF_TOKEN=your_huggingface_token
docker compose up

Models are downloaded on first launch (about 38 GB) and cached in a Docker volume. See the GitHub repository for full documentation.

Prompt Format

<speak voice="VOICE_DESCRIPTION" gender="male|female"
       scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE">
  <action>Performance direction.</action>
  Speech text here.
</speak>

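Attribute	Required	Default	Description
`voice`	Yes		Detailed voice description. Determines vocal quality, emotion, accent, age, timbre, and delivery.
`gender`	Yes		`"male"` or `"female"`. Controls pronoun assignment in the compiled prompt.
`scene`	No		Environmental context. Determines the ambient audio around the speech.
`language`	No	`"en"`	Language code.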
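
The attributes above can be assembled programmatically, which helps when voice descriptions or spoken text contain characters such as `&` or quotes. The following is a minimal sketch, not part of the project: `build_speak_prompt` and its `segments` argument are hypothetical names, and this README does not state whether the server's parser accepts XML entity escapes such as `&amp;`.

```python
from xml.sax.saxutils import escape, quoteattr


def build_speak_prompt(voice, gender, segments, scene=None, language=None):
    """Assemble a <speak> prompt string (hypothetical helper).

    `segments` is a list of (kind, text) pairs, where kind is "action"
    for a performance direction or "text" for spoken words.
    """
    if gender not in ("male", "female"):
        raise ValueError('gender must be "male" or "female"')

    # quoteattr() adds the surrounding quotes, escapes &, < and >,
    # and handles quotes embedded in the value.
    attrs = [f"voice={quoteattr(voice)}", f"gender={quoteattr(gender)}"]
    if scene is not None:
        attrs.append(f"scene={quoteattr(scene)}")
    if language is not None:
        attrs.append(f"language={quoteattr(language)}")

    lines = ["<speak " + " ".join(attrs) + ">"]
    for kind, text in segments:
        if kind == "action":
            lines.append(f"  <action>{escape(text)}</action>")
        elif kind == "text":
            lines.append(f"  {escape(text)}")
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    lines.append("</speak>")
    return "\n".join(lines)


prompt = build_speak_prompt(
    "Middle-aged man, warm but weathered.",
    "male",
    [("action", "Calm, almost casual."), ("text", "I used to think I had time.")],
)
print(prompt)
```

Keeping directions and speech as separate segments makes it harder to leave an `<action>` tag unclosed by accident.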

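Voice Description

The voice attribute is the primary control. The richer and more specific the description, the better the result:

Vocal qualities: timbre, pitch, breathiness, raspiness, resonance
Emotional state: angry, tender, exhausted, excited, grieving
Speaking style: pace, emphasis, pauses, articulation
Character archetypes: "Think Tony Soprano having a breakdown"
Age and gender: child, elderly, young woman, teenage boy
Accent: British, American Southern, New Jersey Italian-American

Action Tags

<action> tags are stage directions that shape how the speech is delivered. Place them between speech segments to direct emotional shifts, pacing, and vocal state: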

<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
  <action>Long pause. Deep breath. Raw but steady.</action>
  And I realized I hadn't called my son in six months.
</speak>

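Voice Cloning

Provide 10-20 seconds of reference audio with some emotional variation. The model generates an expressive performance from the prompt and transfers the reference speaker's identity onto it.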

{
  "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it</action>What are you waiting for?!</speak>",
  "reference_voice_url": "https://example.com/reference.wav"
}

Any voice can perform any emotion, even if that voice was never recorded in that emotional state.
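
Because the prompt is XML embedded in a JSON string, every double quote inside it must be escaped, and doing that by hand is error-prone. A minimal sketch of letting a JSON library handle it, using only the two fields shown in the request above:

```python
import json

# The XML prompt is written as an ordinary Python string.
prompt = (
    '<speak voice="Gravelly male voice, fast talking, rough." gender="male">'
    "<action>He completely loses it</action>"
    "What are you waiting for?!"
    "</speak>"
)

payload = {
    "prompt": prompt,
    "reference_voice_url": "https://example.com/reference.wav",
}

# json.dumps() escapes the double quotes inside the XML attributes as \".
body = json.dumps(payload)
print(body)
```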

Examples

Emotional Performance

<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
       gender="male" scene="A dimly lit office, late at night">
  <action>He stands up slowly, voice dangerously low</action>
  You come into my house, you eat my food, and then you got the nerve
  to tell me how to run my business.
  <action>Voice rising, finger pointing</action>
  I built this thing from nothing while you were sitting on your ass.
</speak>

Child Voice

<speak voice="A six-year-old girl, bright and excited, speaking fast
with breathless enthusiasm. Slight lisp on S sounds."
gender="female">
  Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>

Scene-Aware Audio

<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
       gender="male" scene="Open dock in a thunderstorm, heavy rain"
       shot="scene">
  <sound>Heavy rain and wind howling</sound>
  <action>He shouts over the storm</action>
  Get the lines! She is pulling loose!
  <sound>Thunder cracks overhead</sound>
  Move! I said move!
</speak>

API Reference

POST /generate

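Field	Type	Default	Description
`prompt`	string	required	`<speak>` XML string
`mode`	string	`"generate"`	`"generate"` runs the full pipeline. `"voice_design"` produces a 15-second voice preview.
`reference_voice_url`	string	`null`	URL of the reference audio for zero-shot voice cloning. Ideally 10-20 seconds with emotional variation.
`background_sfx`	bool	`false`	Keep generated sound effects in the output.
`validate`	bool	`true`	Verify the speech with Whisper and retry if the output is garbled.
`seed`	int	`-1`	Generation seed. `-1` means random.
`pace`	float	`1.5`	Duration allocation multiplier. Higher values produce slower speech.
`min_match_ratio`	float	`0.90`	Whisper validation threshold (0.0-1.0).
`skip_vc`	bool	`false`	Skip voice conversion post-processing.
`vc_steps`	int	`25`	SeedVC diffusion steps (10-50).
`vc_cfg_rate`	float	`0.5`	SeedVC guidance rate (0.0-1.0).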
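
The fields above can be checked on the client before a request is sent. The sketch below builds, but does not send, a request for POST /generate. Several things here are assumptions rather than documented behavior: the `http://localhost:8000` address, the `application/json` content type, and the helper name `build_generate_request`. The numeric ranges come from the table.

```python
import json
import urllib.request

# Ranges taken from the field table; all other options pass through unchecked.
RANGES = {
    "min_match_ratio": (0.0, 1.0),
    "vc_steps": (10, 50),
    "vc_cfg_rate": (0.0, 1.0),
}


def build_generate_request(base_url, prompt, **options):
    """Return an unsent urllib Request for POST /generate (hypothetical helper)."""
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    body = json.dumps({"prompt": prompt, **options}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


# BASE_URL is a placeholder; use the address your container listens on.
BASE_URL = "http://localhost:8000"
request = build_generate_request(
    BASE_URL,
    '<speak voice="Warm narrator." gender="female">Hello there.</speak>',
    seed=42,
    vc_steps=25,
)
# To send it, pass `request` to urllib.request.urlopen() and parse the JSON reply.
```

Omitted fields are left out of the body so that the server's own defaults apply.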

Response

Returns JSON containing base64-encoded WAV audio:

{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42
  }
}
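
A client has to decode the `audio` field before it can play or store the result. The following sketch writes the WAV to disk and reads the header back as a sanity check. It assumes the WAV is PCM-encoded, which Python's `wave` module requires, and it treats any status other than `"succeeded"` as a failure, since this README shows no other status values. `save_audio` is a hypothetical helper name.

```python
import base64
import io
import wave


def save_audio(response, path):
    """Decode the base64 WAV from a /generate response and write it to path.

    Returns the duration in seconds computed from the WAV header.
    """
    if response.get("status") != "succeeded":
        raise RuntimeError(f"generation did not succeed: {response.get('status')}")

    wav_bytes = base64.b64decode(response["audio"])
    with open(path, "wb") as f:
        f.write(wav_bytes)

    # Parse the header to confirm the payload is a readable WAV.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

The returned duration can be compared against `metadata.duration_s` to catch a truncated download.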

Architecture

XML prompt (voice + scene + action tags + text)
  -> Gemma 3 12B text encoding
  -> 8-step distilled latent diffusion
  -> Audio VAE decoding
  -> MelBandRoFormer vocal separation (strips SFX unless background_sfx=true)
  -> SeedVC voice identity transfer (when reference provided or multi-chunk)
  -> Output WAV (48kHz stereo)

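For longer text, the system splits at sentence boundaries using Kokoro phoneme-level duration estimates, and keeps the voice consistent across chunks through A2V latent conditioning.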
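
To illustrate the idea of sentence-boundary splitting under a duration budget, here is a simplified sketch. It is not the project's implementation: a fixed words-per-second rate stands in for the Kokoro phoneme-level estimate, the sentence splitter is a plain regular expression, and all names are hypothetical. The 15-second budget matches the generation window this README describes.

```python
import re

MAX_CHUNK_SECONDS = 15.0  # per-chunk generation window
WORDS_PER_SECOND = 2.5    # crude stand-in for a phoneme-level estimate


def estimate_seconds(sentence):
    """Rough duration guess from the word count (an assumption, not Kokoro)."""
    return len(sentence.split()) / WORDS_PER_SECOND


def split_into_chunks(text, max_seconds=MAX_CHUNK_SECONDS):
    """Greedily pack whole sentences into chunks under the duration budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_seconds = [], [], 0.0
    for sentence in sentences:
        seconds = estimate_seconds(sentence)
        # Start a new chunk when this sentence would overflow the budget.
        if current and current_seconds + seconds > max_seconds:
            chunks.append(" ".join(current))
            current, current_seconds = [], 0.0
        current.append(sentence)
        current_seconds += seconds
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than the budget is kept whole in this sketch; a real splitter would need a fallback for that case, such as breaking at clause boundaries.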

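VRAM Requirements

VRAM	Audio model	Gemma	Notes
16 GB	INT8 (4.9 GB)	CPU streaming	Requires 32 GB of system RAM. About 7 seconds of encoding per chunk.
24 GB	INT8 (4.9 GB)	NF4 on GPU (about 8 GB)	Default configuration. About 0.2 seconds of encoding per chunk.
48 GB	bf16 (9.8 GB)	bf16 on GPU (24 GB)	Best quality. All models stay resident.

The VRAM strategy is detected automatically. SageAttention 2 is recommended for all configurations.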

Performance

Benchmarked on an NVIDIA RTX 4090 (24 GB), producing about 55 seconds of audio:

Configuration	Total time	Real-time factor
bf16 + bf16 streaming	83 s	0.66x
INT8 + NF4 (all on GPU)	35 s	1.57x
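
The table does not define the real-time factor, but its figures are consistent with audio duration divided by processing time, so values above 1x mean faster than real time. A short check of that reading, using the roughly 55 seconds of audio stated for the benchmark:

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio produced per second of wall-clock time."""
    return audio_seconds / processing_seconds


AUDIO_SECONDS = 55  # "about 55 seconds" in the benchmark description

for label, total_seconds in [("bf16 + bf16 streaming", 83), ("INT8 + NF4", 35)]:
    print(f"{label}: {real_time_factor(AUDIO_SECONDS, total_seconds):.2f}x")
# prints:
# bf16 + bf16 streaming: 0.66x
# INT8 + NF4: 1.57x
```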

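Limitations

Pronunciation: occasionally stumbles on complex multisyllabic words and proper nouns.
15-second generation window: each chunk is capped at about 15 seconds. Longer text is split automatically.
Emotional range in voice cloning: identity transfer can dampen emotional extremes. Use a vivid archetype in the voice description and provide reference audio with natural emotional variation (10-20 seconds, not monotone).
Multilingual pronunciation: switching languages mid-utterance can cause the voice to drift. Use a separate request for each language.
Generation speed: 3-8 seconds per 15-second chunk, depending on hardware.
Reference audio quality: a low-quality reference degrades the output. Use clean audio with some emotional variation.
Gemma 3 12B is gated: you must accept Google's terms of use and have a HuggingFace token with access.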

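Acknowledgements

LTX-2 by Lightricks, for the base audiovisual model
Gemma 3 by Google, for the text encoder
SeedVC by Plachta, for voice refinement
Kokoro by hexgrad, for duration estimation
SageAttention, for attention acceleration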

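License

Model weights are released under the LTX-2 Community License Agreement. Scenema Audio's audio diffusion transformer is derived from LTX 2.3's audiovisual model, and its weights are subject to the same terms.

The inference code and server are released under the MIT License.

Gemma 3 12B (the text encoder) is a gated model and requires accepting Google's terms of use.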