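TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment

A unified speech-language model that synchronizes speech and text into a single, cohesive stream via 1:1 alignment.

Text-Acoustic Dual Alignment Large Language Model

TADA is a unified speech-language model that synchronizes speech and text into a single, cohesive stream via 1:1 alignment. With a novel tokenizer and architecture design, TADA achieves high-fidelity synthesis and generation at a fraction of the computational overhead required by traditional models.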

⭐️ Paper: https://arxiv.org/abs/2602.23068
⭐️ Demo 1: https://huggingface.co/spaces/fffiloni/tada-dual-alignment-tts-demo
⭐️ Demo 2: https://huggingface.co/spaces/HumeAI/tada
⭐️ Repository: https://github.com/HumeAI/tada
⭐️ Blog post: https://www.hume.ai/blog/opensource-tada

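Key Features

1:1 Token Alignment: Unlike standard models, TADA's tokenizer encodes audio into a sequence of vectors whose count exactly matches the number of text tokens.
Dynamic Duration Synthesis: As a TTS model, it generates the full speech segment for a text token in a single autoregressive step, regardless of the segment's length. This removes the need for fixed-frame-rate processing.
Dual-Stream Generation: In speech-language modeling mode, it generates a text token and the speech for the preceding token simultaneously, keeping the same context length as text-only generation and adding minimal overhead.
Efficiency and Reliability: TADA delivers expressive, natural-sounding speech while substantially reducing the computational cost associated with fixed audio frame rates.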
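
The dual-stream behavior described above can be pictured as a one-step delay between the text stream and the speech stream. The sketch below is a toy schedule only, not TADA's implementation: `dual_stream_schedule` is a hypothetical helper, and the final flush step (so that the last token also receives speech) is an assumption of this sketch rather than something the README states.

```python
from typing import Iterator, List, Optional, Tuple


def dual_stream_schedule(
    tokens: List[str],
) -> Iterator[Tuple[int, Optional[str], Optional[str]]]:
    """Yield (step, text_token, speech_for) with speech delayed by one step.

    At step t the text stream emits token t while the speech stream emits
    audio for token t-1. The trailing flush step is an assumption made here
    so that the last token also gets speech.
    """
    previous: Optional[str] = None
    for step, token in enumerate(tokens):
        yield step, token, previous
        previous = token
    if previous is not None:
        yield len(tokens), None, previous


schedule = list(dual_stream_schedule(["Please", " call", " Stella"]))
for step, text_token, speech_for in schedule:
    print(step, repr(text_token), repr(speech_for))
```

In this toy schedule, three text tokens take four steps, so the sequence length stays within one step of text-only generation.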

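How It Works

Tokenization Scheme

TADA unifies the two modalities by ensuring that every word or subword token has exactly one corresponding speech vector. This synchronized stream lets the model "understand" the precise timing of speech relative to text.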
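
To make the 1:1 constraint concrete, here is a minimal sketch that collapses a variable number of audio frames into exactly one vector per text token. Mean pooling is a stand-in chosen for illustration; this README does not describe how the TADA encoder actually computes its vectors. The helper `pool_frames_per_token`, the frame values, and the frame spans are all hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def pool_frames_per_token(
    frames: List[Vector], spans: List[Tuple[int, int]]
) -> List[Vector]:
    """Mean-pool frame features over each token's [start, end) frame span."""
    pooled: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid frame span ({start}, {end})")
        chunk = frames[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


tokens = ["The", " exam", "ination"]
# Six toy 2-dimensional frame features (made-up values).
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
# Hypothetical alignment: the frame span covered by each token.
spans = [(0, 1), (1, 3), (3, 6)]

token_vectors = pool_frames_per_token(frames, spans)
durations = [end - start for start, end in spans]

# One vector and one duration per text token, whatever the audio length.
print(len(tokens), len(token_vectors), durations)
```

The point of the sketch is the shape of the result: the acoustic sequence has the same length as the text sequence, and timing is carried by a per-token duration instead of by the number of frames.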

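Dynamic Autoregression

Most TTS models need a fixed number of steps to generate one second of audio (for example, 50 frames per second). TADA removes this constraint:

Each autoregressive step corresponds to one text token.
The model dynamically decides the duration and prosody of that specific token.
This yields more natural speech flow and eliminates transcript hallucination.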
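
A quick back-of-the-envelope comparison shows what one step per token means for sequence length. The sketch uses the 50 frames-per-second figure from the paragraph above and the "34 tokens | 10.50s audio" figures from the alignment printout later in this README. It counts autoregressive steps only and says nothing about the cost of each step; the helper names are hypothetical.

```python
import math


def fixed_rate_steps(duration_s: float, frames_per_second: int) -> int:
    """Steps needed when every audio frame is one autoregressive step."""
    return math.ceil(duration_s * frames_per_second)


def token_aligned_steps(num_text_tokens: int) -> int:
    """Steps needed when every text token is one autoregressive step."""
    return num_text_tokens


duration_s = 10.5  # seconds of audio in the alignment example
num_tokens = 34    # text tokens in the same example

fixed = fixed_rate_steps(duration_s, frames_per_second=50)
aligned = token_aligned_steps(num_tokens)
ratio = fixed / aligned

print(f"fixed-rate: {fixed} steps, token-aligned: {aligned} steps, ratio: {ratio:.1f}x")
```

For this example the fixed-rate scheme needs 525 steps against 34 token-aligned steps, roughly a 15x difference in sequence length.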

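Evaluation

Prerequisites

TADA models are built on Meta Llama 3.2. Before using TADA, you must request access to the Llama models:

Visit meta-llama/Llama-3.2-1B or meta-llama/Llama-3.2-3B on the Hugging Face Hub and accept the license agreement.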

Installation

Install from the GitHub repository

pip install git+https://github.com/HumeAI/tada.git

Install from source (run inside a local clone of the repository)

pip install -e .

Models

We provide several model checkpoints:

Model	Base Model	HuggingFace Hub
TADA-1B	Llama 3.2 1B	`HumeAI/tada-1b`
TADA-3B-ml	Llama 3.2 3B	`HumeAI/tada-3b-ml`

All models use the same encoder (HumeAI/tada-codec) and can be loaded through the same API.

Run Inference

Text-to-Speech

import torch
import torchaudio

from tada.modules.encoder import Encoder, EncoderOutput
from tada.modules.tada import TadaForCausalLM

device = "cuda"

# Encoder is loaded separately (not inside the model)
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder").to(device)
model = TadaForCausalLM.from_pretrained("HumeAI/tada-3b-ml", torch_dtype=torch.bfloat16).to(device)

audio, sample_rate = torchaudio.load("samples/ljspeech.wav")
audio = audio.to(device)
prompt_text = "The examination and testimony of the experts, enabled the commission to conclude that five shots may have been fired."
prompt = encoder(
    audio, text=[prompt_text], sample_rate=sample_rate
)

# Optional: save prompt to skip encoder on future runs
# prompt.save("prompt_cache.pt")
# prompt = EncoderOutput.load("prompt_cache.pt", device=device)

output = model.generate(
    prompt=prompt,
    text="Please call Stella. Ask her to bring these things with her from the store.",
)
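
The optional caching step in the example above can be wrapped in a small load-or-encode helper. This is a sketch under assumptions: `load_or_encode_prompt` is a hypothetical function, not part of the tada package, and it relies only on the `prompt.save(...)` and `EncoderOutput.load(...)` calls that appear in the commented lines of the example. The encoder call and the loader are passed in as callables so the helper itself has no dependency on tada.

```python
import os
from typing import Any, Callable


def load_or_encode_prompt(
    cache_path: str,
    encode_fn: Callable[[], Any],
    load_fn: Callable[[str], Any],
) -> Any:
    """Return the cached prompt if the file exists, else encode and cache it."""
    if os.path.exists(cache_path):
        return load_fn(cache_path)
    prompt = encode_fn()
    prompt.save(cache_path)  # same call as in the commented example above
    return prompt


# Intended usage with the objects from the example above (not executed here):
# prompt = load_or_encode_prompt(
#     "prompt_cache.pt",
#     encode_fn=lambda: encoder(audio, text=[prompt_text], sample_rate=sample_rate),
#     load_fn=lambda path: EncoderOutput.load(path, device=device),
# )
```

The helper does not detect a stale cache. If the reference audio or transcript changes, delete the cache file or use a different path.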

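Multilingual Generation

TADA supports multilingual speech synthesis through language-specific aligners. Pass the language parameter when loading the encoder to use the aligner for the target language.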

import torch
import torchaudio

from tada.modules.encoder import Encoder
from tada.modules.tada import TadaForCausalLM

device = "cuda"
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder", language="ja").to(device)
model = TadaForCausalLM.from_pretrained("HumeAI/tada-3b-ml", torch_dtype=torch.bfloat16).to(device)

# Load a reference audio clip in the target language
audio, sample_rate = torchaudio.load("samples/ja_prompt.wav")
audio = audio.to(device)

# For non-English prompts, provide the transcript so the encoder uses forced alignment
# instead of the built-in ASR (which is English-only)
prompt_text = "このムキムキのお兄さんがいるし バーだし少し高そうだと思いますよねこのバーの料金設定は良心的でした まあそんなに高くなかったです"
prompt = encoder(audio, text=[prompt_text], sample_rate=sample_rate)

output = model.generate(
    prompt=prompt,
    text="今日はとても良い天気ですね。散歩に行きましょう。",
)

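Supported languages: ar, ch, de, es, fr, it, ja, pl, pt. When language is not specified, the default English aligner is used.

Note: For non-English prompts, you should provide the transcript of the reference audio via the text parameter, because the encoder's built-in ASR is English-only. Without a transcript, generation still works, but alignment quality degrades.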
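
Since the list of language codes is short, it can help to validate a code before loading the encoder. The sketch below is a hypothetical helper, not part of the tada package; how the library itself reacts to an unsupported code is not documented here. It uses only the codes listed above and treats `None` as "use the default English aligner".

```python
from typing import Optional

# Language codes listed in this README. English is the default when no code is given.
SUPPORTED_LANGUAGES = ("ar", "ch", "de", "es", "fr", "it", "ja", "pl", "pt")


def resolve_language(code: Optional[str]) -> Optional[str]:
    """Normalize and validate a language code before loading the encoder."""
    if code is None:
        return None  # caller should omit `language` and get the English aligner
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"unsupported language {code!r}; expected one of {SUPPORTED_LANGUAGES} "
            "or None for the default English aligner"
        )
    return normalized


print(resolve_language(" JA "))
print(resolve_language(None))
```

The returned value can then be passed as the language argument to Encoder.from_pretrained, or the argument can be omitted when the helper returns `None`.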

You can inspect the prompt alignment to verify that it is correct:

prompt.print_alignment(model.tokenizer)

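This prints a dot-span visualization of the token-to-audio alignment, in which dots represent frame gaps and each token appears at its aligned position: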

34 tokens | 10.50s audio
······The··exam····ination··and·····test···imony··of···the

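If the alignment looks wrong (tokens bunched together, missing tokens, or nonsensical text), check that you provided the correct transcript.
This matters most for non-English prompts, where the built-in ASR cannot be used.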
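
To clarify how to read the dot-span format, the following sketch rebuilds the sample line above from (token, gap) pairs, printing one dot per gap frame before each token. It is an illustration of the format only and not the implementation of print_alignment; `render_dot_spans` is a hypothetical helper, and the gap counts are read off the sample output.

```python
from typing import List, Tuple


def render_dot_spans(aligned: List[Tuple[str, int]], trailing_gap: int = 0) -> str:
    """Render (token, gap_before) pairs as a dot-span line, one dot per gap frame."""
    parts: List[str] = []
    for token, gap_before in aligned:
        if gap_before < 0:
            raise ValueError(f"negative gap before token {token!r}")
        parts.append("·" * gap_before + token)
    parts.append("·" * trailing_gap)
    return "".join(parts)


# Gap counts read off the sample output above.
aligned = [
    ("The", 6), ("exam", 2), ("ination", 4), ("and", 2),
    ("test", 5), ("imony", 3), ("of", 2), ("the", 3),
]
line = render_dot_spans(aligned)
print(line)
```

Read this way, a healthy alignment shows tokens spread across the clip with gaps of varying size, while a bad transcript tends to show up as long dot runs or tokens packed together.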

Speech Continuation

To generate a text-speech continuation of the prompt, pass num_extra_steps:

output = model.generate(
    prompt=prompt,
    num_extra_steps=50
)

📚 Citation

If you use this project in your research, please cite our paper:

@article{dang2026tada,
  title={TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment},
  author={Dang, Trung and Rao, Sharath and Gupta, Ananya and Gagne, Christopher and Tzirakis, Panagiotis and Baird, Alice and Cłapa, Jakub Piotr and Chin, Peter and Cowen, Alan},
  journal={arXiv preprint arXiv:2602.23068},
  year={2026}
}

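Contact

Hume AI is an empathic AI research company. We research the datasets, tools, and models needed to give AI models empathy in service of human well-being. If you are interested in any of our products or in research collaborations, please reach out to us at hello@hume.ai.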

Acknowledgements

This project is built on Llama 3.2.

Llama 3.2 is licensed under the Llama 3.2 Community License.