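Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Unified Audio Schema is a holistic audio supervision framework that decouples supervision over transcription, paralinguistic, and non-linguistic events and then recombines it.

This repository provides model checkpoints trained with Unified Audio Schema. For the full codebase, see the corresponding GitHub repository.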

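Model Details

Property	Value
Input modalities	Text and audio
Output modalities	Text and audio
Base LLM	Qwen2.5-7B
Audio encoder	AuT encoder
Input audio representation frame rate	12.5 Hz
Output audio token codebook size	8,192
Output audio token frame rate	25 Hz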

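Notes:

The model supports interleaved text and audio for both input and output, which allows flexible multimodal interaction.
Reconstructing a speech waveform from the generated audio tokens requires the StableToken decoder.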
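
The frame rates in the table above translate directly into sequence lengths. The following is a minimal sketch of that arithmetic, assuming the rates apply uniformly and ignoring any padding or boundary rounding the encoder and decoder may apply. The function names are hypothetical and are not part of the repository.

```python
INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frame rate (from the table)
OUTPUT_TOKEN_RATE_HZ = 25.0  # output audio token frame rate (from the table)


def input_frames(duration_s: float) -> int:
    """Approximate number of input frames for a clip of the given duration."""
    return int(duration_s * INPUT_FRAME_RATE_HZ)


def output_duration_s(num_audio_tokens: int) -> float:
    """Approximate seconds of speech represented by generated audio tokens."""
    return num_audio_tokens / OUTPUT_TOKEN_RATE_HZ


if __name__ == "__main__":
    # A 10-second input clip maps to roughly 125 input frames.
    print(input_frames(10.0))
    # 52 output audio tokens correspond to roughly 2.08 seconds of speech.
    print(output_duration_s(52))
```

Under these assumptions, each second of input audio costs about 12.5 positions of context, and each second of generated speech costs about 25 audio tokens of the generation budget.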

Quick Start

Installation

git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
cd Unified_Audio_Schema && pip install -r requirements.txt

Download Checkpoints

# Model weights
huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

# StableToken decoder (required for speech waveform reconstruction)
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken

Inference

import torch
import torchaudio
from src.model import UASAudio

model = UASAudio(
	model_path="checkpoints/Unified_Audio_Schema",
	audio_decoder_path="checkpoints/StableToken/decoder",
	device="cuda" if torch.cuda.is_available() else "cpu",
)

dialogue_system_prompt = (
	"User will provide you with a speech instruction. Do it step by step. "
	"First, think about the instruction and respond in a interleaved manner, "
	"with 13 text token followed by 52 audio tokens."
)

messages = [
	{"role": "system", "content": dialogue_system_prompt},
	{
		"role": "user",
		"content": [
			{"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
		],
	},
	{"role": "assistant", "content": None},
]

generation_config = {
    "max_new_tokens": 4096,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "top_p": 0.9,
    "do_sample": True
}

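# The call returns a triple; the first element is not used in this example.
_, text, audio_tokens = model(messages, **generation_config)
print(text)

# If audio tokens were generated, decode them to a waveform with the
# StableToken decoder and save the result.
if len(audio_tokens) > 0:
	audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
	torchaudio.save("response.wav", audio_array, sampling_rate)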
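
The system prompt above asks for an interleaved response of 13 text tokens followed by 52 audio tokens. To illustrate that layout, here is a minimal sketch that splits a flat token sequence by this repeating pattern. It assumes a strict 13/52 cycle and a plain list of integer IDs, which may not match the model's actual output format. `split_interleaved` is a hypothetical helper, not a repository function; the `model(...)` call already returns text and audio tokens separately.

```python
from typing import List, Sequence, Tuple

TEXT_CHUNK = 13   # text tokens per cycle, taken from the system prompt
AUDIO_CHUNK = 52  # audio tokens per cycle, taken from the system prompt


def split_interleaved(tokens: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split a flat interleaved sequence into text and audio token lists.

    Assumption: the sequence strictly repeats TEXT_CHUNK text tokens
    followed by AUDIO_CHUNK audio tokens. A trailing partial cycle is
    split by the same rule.
    """
    text: List[int] = []
    audio: List[int] = []
    cycle = TEXT_CHUNK + AUDIO_CHUNK
    for i, tok in enumerate(tokens):
        # Position within the current cycle decides the modality.
        if i % cycle < TEXT_CHUNK:
            text.append(tok)
        else:
            audio.append(tok)
    return text, audio


if __name__ == "__main__":
    # Two full cycles of placeholder IDs: 2 * (13 + 52) = 130 tokens.
    text_ids, audio_ids = split_interleaved(list(range(130)))
    print(len(text_ids), len(audio_ids))
```

At the 25 Hz output token rate listed under Model Details, each 52-token audio run corresponds to roughly 2 seconds of speech per 13 text tokens.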

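Supported Scenarios

The model can be applied to a range of audio understanding and generation tasks, including:

Text-input dialogue
Speech-input dialogue
Automatic speech recognition (ASR)
Audio captioning
Text-to-speech (TTS)

For more runnable examples, see example_usage.ipynb in the GitHub repository.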

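Evaluation Highlights

UAS-Audio performs strongly on audio understanding, ASR, and TTS benchmarks, with the highest average audio understanding score (65.2) among the compared models.

Audio Understanding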

Model	MMSU (Percep.)	MMSU (Reason.)	MMSU (Overall)	MMAR (Speech)	MMAR (Sound)	MMAR (Music)	MMAR (Overall)	MMAU (Speech)	MMAU (Sound)	MMAU (Music)	MMAU (Overall)	Avg.
Kimi-Audio	44.8	75.7	59.8	58.5	49.7	33.0	48.0	62.2	75.7	66.8	68.2	58.7
Qwen2.5-Omni	42.7	77.6	58.1	59.9	58.8	40.8	56.7	70.6	78.1	65.9	71.5	62.1
Step-Audio2	42.9	73.2	57.6	61.2	54.6	42.2	56.8	68.2	79.3	68.4	72.7	61.9
Ours	55.7	77.4	66.2	66.0	58.8	45.2	60.1	67.0	70.0	71.3	69.4	65.2

ASR and TTS

Model	ASR (LS-clean)	ASR (AISHELL-1)	TTS (SeedTTS-en)	TTS (SeedTTS-zh)
Qwen2.5-Omni	-	-	2.3	1.4
Step-Audio2	1.9	1.0	2.1	3.2
MiMo-Audio	3.8	1.8	5.4	2.0
Ours	2.2	2.3	1.7	1.4

Citation

If you find Unified Audio Schema or our models useful for your research, please cite:

@misc{zhang2026transcriptionunifiedaudioschema,
	title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs}, 
	author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
	year={2026},
	eprint={2604.12506},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2604.12506},
}

@inproceedings{song2026stabletoken,
	title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
	author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
	booktitle={The Fourteenth International Conference on Learning Representations},
	year={2026},
	url={https://openreview.net/forum?id=17DNmdQ9aU}
}

License

This project is licensed under the license terms of Unified_Audio_Schema.