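Step-Audio 2 is an end-to-end multimodal large language model designed for industry-strength audio understanding and speech conversation.
Advanced speech and audio understanding: Strong performance in automatic speech recognition (ASR) and audio understanding, achieved by comprehending and reasoning over semantic, paralinguistic, and non-vocal information.
Intelligent speech conversation: Natural, context-aware interactions that adapt to diverse conversational scenarios and paralinguistic cues.
Tool calling and multimodal retrieval-augmented generation (RAG): By using tool calling and RAG to access real-world textual and acoustic knowledge, Step-Audio 2 generates responses with fewer hallucinations across diverse scenarios and can switch timbres based on retrieved speech.
State-of-the-art performance: Across a range of audio understanding and conversational benchmarks, Step-Audio 2 achieves state-of-the-art results compared with other open-source and commercial solutions (see the evaluation results and the technical report).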
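To make the tool-calling idea concrete, here is a minimal sketch of how an application might dispatch a tool call emitted by a model. The JSON shape (`name`, `arguments`), the tool names, and the `dispatch` helper are all hypothetical illustrations, not Step-Audio 2's actual output format or API.

```python
import json
from datetime import datetime, timezone


# Hypothetical tools: names and signatures are illustrative assumptions only.
def get_date_time() -> str:
    """Date-and-time tool; takes no parameters."""
    return datetime.now(timezone.utc).isoformat()


def get_weather(city: str) -> str:
    """Placeholder weather tool; a real one would query a weather service."""
    return f"(weather lookup for {city} would go here)"


TOOLS = {"date_time": get_date_time, "weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse an assumed JSON tool call and run the matching tool."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    # Tools without parameters are called with an empty argument dict.
    return TOOLS[name](**call.get("arguments", {}))


if __name__ == "__main__":
    print(dispatch('{"name": "weather", "arguments": {"city": "Shanghai"}}'))
    print(dispatch('{"name": "date_time"}'))
```

In a real system the tool result would be fed back to the model so that the final response is grounded in the retrieved information.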
| Model | 🤗 Hugging Face |
|---|---|
| Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini |
| Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base |
conda create -n stepaudio2 python=3.10
conda activate stepaudio2
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml
git clone https://github.com/stepfun-ai/Step-Audio2.git
cd Step-Audio2
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base
python examples-base.py
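Before running the example script, it can help to confirm that the packages from the pip command above are actually installed in the active environment. The following is a minimal sketch using only the standard library; the `check_environment` helper is not part of the repository, and the only version it checks is the `transformers==4.49.0` pin shown above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the pip command above.
# Only transformers is pinned there; None means "any version".
REQUIRED = {
    "transformers": "4.49.0",
    "torchaudio": None,
    "librosa": None,
    "onnxruntime": None,
    "s3tokenizer": None,
    "diffusers": None,
    "hyperpyyaml": None,
}


def check_environment(required=REQUIRED):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, pinned in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if pinned is not None and installed != pinned:
            problems.append(f"{name}: found {installed}, expected {pinned}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all required packages found"]:
        print(line)
```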
You can scan the QR code below to join our WeChat group for discussion.
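Chinese, Cantonese, and Japanese are evaluated with character error rate (CER); Arabic and English are evaluated with word error rate (WER). N/A indicates that the language is not supported.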
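For readers unfamiliar with these metrics, the sketch below computes WER and CER from the standard definition: edit distance between reference and hypothesis, divided by the reference length. It is an illustration only; the text normalization and scoring scripts used for the reported numbers are not described here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance, spaces ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(f"WER: {100 * wer('the cat sat on the mat', 'the cat sit on mat'):.2f}%")
    # One substituted character out of six.
    print(f"CER: {100 * cer('今天天气很好', '今天天汽很好'):.2f}%")
```

Multiplying by 100, as in the usage lines, gives the percentage form used in the table.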
| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
|---|---|---|---|---|---|---|---|
| English | Common Voice | 9.20 | 9.30 | 7.83 | 8.33 | 5.95 | 6.76 |
| English | FLEURS English | 7.22 | 2.71 | 4.47 | 5.05 | 3.03 | 3.05 |
| English | LibriSpeech clean | 2.92 | 1.75 | 1.49 | 2.93 | 1.17 | 1.33 |
| English | LibriSpeech other | 5.32 | 4.23 | 2.91 | 5.07 | 2.42 | 2.86 |
| English | Average | 6.17 | 4.50 | 4.18 | 5.35 | 3.14 | 3.50 |
| Chinese | AISHELL | 0.98 | 3.52 | 0.64 | 1.17 | 0.63 | 0.78 |
| Chinese | AISHELL-2 | 3.10 | 4.26 | 2.67 | 2.40 | 2.10 | 2.16 |
| Chinese | FLEURS Chinese | 2.92 | 2.62 | 2.91 | 7.01 | 2.68 | 2.53 |
| Chinese | KeSpeech phase1 | 6.48 | 26.80 | 5.11 | 6.45 | 3.63 | 3.97 |
| Chinese | WenetSpeech meeting | 4.90 | 31.40 | 5.21 | 6.61 | 4.75 | 4.87 |
| Chinese | WenetSpeech net | 4.46 | 15.71 | 5.93 | 5.24 | 4.67 | 4.82 |
| Chinese | Average | 3.81 | 14.05 | 3.75 | 4.81 | 3.08 | 3.19 |
| Multilingual | FLEURS Arabian | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| Multilingual | Common Voice yue | 9.20 | 11.10 | 38.90 | 7.89 | 7.90 | 8.32 |
| Multilingual | FLEURS Japanese | N/A | 3.27 | N/A | 10.49 | 3.18 | 4.67 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| In-house | Guangdong accent | 4.99 | 7.83 | 3.76 | 4.03 | 3.81 | 4.44 |
| In-house | Guangxi accent | 3.37 | 7.09 | 4.29 | 3.35 | 4.11 | 3.51 |
| In-house | Shanxi accent | 20.26 | 55.03 | 34.71 | 25.95 | 12.44 | 15.60 |
| In-house | Sichuan dialect | 3.01 | 32.85 | 5.26 | 5.61 | 4.35 | 4.57 |
| In-house | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |
| In-house | Average | 14.66 | 40.49 | 25.52 | 19.40 | 8.85 | 9.85 |
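The per-category averages in the table above appear to be unweighted means of the test sets in that category, rounded to two decimals. The sketch below checks this for three systems on the English rows. The unweighted-mean reading is inferred from the numbers, not stated by the authors, and `mean_2dp` is an illustrative helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# English rows copied from the table above, in row order:
# Common Voice, FLEURS English, LibriSpeech clean, LibriSpeech other.
ENGLISH = {
    "Doubao LLM ASR": ["9.20", "7.22", "2.92", "5.32"],
    "Step-Audio 2": ["5.95", "3.03", "1.17", "2.42"],
    "Step-Audio 2 mini": ["6.76", "3.05", "1.33", "2.86"],
}


def mean_2dp(values):
    """Unweighted mean rounded half-up to two decimals (Decimal avoids float drift)."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for system, scores in ENGLISH.items():
        print(f"{system}: {mean_2dp(scores)}")
```

Using `Decimal` with explicit half-up rounding matters here because a value such as 6.165 is not exactly representable as a binary float and may round down.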
StepEval-Audio-Paralinguistic
| Model | Avg. | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Kimi-Audio | 49.64 | 94 | 50 | 10 | 30 | 48 | 66 | 56 | 40 | 44 | 54 | 54 |
| Qwen-Omni | 44.18 | 40 | 50 | 16 | 28 | 42 | 76 | 32 | 54 | 50 | 50 | 48 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| Step-Audio 2 | 83.09 | 100 | 96 | 82 | 78 | 60 | 86 | 82 | 86 | 88 | 88 | 68 |
| Step-Audio 2 mini | 80.00 | 100 | 94 | 80 | 78 | 60 | 82 | 82 | 68 | 74 | 86 | 76 |
MMAU
| Model | Avg. | Sound | Speech | Music |
|---|---|---|---|---|
| Audio Flamingo 3 | 73.1 | 76.9 | 66.1 | 73.9 |
| Gemini 2.5 Pro | 71.6 | 75.1 | 71.5 | 68.3 |
| GPT-4o Audio | 58.1 | 58.0 | 64.6 | 51.8 |
| Kimi-Audio | 69.6 | 79.0 | 65.5 | 64.4 |
| Omni-R1 | 77.0 | 81.7 | 76.0 | 73.4 |
| Qwen2.5-Omni | 71.5 | 78.1 | 70.6 | 65.9 |
| Step-Audio-AQAA | 49.7 | 50.5 | 51.4 | 47.3 |
| Step-Audio 2 | 78.0 | 83.5 | 76.9 | 73.7 |
| Step-Audio 2 mini | 73.2 | 76.6 | 71.5 | 71.6 |
CoVoST 2 (speech-to-text translation)
| Model | Avg. | English-to-Chinese | Chinese-to-English |
|---|---|---|---|
| GPT-4o Audio | 29.61 | 40.20 | 19.01 |
| Qwen2.5-Omni | 35.40 | 41.40 | 29.40 |
| Step-Audio-AQAA | 28.57 | 37.71 | 19.43 |
| Step-Audio 2 | 39.26 | 49.01 | 29.51 |
| Step-Audio 2 mini | 39.29 | 49.12 | 29.47 |
CVSS (speech-to-speech translation)
| Model | Avg. | English-to-Chinese | Chinese-to-English |
|---|---|---|---|
| GPT-4o Audio | 23.68 | 20.07 | 27.29 |
| Qwen-Omni | 15.35 | 8.04 | 22.66 |
| Step-Audio-AQAA | 27.36 | 30.74 | 23.98 |
| Step-Audio 2 | 30.87 | 34.83 | 26.92 |
| Step-Audio 2 mini | 29.08 | 32.81 | 25.35 |
StepEval-Audio-Toolcall. The date and time tool takes no parameters.
| Model | Objective | Metric | Audio search | Date & Time | Weather | Web search |
|---|---|---|---|---|---|---|
| Qwen3-32B† | Trigger | Precision / Recall | 67.5 / 98.5 | 98.4 / 100.0 | 90.1 / 100.0 | 86.8 / 98.5 |
| Qwen3-32B† | Type | Accuracy | 100.0 | 100.0 | 98.5 | 98.5 |
| Qwen3-32B† | Parameter | Accuracy | 100.0 | N/A | 100.0 | 100.0 |
| Step-Audio 2 | Trigger | Precision / Recall | 86.8 / 99.5 | 96.9 / 98.4 | 92.2 / 100.0 | 88.4 / 95.5 |
| Step-Audio 2 | Type | Accuracy | 100.0 | 100.0 | 90.5 | 98.4 |
| Step-Audio 2 | Parameter | Accuracy | 100.0 | N/A | 100.0 | 100.0 |
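The trigger rows report precision and recall for the decision of whether to call a given tool. The sketch below applies the standard definitions to per-sample boolean labels. The function name and the toy labels are hypothetical, and the benchmark's actual scoring protocol is not described in this section.

```python
def trigger_precision_recall(should_call, did_call):
    """Precision and recall (in percent) for one tool's trigger decisions.

    should_call: per-sample ground truth, True if the tool should be triggered.
    did_call:    per-sample prediction, True if the model triggered the tool.
    """
    pairs = list(zip(should_call, did_call))
    tp = sum(1 for s, d in pairs if s and d)        # correctly triggered
    fp = sum(1 for s, d in pairs if not s and d)    # triggered when it should not
    fn = sum(1 for s, d in pairs if s and not d)    # missed a required trigger
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy labels (not benchmark data): 3 true positives, 1 false positive, 0 misses.
    truth = [True, True, True, False, False]
    preds = [True, True, True, True, False]
    p, r = trigger_precision_recall(truth, preds)
    print(f"precision={p:.1f} recall={r:.1f}")
```

High recall with lower precision, as in the toy labels, means the tool is rarely missed but is sometimes triggered unnecessarily.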
URO-Bench. U., R., and O. stand for Understanding, Reasoning, and Oral conversation, respectively.
| Model | Language | Basic Avg. | Basic U. | Basic R. | Basic O. | Pro Avg. | Pro U. | Pro R. | Pro O. |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | Chinese | 78.59 | 89.40 | 65.48 | 85.24 | 67.10 | 70.60 | 57.22 | 70.20 |
| Kimi-Audio | Chinese | 73.59 | 79.34 | 64.66 | 79.75 | 66.07 | 60.44 | 59.29 | 76.21 |
| Qwen-Omni | Chinese | 68.98 | 59.66 | 69.74 | 77.27 | 59.11 | 59.01 | 59.82 | 58.74 |
| Step-Audio-AQAA | Chinese | 74.71 | 87.61 | 59.63 | 81.93 | 65.61 | 74.76 | 47.29 | 68.97 |
| Step-Audio 2 | Chinese | 83.32 | 91.05 | 75.45 | 86.08 | 68.25 | 74.78 | 63.18 | 65.10 |
| Step-Audio 2 mini | Chinese | 77.81 | 89.19 | 64.53 | 84.12 | 69.57 | 76.84 | 58.90 | 69.42 |
| GPT-4o Audio | English | 84.54 | 90.18 | 75.90 | 90.41 | 67.51 | 60.65 | 64.36 | 78.46 |
| Kimi-Audio | English | 60.04 | 83.36 | 42.31 | 60.36 | 49.79 | 50.32 | 40.59 | 56.04 |
| Qwen-Omni | English | 70.58 | 66.29 | 69.62 | 76.16 | 50.99 | 44.51 | 63.88 | 49.41 |
| Step-Audio-AQAA | English | 71.11 | 90.15 | 56.12 | 72.06 | 52.01 | 44.25 | 54.54 | 59.81 |
| Step-Audio 2 | English | 83.90 | 92.72 | 76.51 | 84.92 | 66.07 | 64.86 | 67.75 | 66.33 |
| Step-Audio 2 mini | English | 74.36 | 90.07 | 60.12 | 77.65 | 61.25 | 58.79 | 61.94 | 63.80 |
The model and code in this repository are licensed under the Apache 2.0 License.
@misc{wu2025stepaudio2technicalreport,
title={Step-Audio 2 Technical Report},
author={Boyong Wu and Chao Yan and Chen Hu and Cheng Yi and Chengli Feng and Fei Tian and Feiyu Shen and Gang Yu and Haoyang Zhang and Jingbei Li and Mingrui Chen and Peng Liu and Wang You and Xiangyu Tony Zhang and Xingyuan Li and Xuerui Yang and Yayue Deng and Yechang Huang and Yuxin Li and Yuxin Zhang and Zhao You and Brian Li and Changyi Wan and Hanpeng Hu and Jiangjie Zhen and Siyu Chen and Song Yuan and Xuelin Zhang and Yimin Jiang and Yu Zhou and Yuxiang Yang and Bingxin Li and Buyun Ma and Changhe Song and Dongqing Pang and Guoqiang Hu and Haiyang Sun and Kang An and Na Wang and Shuli Gao and Wei Ji and Wen Li and Wen Sun and Xuan Wen and Yong Ren and Yuankai Ma and Yufan Lu and Bin Wang and Bo Li and Changxin Miao and Che Liu and Chen Xu and Dapeng Shi and Dingyuan Hu and Donghang Wu and Enle Liu and Guanzhe Huang and Gulin Yan and Han Zhang and Hao Nie and Haonan Jia and Hongyu Zhou and Jianjian Sun and Jiaoren Wu and Jie Wu and Jie Yang and Jin Yang and Junzhe Lin and Kaixiang Li and Lei Yang and Liying Shi and Li Zhou and Longlong Gu and Ming Li and Mingliang Li and Mingxiao Li and Nan Wu and Qi Han and Qinyuan Tan and Shaoliang Pang and Shengjie Fan and Siqi Liu and Tiancheng Cao and Wanying Lu and Wenqing He and Wuxun Xie and Xu Zhao and Xueqi Li and Yanbo Yu and Yang Yang and Yi Liu and Yifan Lu and Yilei Wang and Yuanhao Ding and Yuanwei Liang and Yuanwei Lu and Yuchu Luo and Yuhe Yin and Yumeng Zhan and Yuxiang Zhang and Zidong Yang and Zixin Zhang and Binxing Jiao and Daxin Jiang and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Xiangyu Zhang and Yibo Zhu},
year={2025},
eprint={2507.16632},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.16632},
}