StepFun/Step-Audio-2-mini-Base

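Introduction

Step-Audio 2 is an end-to-end multimodal large language model designed for industry-strength audio understanding and speech conversation.

  • Advanced speech and audio understanding: strong performance in automatic speech recognition (ASR) and audio understanding, achieved by comprehending and reasoning over semantic, paralinguistic, and non-vocal information.

  • Intelligent speech conversation: natural, intelligent interactions that adapt to the context of diverse conversational scenarios and to paralinguistic information.

  • Tool calling and multimodal retrieval-augmented generation (RAG): by using tool calling and RAG to access real-world textual and acoustic knowledge, Step-Audio 2 generates responses with fewer hallucinations across diverse scenarios, and can switch timbre based on retrieved speech.

  • State-of-the-art performance: state-of-the-art results on a range of audio understanding and conversation benchmarks compared with other open-source and commercial solutions (see Evaluation and the Technical Report).

  • Open-source availability: Step-Audio 2 mini and Step-Audio 2 mini Base are released under the Apache 2.0 license.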

Model Download

Huggingface

| Model | 🤗 Hugging Face |
|---|---|
| Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini |
| Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base |
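
As an alternative to cloning with git, the checkpoints can likely be fetched from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and using its `snapshot_download` function; the `MODEL_REPOS` mapping, the `download_model` helper, and the target directory layout are illustrative names that do not come from the Step-Audio 2 repository.

```python
from pathlib import Path

# Repository IDs as listed in the Model Download section.
MODEL_REPOS = {
    "mini": "stepfun-ai/Step-Audio-2-mini",
    "mini-base": "stepfun-ai/Step-Audio-2-mini-Base",
}


def download_model(variant="mini-base", target_root="."):
    """Download one checkpoint into <target_root>/<repo name> and return its path.

    Hypothetical helper: the name and directory layout are assumptions.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = MODEL_REPOS[variant]
    local_dir = Path(target_root) / repo_id.split("/")[-1]
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling `download_model("mini-base")` from inside the code checkout should place the weights in a `Step-Audio-2-mini-Base` directory, which is the same name a git clone of that repository produces.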

Model Usage

🔧 Dependencies and Installation

  • Python >= 3.10
  • PyTorch >= 2.3-cu121
  • CUDA Toolkit
conda create -n stepaudio2 python=3.10
conda activate stepaudio2
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml

git clone https://github.com/stepfun-ai/Step-Audio2.git
cd Step-Audio2
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base
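
Before running inference, it can help to confirm that the environment meets the version requirements listed above. The following is a minimal sketch using only the Python standard library; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are illustrative and not part of the Step-Audio 2 code base, and the check covers only Python and PyTorch versions, not the CUDA Toolkit.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple.

    A local suffix such as "+cu121" is ignored, so "2.3.0+cu121" -> (2, 3, 0).
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch_version, "2.3"):
            problems.append(f"PyTorch >= 2.3 is required, found {torch_version}")
    return problems
```

Printing the result of `check_environment()` inside the `stepaudio2` conda environment lists any requirement that is not met.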

🚀 Inference Scripts

python examples-base.py

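Online Demonstration

StepFun realtime console

  • Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console, with the web search tool enabled.
  • You will need an API key from the StepFun Open Platform.

StepFun AI Assistant

  • Step-Audio 2 is also available in our StepFun AI Assistant mobile app, with both web and audio search tools enabled.
  • Scan the following QR code to download the app from your app store, then tap the phone icon in the top-right corner.
QR code

WeChat group

You can scan the following QR code to join our WeChat group for communication and discussion.

QR code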

Evaluation

Architecture

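Automatic speech recognition

Character error rate (CER) is reported for Chinese, Cantonese, and Japanese; word error rate (WER) is reported for Arabian and English. N/A indicates that the language is not supported.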

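| Category | Test set | Doubao LLM ASR | GPT-4o Transcribe | Kimi-Audio | Qwen-Omni | Step-Audio 2 | Step-Audio 2 mini |
|---|---|---|---|---|---|---|---|
| English | Common Voice | 9.20 | 9.30 | 7.83 | 8.33 | 5.95 | 6.76 |
| | FLEURS English | 7.22 | 2.71 | 4.47 | 5.05 | 3.03 | 3.05 |
| | LibriSpeech clean | 2.92 | 1.75 | 1.49 | 2.93 | 1.17 | 1.33 |
| | LibriSpeech other | 5.32 | 4.23 | 2.91 | 5.07 | 2.42 | 2.86 |
| | Average | 6.17 | 4.50 | 4.18 | 5.35 | 3.14 | 3.50 |
| Chinese | AISHELL | 0.98 | 3.52 | 0.64 | 1.17 | 0.63 | 0.78 |
| | AISHELL-2 | 3.10 | 4.26 | 2.67 | 2.40 | 2.10 | 2.16 |
| | FLEURS Chinese | 2.92 | 2.62 | 2.91 | 7.01 | 2.68 | 2.53 |
| | KeSpeech phase1 | 6.48 | 26.80 | 5.11 | 6.45 | 3.63 | 3.97 |
| | WenetSpeech meeting | 4.90 | 31.40 | 5.21 | 6.61 | 4.75 | 4.87 |
| | WenetSpeech net | 4.46 | 15.71 | 5.93 | 5.24 | 4.67 | 4.82 |
| | Average | 3.81 | 14.05 | 3.75 | 4.81 | 3.08 | 3.19 |
| Multilingual | FLEURS Arabian | N/A | 11.72 | N/A | 25.13 | 14.22 | 16.46 |
| | Common Voice yue | 9.20 | 11.10 | 38.90 | 7.89 | 7.90 | 8.32 |
| | FLEURS Japanese | N/A | 3.27 | N/A | 10.49 | 3.18 | 4.67 |
| In-house | Anhui accent | 8.83 | 50.55 | 22.17 | 18.73 | 10.61 | 11.65 |
| | Guangdong accent | 4.99 | 7.83 | 3.76 | 4.03 | 3.81 | 4.44 |
| | Guangxi accent | 3.37 | 7.09 | 4.29 | 3.35 | 4.11 | 3.51 |
| | Shanxi accent | 20.26 | 55.03 | 34.71 | 25.95 | 12.44 | 15.60 |
| | Sichuan dialect | 3.01 | 32.85 | 5.26 | 5.61 | 4.35 | 4.57 |
| | Shanghai dialect | 47.49 | 89.58 | 82.90 | 58.74 | 17.77 | 19.30 |
| | Average | 14.66 | 40.49 | 25.52 | 19.40 | 8.85 | 9.85 |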
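
For readers unfamiliar with the CER and WER metrics reported in this section, the sketch below shows the conventional definition: the edit distance between reference and hypothesis, divided by the reference length, expressed as a percentage. It is an illustration only. The text normalization and scoring tools used to produce the reported numbers are not described here, and the function names and example sentences are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference unit
                current[j - 1] + 1,      # insertion of a hypothesis unit
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are characters, with spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


# One substituted word and one dropped word out of six reference words.
english_wer = wer("the cat sat on the mat", "the cat sit on mat")
# One wrong character out of six reference characters.
chinese_cer = cer("今天天气很好", "今天天汽很好")
```

Character-level scoring is the usual choice for languages written without spaces between words, which is consistent with this section reporting CER for Chinese, Cantonese, and Japanese.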

Paralinguistic information understanding

StepEval-Audio-Paralinguistic

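| Model | Avg. | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Kimi-Audio | 49.64 | 94 | 50 | 10 | 30 | 48 | 66 | 56 | 40 | 44 | 54 | 54 |
| Qwen-Omni | 44.18 | 40 | 50 | 16 | 28 | 42 | 76 | 32 | 54 | 50 | 50 | 48 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| Step-Audio 2 | 83.09 | 100 | 96 | 82 | 78 | 60 | 86 | 82 | 86 | 88 | 88 | 68 |
| Step-Audio 2 mini | 80.00 | 100 | 94 | 80 | 78 | 60 | 82 | 82 | 68 | 74 | 86 | 76 |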
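
The reported averages in this table are consistent with an unweighted mean over the eleven per-dimension scores. The short sketch below reproduces the averages for the two Step-Audio 2 rows from the values in the table; the variable and function names are illustrative, and the claim that the benchmark averages this way is inferred from the numbers rather than stated by the authors.

```python
# Per-dimension scores copied from the table, in column order:
# gender, age, timbre, scenario, event, emotion, pitch, rhythm, speed, style, vocal.
step_audio_2 = [100, 96, 82, 78, 60, 86, 82, 86, 88, 88, 68]
step_audio_2_mini = [100, 94, 80, 78, 60, 82, 82, 68, 74, 86, 76]


def unweighted_average(scores):
    """Arithmetic mean of the per-dimension scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


avg_full = unweighted_average(step_audio_2)       # 83.09, matching the table
avg_mini = unweighted_average(step_audio_2_mini)  # 80.0, matching the table
```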

Audio understanding and reasoning

MMAU

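| Model | Avg. | Sound | Speech | Music |
|---|---|---|---|---|
| Audio Flamingo 3 | 73.1 | 76.9 | 66.1 | 73.9 |
| Gemini 2.5 Pro | 71.6 | 75.1 | 71.5 | 68.3 |
| GPT-4o Audio | 58.1 | 58.0 | 64.6 | 51.8 |
| Kimi-Audio | 69.6 | 79.0 | 65.5 | 64.4 |
| Omni-R1 | 77.0 | 81.7 | 76.0 | 73.4 |
| Qwen2.5-Omni | 71.5 | 78.1 | 70.6 | 65.9 |
| Step-Audio-AQAA | 49.7 | 50.5 | 51.4 | 47.3 |
| Step-Audio 2 | 78.0 | 83.5 | 76.9 | 73.7 |
| Step-Audio 2 mini | 73.2 | 76.6 | 71.5 | 71.6 |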

Speech translation

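CoVoST 2 (speech-to-text)

| Model | Avg. | English-to-Chinese | Chinese-to-English |
|---|---|---|---|
| GPT-4o Audio | 29.61 | 40.20 | 19.01 |
| Qwen2.5-Omni | 35.40 | 41.40 | 29.40 |
| Step-Audio-AQAA | 28.57 | 37.71 | 19.43 |
| Step-Audio 2 | 39.26 | 49.01 | 29.51 |
| Step-Audio 2 mini | 39.29 | 49.12 | 29.47 |

CVSS (speech-to-speech)

| Model | Avg. | English-to-Chinese | Chinese-to-English |
|---|---|---|---|
| GPT-4o Audio | 23.68 | 20.07 | 27.29 |
| Qwen-Omni | 15.35 | 8.04 | 22.66 |
| Step-Audio-AQAA | 27.36 | 30.74 | 23.98 |
| Step-Audio 2 | 30.87 | 34.83 | 26.92 |
| Step-Audio 2 mini | 29.08 | 32.81 | 25.35 |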

Tool calling

StepEval-Audio-Toolcall. The date and time tool takes no parameters.

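| Model | Objective | Metric | Audio search | Date & Time | Weather | Web search |
|---|---|---|---|---|---|---|
| Qwen3-32B† | Trigger | Precision / Recall | 67.5 / 98.5 | 98.4 / 100.0 | 90.1 / 100.0 | 86.8 / 98.5 |
| | Type | Accuracy | 100.0 | 100.0 | 98.5 | 98.5 |
| | Parameter | Accuracy | 100.0 | N/A | 100.0 | 100.0 |
| Step-Audio 2 | Trigger | Precision / Recall | 86.8 / 99.5 | 96.9 / 98.4 | 92.2 / 100.0 | 88.4 / 95.5 |
| | Type | Accuracy | 100.0 | 100.0 | 90.5 | 98.4 |
| | Parameter | Accuracy | 100.0 | N/A | 100.0 | 100.0 |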
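
To make the trigger metric concrete, the sketch below computes precision and recall in the conventional way from per-utterance decisions about whether a tool should be called and whether the model called it. The exact scoring protocol of StepEval-Audio-Toolcall is not described here, so the function name, the boolean-list input format, and the example data are assumptions for illustration.

```python
def trigger_precision_recall(expected, predicted):
    """Return (precision, recall) in percent for tool-trigger decisions.

    expected[i]  -- True if a tool call was warranted for utterance i
    predicted[i] -- True if the model actually triggered a tool call
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    true_pos = sum(1 for e, p in zip(expected, predicted) if e and p)
    false_pos = sum(1 for e, p in zip(expected, predicted) if not e and p)
    false_neg = sum(1 for e, p in zip(expected, predicted) if e and not p)
    called = true_pos + false_pos      # every time the model triggered
    warranted = true_pos + false_neg   # every time it should have triggered
    precision = 100.0 * true_pos / called if called else 0.0
    recall = 100.0 * true_pos / warranted if warranted else 0.0
    return precision, recall


# Made-up decisions for six utterances: one spurious call and one missed call.
expected = [True, True, True, False, False, True]
predicted = [True, True, False, True, False, True]
precision, recall = trigger_precision_recall(expected, predicted)
```

Under this reading, a low precision with a high recall, such as 67.5 / 98.5, would mean the tool is rarely missed when needed but is often called when it is not.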

Speech-to-speech conversation

URO-Bench. U., R., and O. stand for Understanding, Reasoning, and Oral conversation, respectively.

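| Model | Language | Basic Avg. | Basic U. | Basic R. | Basic O. | Pro Avg. | Pro U. | Pro R. | Pro O. |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o Audio | Chinese | 78.59 | 89.40 | 65.48 | 85.24 | 67.10 | 70.60 | 57.22 | 70.20 |
| Kimi-Audio | | 73.59 | 79.34 | 64.66 | 79.75 | 66.07 | 60.44 | 59.29 | 76.21 |
| Qwen-Omni | | 68.98 | 59.66 | 69.74 | 77.27 | 59.11 | 59.01 | 59.82 | 58.74 |
| Step-Audio-AQAA | | 74.71 | 87.61 | 59.63 | 81.93 | 65.61 | 74.76 | 47.29 | 68.97 |
| Step-Audio 2 | | 83.32 | 91.05 | 75.45 | 86.08 | 68.25 | 74.78 | 63.18 | 65.10 |
| Step-Audio 2 mini | | 77.81 | 89.19 | 64.53 | 84.12 | 69.57 | 76.84 | 58.90 | 69.42 |
| GPT-4o Audio | English | 84.54 | 90.18 | 75.90 | 90.41 | 67.51 | 60.65 | 64.36 | 78.46 |
| Kimi-Audio | | 60.04 | 83.36 | 42.31 | 60.36 | 49.79 | 50.32 | 40.59 | 56.04 |
| Qwen-Omni | | 70.58 | 66.29 | 69.62 | 76.16 | 50.99 | 44.51 | 63.88 | 49.41 |
| Step-Audio-AQAA | | 71.11 | 90.15 | 56.12 | 72.06 | 52.01 | 44.25 | 54.54 | 59.81 |
| Step-Audio 2 | | 83.90 | 92.72 | 76.51 | 84.92 | 66.07 | 64.86 | 67.75 | 66.33 |
| Step-Audio 2 mini | | 74.36 | 90.07 | 60.12 | 77.65 | 61.25 | 58.79 | 61.94 | 63.80 |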

License

The model and code in this repository are licensed under the Apache 2.0 License.

Citation

@misc{wu2025stepaudio2technicalreport,
      title={Step-Audio 2 Technical Report},
      author={Boyong Wu and Chao Yan and Chen Hu and Cheng Yi and Chengli Feng and Fei Tian and Feiyu Shen and Gang Yu and Haoyang Zhang and Jingbei Li and Mingrui Chen and Peng Liu and Wang You and Xiangyu Tony Zhang and Xingyuan Li and Xuerui Yang and Yayue Deng and Yechang Huang and Yuxin Li and Yuxin Zhang and Zhao You and Brian Li and Changyi Wan and Hanpeng Hu and Jiangjie Zhen and Siyu Chen and Song Yuan and Xuelin Zhang and Yimin Jiang and Yu Zhou and Yuxiang Yang and Bingxin Li and Buyun Ma and Changhe Song and Dongqing Pang and Guoqiang Hu and Haiyang Sun and Kang An and Na Wang and Shuli Gao and Wei Ji and Wen Li and Wen Sun and Xuan Wen and Yong Ren and Yuankai Ma and Yufan Lu and Bin Wang and Bo Li and Changxin Miao and Che Liu and Chen Xu and Dapeng Shi and Dingyuan Hu and Donghang Wu and Enle Liu and Guanzhe Huang and Gulin Yan and Han Zhang and Hao Nie and Haonan Jia and Hongyu Zhou and Jianjian Sun and Jiaoren Wu and Jie Wu and Jie Yang and Jin Yang and Junzhe Lin and Kaixiang Li and Lei Yang and Liying Shi and Li Zhou and Longlong Gu and Ming Li and Mingliang Li and Mingxiao Li and Nan Wu and Qi Han and Qinyuan Tan and Shaoliang Pang and Shengjie Fan and Siqi Liu and Tiancheng Cao and Wanying Lu and Wenqing He and Wuxun Xie and Xu Zhao and Xueqi Li and Yanbo Yu and Yang Yang and Yi Liu and Yifan Lu and Yilei Wang and Yuanhao Ding and Yuanwei Liang and Yuanwei Lu and Yuchu Luo and Yuhe Yin and Yumeng Zhan and Yuxiang Zhang and Zidong Yang and Zixin Zhang and Binxing Jiao and Daxin Jiang and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Xiangyu Zhang and Yibo Zhu},
      year={2025},
      eprint={2507.16632},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.16632},
}