
Qwen2.5-Omni


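Overview

Introduction

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive text, images, audio, and video, while generating text and natural speech responses in a streaming manner.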

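Key Features

  • Omni and novel architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal design that perceives text, images, audio, and video while generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.

  • Real-time voice and video chat: The architecture is designed for fully real-time interaction and supports chunked input and immediate output.

  • Natural and robust speech generation: Qwen2.5-Omni surpasses many existing streaming and non-streaming alternatives in the robustness and naturalness of its generated speech.

  • Strong performance across modalities: Qwen2.5-Omni performs strongly on every modality when benchmarked against similarly sized single-modality models. It outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves performance comparable to Qwen2.5-VL-7B.

  • Excellent end-to-end speech instruction following: Qwen2.5-Omni follows spoken instructions end to end about as well as it follows text input, as shown by benchmarks such as MMLU and GSM8K.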

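Model Architecture

Performance

We conducted a comprehensive evaluation of Qwen2.5-Omni. It performs strongly across all modalities when compared with similarly sized single-modality models such as Qwen2.5-VL-7B and Qwen2-Audio, and with closed-source models such as Gemini-1.5-Pro. On tasks that require integrating multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. On single-modality tasks, it excels in speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).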

Multimodality -> Text

OmniBench

| Model | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| Gemini-1.5-Pro | 42.67% | 42.26% | 46.23% | 42.91% |
| MIO-Instruct | 36.96% | 33.58% | 11.32% | 33.80% |
| AnyGPT (7B) | 17.77% | 20.75% | 13.21% | 18.04% |
| video-SALMONN | 34.11% | 31.70% | 56.60% | 35.64% |
| UnifiedIO2-xlarge | 39.56% | 36.98% | 29.25% | 38.00% |
| UnifiedIO2-xxlarge | 34.24% | 36.98% | 24.53% | 33.98% |
| MiniCPM-o | - | - | - | 40.50% |
| Baichuan-Omni-1.5 | - | - | - | 42.90% |
| Qwen2.5-Omni-3B | 52.14% | 52.08% | 52.83% | 52.19% |
| Qwen2.5-Omni-7B | 55.25% | 60.00% | 52.83% | 56.13% |
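
Audio -> Text

Speech recognition (ASR)

Librispeech

| Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| SALMONN | - | - | 2.1 | 4.9 |
| SpeechVerse | - | - | 2.1 | 4.4 |
| Whisper-large-v3 | - | - | 1.8 | 3.6 |
| Llama-3-8B | - | - | - | 3.4 |
| Llama-3-70B | - | - | - | 3.1 |
| Seed-ASR-Multilingual | - | - | 1.6 | 2.8 |
| MiniCPM-o | - | - | 1.7 | - |
| MinMo | - | - | 1.7 | 3.9 |
| Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |
| Qwen2-Audio | 1.3 | 3.4 | 1.6 | 3.6 |
| Qwen2.5-Omni-3B | 2.0 | 4.1 | 2.2 | 4.5 |
| Qwen2.5-Omni-7B | 1.6 | 3.5 | 1.8 | 3.4 |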
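
Common Voice 15

| Model | English | Chinese | Cantonese | French |
|---|---|---|---|---|
| Whisper-large-v3 | 9.3 | 12.8 | 10.9 | 10.8 |
| MinMo | 7.9 | 6.3 | 6.4 | 8.5 |
| Qwen2-Audio | 8.6 | 6.9 | 5.9 | 9.6 |
| Qwen2.5-Omni-3B | 9.1 | 6.0 | 11.6 | 9.6 |
| Qwen2.5-Omni-7B | 7.6 | 5.2 | 7.3 | 7.5 |

Fleurs

| Model | Chinese | English |
|---|---|---|
| Whisper-large-v3 | 7.7 | 4.1 |
| Seed-ASR-Multilingual | - | 3.4 |
| Megrez-3B-Omni | 10.8 | - |
| MiniCPM-o | 4.4 | - |
| MinMo | 3.0 | 3.8 |
| Qwen2-Audio | 7.5 | - |
| Qwen2.5-Omni-3B | 3.2 | 5.4 |
| Qwen2.5-Omni-7B | 3.0 | 4.1 |

Wenetspeech

| Model | test-net | test-meeting |
|---|---|---|
| Seed-ASR-Chinese | 4.7 | 5.7 |
| Megrez-3B-Omni | - | 16.4 |
| MiniCPM-o | 6.9 | - |
| MinMo | 6.8 | 7.4 |
| Qwen2.5-Omni-3B | 6.3 | 8.1 |
| Qwen2.5-Omni-7B | 5.9 | 7.7 |

Voxpopuli-V1.0-en

| Model | Performance |
|---|---|
| Llama-3-8B | 6.2 |
| Llama-3-70B | 5.7 |
| Qwen2.5-Omni-3B | 6.6 |
| Qwen2.5-Omni-7B | 5.8 |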
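
The ASR tables above give one number per test split without naming the metric. Assuming these are word error rates in percent (lower is better), which is the usual convention for these benchmarks but is not stated in this document, the sketch below shows how such a score can be computed. The name `word_error_rate` is illustrative and is not part of any Qwen tooling; Chinese test sets are typically scored per character rather than per whitespace-separated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {score:.2f}%")
```

Real evaluations normally normalize text (casing, punctuation, number formatting) before scoring; that step is omitted here.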
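
Speech-to-text translation (S2TT)

CoVoST2

| Model | en-de | de-en | en-zh | zh-en |
|---|---|---|---|---|
| SALMONN | 18.6 | - | 33.1 | - |
| SpeechLLaMA | - | 27.1 | - | 12.3 |
| BLSP | 14.1 | - | - | - |
| MiniCPM-o | - | - | 48.2 | 27.2 |
| MinMo | - | 39.9 | 46.7 | 26.0 |
| Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 |
| Qwen2-Audio | 29.9 | 35.2 | 45.2 | 24.4 |
| Qwen2.5-Omni-3B | 28.3 | 38.1 | 41.4 | 26.6 |
| Qwen2.5-Omni-7B | 30.2 | 37.7 | 41.4 | 29.4 |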
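
Speech emotion recognition (SER)

Meld

| Model | Performance |
|---|---|
| WavLM-large | 0.542 |
| MiniCPM-o | 0.524 |
| Qwen-Audio | 0.557 |
| Qwen2-Audio | 0.553 |
| Qwen2.5-Omni-3B | 0.558 |
| Qwen2.5-Omni-7B | 0.570 |

Vocal sound classification (VSC)

VocalSound

| Model | Performance |
|---|---|
| CLAP | 0.495 |
| Pengi | 0.604 |
| Qwen-Audio | 0.929 |
| Qwen2-Audio | 0.939 |
| Qwen2.5-Omni-3B | 0.936 |
| Qwen2.5-Omni-7B | 0.939 |

Music

GiantSteps Tempo

| Model | Performance |
|---|---|
| Llark-7B | 0.86 |
| Qwen2.5-Omni-3B | 0.88 |
| Qwen2.5-Omni-7B | 0.88 |

MusicCaps

| Model | Performance |
|---|---|
| LP-MusicCaps | 0.291 / 0.149 / 0.089 / 0.061 / 0.129 / 0.130 |
| Qwen2.5-Omni-3B | 0.325 / 0.163 / 0.093 / 0.057 / 0.132 / 0.229 |
| Qwen2.5-Omni-7B | 0.328 / 0.162 / 0.090 / 0.055 / 0.127 / 0.225 |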
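
Audio reasoning

MMAU

| Model | Sound | Music | Speech | Avg |
|---|---|---|---|---|
| Gemini-Pro-V1.5 | 56.75 | 49.40 | 58.55 | 54.90 |
| Qwen2-Audio | 54.95 | 50.98 | 42.04 | 49.20 |
| Qwen2.5-Omni-3B | 70.27 | 60.48 | 59.16 | 63.30 |
| Qwen2.5-Omni-7B | 67.87 | 69.16 | 59.76 | 65.60 |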

Voice chatting

VoiceBench

| Model | AlpacaEval | CommonEval | SD-QA | MMSU |
|---|---|---|---|---|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 |
| MERaLiON | 4.50 | 3.77 | 55.06 | 34.95 |
| Megrez-3B-Omni | 3.50 | 2.95 | 25.95 | 27.03 |
| Lyra-Base | 3.85 | 3.50 | 38.25 | 49.74 |
| MiniCPM-o | 4.42 | 4.15 | 50.72 | 54.78 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 43.40 | 57.25 |
| Qwen2-Audio | 3.74 | 3.43 | 35.71 | 35.72 |
| Qwen2.5-Omni-3B | 4.32 | 4.00 | 49.37 | 50.23 |
| Qwen2.5-Omni-7B | 4.49 | 3.93 | 55.71 | 61.32 |

VoiceBench (continued)

| Model | OpenBookQA | IFEval | AdvBench | Avg |
|---|---|---|---|---|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 65.27 | 66.88 | 98.46 | 71.45 |
| MERaLiON | 27.23 | 62.93 | 94.81 | 62.91 |
| Megrez-3B-Omni | 28.35 | 25.71 | 87.69 | 46.25 |
| Lyra-Base | 72.75 | 36.28 | 59.62 | 57.66 |
| MiniCPM-o | 78.02 | 49.25 | 97.69 | 71.69 |
| Baichuan-Omni-1.5 | 74.51 | 54.54 | 97.31 | 71.14 |
| Qwen2-Audio | 49.45 | 26.33 | 96.73 | 55.35 |
| Qwen2.5-Omni-3B | 74.73 | 42.10 | 98.85 | 68.81 |
| Qwen2.5-Omni-7B | 81.10 | 52.87 | 99.42 | 74.12 |
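
Image -> Text

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMMU (val) | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 |
| MMMU-Pro (overall) | 36.6 | 29.7 | - | 38.3 | 37.6 |
| MathVista (testmini) | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 |
| MathVision (full) | 25.0 | 20.8 | 23.1 | 25.1 | - |
| MMBench-V1.1-EN (test) | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 |
| MMVet (turbo) | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 |
| MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 |
| MME (sum) | 2340 | 2117 | 2372 | 2347 | 2003 |
| MuirBench | 59.2 | 48.0 | - | 59.2 | - |
| CRPE (relation) | 76.5 | 73.7 | - | 76.4 | - |
| RealWorldQA (avg) | 70.3 | 62.6 | 71.9 | 68.5 | - |
| MME-RealWorld (en) | 61.6 | 55.6 | - | 57.4 | - |
| MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - |
| AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - |
| TextVQA (val) | 84.4 | 79.8 | 83.2 | 84.9 | - |
| DocVQA (test) | 95.2 | 93.3 | 93.5 | 95.7 | - |
| ChartQA (test Avg) | 85.3 | 82.8 | 84.9 | 87.3 | - |
| OCRBench_V2 (en) | 57.8 | 51.7 | - | 56.3 | - |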
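
| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|---|---|---|---|---|---|
| Refcoco (val) | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 |
| Refcoco (testA) | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 |
| Refcoco (testB) | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 |
| Refcoco+ (val) | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 |
| Refcoco+ (testA) | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 |
| Refcoco+ (testB) | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 |
| Refcocog+ (val) | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 |
| Refcocog+ (test) | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 |
| ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 |
| PointGrounding | 66.5 | 46.2 | 67.3 | - | - |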
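
Video (without audio) -> Text

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|---|---|---|---|---|---|
| Video-MME (w/o sub) | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 |
| Video-MME (w sub) | 72.4 | 68.6 | 67.9 | 71.6 | - |
| MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - |
| EgoSchema (test) | 68.6 | 61.4 | 63.2 | 65.0 | - |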

Zero-shot speech generation

Content consistency

SEED

| Model | test-zh | test-en | test-hard |
|---|---|---|---|
| Seed-TTS_ICL | 1.11 | 2.24 | 7.58 |
| Seed-TTS_RL | 1.00 | 1.94 | 6.42 |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| E2_TTS | 1.97 | 2.19 | - |
| F5-TTS | 1.56 | 1.83 | 8.67 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| CosyVoice 2-S | 1.45 | 2.38 | 8.08 |
| Qwen2.5-Omni-3B_ICL | 1.95 | 2.87 | 9.92 |
| Qwen2.5-Omni-3B_RL | 1.58 | 2.51 | 7.86 |
| Qwen2.5-Omni-7B_ICL | 1.70 | 2.72 | 7.97 |
| Qwen2.5-Omni-7B_RL | 1.42 | 2.32 | 6.54 |

Speaker similarity

SEED

| Model | test-zh | test-en | test-hard |
|---|---|---|---|
| Seed-TTS_ICL | 0.796 | 0.762 | 0.776 |
| Seed-TTS_RL | 0.801 | 0.766 | 0.782 |
| MaskGCT | 0.774 | 0.714 | 0.748 |
| E2_TTS | 0.730 | 0.710 | - |
| F5-TTS | 0.741 | 0.647 | 0.713 |
| CosyVoice 2 | 0.748 | 0.652 | 0.724 |
| CosyVoice 2-S | 0.753 | 0.654 | 0.732 |
| Qwen2.5-Omni-3B_ICL | 0.741 | 0.635 | 0.748 |
| Qwen2.5-Omni-3B_RL | 0.744 | 0.635 | 0.746 |
| Qwen2.5-Omni-7B_ICL | 0.752 | 0.632 | 0.747 |
| Qwen2.5-Omni-7B_RL | 0.754 | 0.641 | 0.752 |
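
Text -> Text

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|---|---|---|---|---|---|---|---|
| MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 |
| MMLU-redux | 71.0 | 60.9 | 75.4 | 64 | 67.3 | 67.2 | 72.8 |
| LiveBench (0831) | 29.6 | 22.3 | 35.9 | 26.8 | 29.2 | 26.7 | 30.6 |
| GPQA | 30.8 | 34.3 |  | 30.3 | 34.3 | 32.8 | 32.8 |
| MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 |
| GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 |
| HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 |
| MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 |
| MultiPL-E | 65.8 |  | 70.4 | 60.2 |  | 50.7 |  |
| LiveCodeBench (2305-2409) | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 |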

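Quickstart

Below are simple examples showing how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code has been merged into the latest Hugging Face Transformers, and we recommend building from source with the following commands: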

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate

Otherwise, you may encounter the following error:

KeyError: 'qwen2_5_omni'
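
Before loading the checkpoint, it can help to confirm that the installed Transformers build actually knows the `qwen2_5_omni` model type. The sketch below assumes the error above comes from the model type being absent from Transformers' auto-config registry (`CONFIG_MAPPING_NAMES`); the helper name `supports_qwen2_5_omni` is hypothetical.

```python
def supports_qwen2_5_omni() -> bool:
    """Return True if the installed transformers registers 'qwen2_5_omni'."""
    try:
        # Registry that maps model_type strings to config class names.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except ImportError:
        # transformers is missing or too old to expose the registry.
        return False
    return "qwen2_5_omni" in CONFIG_MAPPING_NAMES


if __name__ == "__main__":
    if supports_qwen2_5_omni():
        print("transformers recognizes qwen2_5_omni")
    else:
        print("qwen2_5_omni is not registered; reinstall transformers as shown above")
```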

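We offer a toolkit that helps you handle various types of audio and visual input more conveniently, as if you were calling an API. It supports base64, URLs, and interleaved audio, images, and videos. Install it with the following command, and make sure ffmpeg is installed on your system: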

# It's highly recommended to use `[decord]` feature for faster video loading.
pip install qwen-omni-utils[decord] -U

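If you are not using Linux, you may not be able to install decord from PyPI. In that case, use pip install qwen-omni-utils -U, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.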
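
A quick way to see which of the prerequisites mentioned above are present is to probe for them from Python. This is only a sketch: `check_video_prerequisites` is a hypothetical helper, and it checks for presence, not for working versions.

```python
import importlib.util
import shutil


def check_video_prerequisites() -> dict:
    """Report whether ffmpeg and the optional video readers can be found."""
    return {
        # ffmpeg must be discoverable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # decord is optional; without it the toolkit falls back to torchvision.
        "decord": importlib.util.find_spec("decord") is not None,
        "torchvision": importlib.util.find_spec("torchvision") is not None,
    }


if __name__ == "__main__":
    for name, found in check_video_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```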

🤗 Transformers Usage

Here is a code snippet showing how to use the chat model with transformers and qwen_omni_utils:

import soundfile as sf

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# default: Load the model on the available device(s)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-Omni-7B",
#     torch_dtype="auto",
#     device_map="auto",
#     attn_implementation="flash_attention_2",
# )

processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]

# set use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
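
Minimum GPU memory requirements

| Model | Precision | 15 s video | 30 s video | 60 s video |
|---|---|---|---|---|
| Qwen-Omni-3B | FP32 | 89.10 GB | Not recommended | Not recommended |
| Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
| Qwen-Omni-7B | FP32 | 93.56 GB | Not recommended | Not recommended |
| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |

Note: The table above gives the theoretical minimum memory required for inference with transformers; the BF16 figures were measured with attn_implementation="flash_attention_2". In practice, actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource.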
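
To turn the table and the 1.2x note into a rough budget check, the sketch below multiplies the theoretical minimum by 1.2 and compares it with the memory you have. The function names are hypothetical, and because the note says "at least 1.2 times", the result is a lower bound rather than a guarantee.

```python
# Theoretical minimums in GB, copied from the table above.
# None marks the "not recommended" cells.
MIN_MEMORY_GB = {
    ("Qwen-Omni-3B", "FP32"): {15: 89.10, 30: None, 60: None},
    ("Qwen-Omni-3B", "BF16"): {15: 18.38, 30: 22.43, 60: 28.22},
    ("Qwen-Omni-7B", "FP32"): {15: 93.56, 30: None, 60: None},
    ("Qwen-Omni-7B", "BF16"): {15: 31.11, 30: 41.85, 60: 60.19},
}
PRACTICAL_FACTOR = 1.2  # "at least 1.2 times higher" per the note


def practical_memory_gb(model: str, precision: str, video_seconds: int):
    """Return the lower-bound practical memory in GB, or None if not recommended."""
    theoretical = MIN_MEMORY_GB[(model, precision)][video_seconds]
    if theoretical is None:
        return None
    return theoretical * PRACTICAL_FACTOR


def fits(model: str, precision: str, video_seconds: int, available_gb: float) -> bool:
    """Return True if available_gb covers the lower-bound practical estimate."""
    needed = practical_memory_gb(model, precision, video_seconds)
    return needed is not None and available_gb >= needed


if __name__ == "__main__":
    needed = practical_memory_gb("Qwen-Omni-7B", "BF16", 15)
    print(f"Plan for at least {needed:.2f} GB")
    print("Fits in 40 GB:", fits("Qwen-Omni-7B", "BF16", 15, 40.0))
```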

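Video URL resource usage

Video URL compatibility largely depends on the version of the third-party library, as detailed in the table below. To override the default backend, set FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.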

| Backend | HTTP | HTTPS |
|---|---|---|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision < 0.19.0 | ❌ | ❌ |
| decord | ✅ | ❌ |
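
The compatibility table can be encoded as a small lookup that suggests a value for FORCE_QWENVL_VIDEO_READER. This is a sketch: `pick_video_backend` is a hypothetical helper, the caller supplies the installed torchvision version and whether decord is available, and the rules are exactly the three rows of the table.

```python
from typing import Optional
from urllib.parse import urlparse


def _version_tuple(version: str) -> tuple:
    """Parse a version such as '0.19.1+cu121' into (0, 19, 1)."""
    parts = []
    for piece in version.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def pick_video_backend(url: str, torchvision_version: Optional[str],
                       decord_available: bool) -> Optional[str]:
    """Return 'torchvision', 'decord', or None if no listed backend can read url."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError("the table only covers http:// and https:// URLs")
    # torchvision >= 0.19.0 handles both HTTP and HTTPS.
    if torchvision_version is not None and _version_tuple(torchvision_version) >= (0, 19, 0):
        return "torchvision"
    # decord handles HTTP only.
    if scheme == "http" and decord_available:
        return "decord"
    return None


if __name__ == "__main__":
    print(pick_video_backend("https://example.com/clip.mp4", "0.18.1", True))
```

If a backend is returned, it can be exported through the environment, for example os.environ["FORCE_QWENVL_VIDEO_READER"] = backend; it is safest to do this before the first video is loaded.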

Batch inference

When return_audio=False is set, the model can batch inputs made up of mixed sample types such as text, images, audio, and videos. Here is an example.

# Sample messages for batch inference

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": "who are you?"
    }
]


# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/image.jpg"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media?"},
        ],
    }
]

# Combine messages for batch processing
conversations = [conversation1, conversation2, conversation3, conversation4]

# set use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for batch inference
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch Inference
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)

Usage Tips

Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise, the audio output may not work as expected.

{
    "role": "system",
    "content": [
        {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
    ],
}
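
Because the system prompt has to match this exact string, keeping it in one constant and building conversations through a helper reduces the chance of a typo. The sketch below uses hypothetical names (`AUDIO_SYSTEM_PROMPT`, `build_conversation`, `has_audio_system_prompt`) and only produces the message structure used in the examples in this document.

```python
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)


def build_conversation(user_content, system_prompt: str = AUDIO_SYSTEM_PROMPT) -> list:
    """Return a two-message conversation: system prompt plus one user turn."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": list(user_content)},
    ]


def has_audio_system_prompt(conversation) -> bool:
    """Check that the first system message carries the required prompt verbatim."""
    for message in conversation:
        if message.get("role") != "system":
            continue
        content = message.get("content")
        if isinstance(content, str):
            return content == AUDIO_SYSTEM_PROMPT
        return any(
            item.get("type") == "text" and item.get("text") == AUDIO_SYSTEM_PROMPT
            for item in content
        )
    return False


if __name__ == "__main__":
    conversation = build_conversation([{"type": "video", "video": "/path/to/video.mp4"}])
    print(has_audio_system_prompt(conversation))
```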

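Use audio in video

In multimodal interaction, videos provided by users often come with audio (for example, a spoken question about the video content, or sounds produced by events in the video). This information helps the model provide a better interactive experience. We therefore provide the following options so users can decide whether to use the audio in a video.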

# first place, in data preprocessing
audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
# second place, in model processor
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", 
                   padding=True, use_audio_in_video=True)
#  third place, in model inference
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)

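Note that during a multi-round conversation, the use_audio_in_video parameter must be set to the same value in all of these places; otherwise, unexpected results may occur.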
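
One way to keep the three use_audio_in_video arguments from drifting apart is to route them through a single helper. The sketch below assumes that model, processor, and process_mm_info behave as in the snippets in this document; `generate_with_consistent_flag` is a hypothetical name, not part of transformers or qwen_omni_utils.

```python
def generate_with_consistent_flag(model, processor, mm_info_fn, conversation,
                                  use_audio_in_video: bool, **generate_kwargs):
    """Run preprocessing and generation with one shared use_audio_in_video value."""
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )
    # First place: data preprocessing.
    audios, images, videos = mm_info_fn(
        conversation, use_audio_in_video=use_audio_in_video
    )
    # Second place: the model processor.
    inputs = processor(
        text=text, audio=audios, images=images, videos=videos,
        return_tensors="pt", padding=True,
        use_audio_in_video=use_audio_in_video,
    )
    inputs = inputs.to(model.device).to(model.dtype)
    # Third place: model inference.
    return model.generate(
        **inputs, use_audio_in_video=use_audio_in_video, **generate_kwargs
    )
```

A call would then look like generate_with_consistent_flag(model, processor, process_mm_info, conversation, True), with extra keyword arguments such as return_audio=False forwarded to generate.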

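Use audio output or not

The model supports both text and audio output. If users do not need audio output, they can call model.disable_talker() after initializing the model. This saves about 2 GB of GPU memory, but the return_audio option of the generate function can then only be set to False.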

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
model.disable_talker()

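For a flexible experience, we recommend that users decide whether to return audio when calling generate. If return_audio is set to False, the model returns only text output, so text responses arrive faster.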

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
...
text_ids = model.generate(**inputs, return_audio=False)
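
Since generate returns a (text_ids, audio) pair in the earlier examples but only text_ids when return_audio=False, calling code that supports both modes needs to normalize the result. The helper below is a sketch based solely on the return shapes shown in this document; `split_generate_output` is a hypothetical name.

```python
def split_generate_output(output):
    """Normalize generate() results to a (text_ids, audio_or_None) pair."""
    if isinstance(output, tuple) and len(output) == 2:
        # Audio was returned alongside the text ids.
        text_ids, audio = output
        return text_ids, audio
    # Text-only mode: there is no audio to return.
    return output, None


if __name__ == "__main__":
    text_ids, audio = split_generate_output(("ids", "waveform"))
    print(text_ids, audio is not None)
```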

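Change the voice type of the output audio

Qwen2.5-Omni supports changing the voice of the output audio. The "Qwen/Qwen2.5-Omni-7B" checkpoint supports the following two voice types:

| Voice type | Gender | Description |
|---|---|---|
| Chelsie | Female | A sweet, soft voice with gentle warmth and bright clarity. |
| Ethan | Male | A bright, lively voice with infectious energy and a warm, approachable feel. |

Users can specify the voice type with the speaker parameter of the generate function. If speaker is not specified, the default voice type is Chelsie.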

text_ids, audio = model.generate(**inputs, speaker="Chelsie")
text_ids, audio = model.generate(**inputs, speaker="Ethan")
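
If the speaker name comes from user input or configuration, it may be worth validating it before calling generate. The sketch below hard-codes the two voices listed above; `resolve_speaker` is a hypothetical helper, and the case-insensitive matching is a convenience added here, not documented model behavior.

```python
SUPPORTED_SPEAKERS = ("Chelsie", "Ethan")
DEFAULT_SPEAKER = "Chelsie"


def resolve_speaker(name=None) -> str:
    """Return a supported speaker name, defaulting to Chelsie."""
    if name is None:
        return DEFAULT_SPEAKER
    wanted = name.strip().lower()
    for speaker in SUPPORTED_SPEAKERS:
        if speaker.lower() == wanted:
            return speaker
    raise ValueError(
        f"unsupported speaker {name!r}; choose one of {', '.join(SUPPORTED_SPEAKERS)}"
    )


if __name__ == "__main__":
    print(resolve_speaker("ethan"))
```

The validated name can then be passed on, for example model.generate(**inputs, speaker=resolve_speaker(requested_voice)).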

Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation

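Your hardware must also be compatible with FlashAttention 2. See the official documentation of the Flash Attention repository for details. FlashAttention-2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.

To load and run a model with FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model: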

import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
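
When the same script has to run on machines with and without flash-attn, the attention backend can be chosen at load time. The sketch below applies the two conditions stated above (package installed, half-precision dtype); `choose_attn_implementation` is a hypothetical helper, the "sdpa" fallback is an assumption of this example, and hardware compatibility is not checked.

```python
import importlib.util

HALF_PRECISION_DTYPES = {"float16", "bfloat16"}


def choose_attn_implementation(dtype_name: str, flash_attn_installed=None) -> str:
    """Return 'flash_attention_2' when its stated preconditions hold, else 'sdpa'."""
    if flash_attn_installed is None:
        # Detect the flash_attn package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if flash_attn_installed and dtype_name in HALF_PRECISION_DTYPES:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print(choose_attn_implementation("bfloat16"))
```

The returned string can be passed as attn_implementation to from_pretrained in the snippet above.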

Citation

If you find our paper and code useful in your research, please consider giving a star and a citation.


@article{Qwen2.5-Omni,
  title={Qwen2.5-Omni Technical Report},
  author={Jin Xu and Zhifang Guo and Jinzheng He and Hangrui Hu and Ting He and Shuai Bai and Keqin Chen and Jialin Wang and Yang Fan and Kai Dang and Bin Zhang and Xiong Wang and Yunfei Chu and Junyang Lin},
  journal={arXiv preprint arXiv:2503.20215},
  year={2025}
}