Qwen3-Omni

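Overview

Introduction

Qwen3-Omni is a natively end-to-end, multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features:

State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. The model achieves strong audio and audio-video results without regressing on unimodal text and image performance. It reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; its ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.
Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
- Speech input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
- Speech output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.
Novel architecture: An MoE-based Thinker-Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that minimizes latency.
Real-time audio/video interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses.
Flexible control: Customize model behavior through system prompts for fine-grained control and easy adaptation.
Detailed audio captioner: Qwen3-Omni-30B-A3B-Captioner is now open source. It is a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

Model Architecture

Usage Example Cookbooks

Qwen3-Omni supports a wide range of multimodal application scenarios, covering domain tasks that involve audio, image, video, and audio-visual modalities. Below are several cookbooks that demonstrate Qwen3-Omni usage cases, each including actual execution logs. You can first follow the Quickstart guide to download the model and install the necessary inference dependencies, then run and experiment locally: try modifying the prompts or switching model types to explore what Qwen3-Omni can do.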

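Category	Cookbook	Description
Audio	Speech Recognition	Speech recognition, supporting multiple languages and long audio.
	Speech Translation	Speech-to-text and speech-to-speech translation.
	Music Analysis	Detailed analysis and appreciation of any music, including style, genre, and rhythm.
	Sound Analysis	Description and analysis of various sound effects and audio signals.
	Audio Caption	Audio captioning: a detailed description of any audio input.
	Mixed Audio Analysis	Analysis of mixed audio content, such as audio containing speech, music, and environmental sounds.
Visual	OCR	Optical character recognition for complex images.
	Object Grounding	Object detection and grounding.
	Image Question	Answering arbitrary questions about any image.
	Image Math	Solving complex math problems in images, highlighting the capabilities of the Thinking model.
	Video Description	Detailed description of video content.
	Video Navigation	Generating navigation instructions from first-person motion videos.
	Video Scene Transition	Analysis of scene transitions in videos.
Audio-Visual	Audio Visual Question	Answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video.
	Audio Visual Interaction	Interactive communication with the model using audio-visual inputs, including task specification via audio.
	Audio Visual Dialogue	Conversational interaction with the model using audio-visual inputs, showing its capabilities in casual chat and assistant-like tasks.
Agent	Audio Function Call	Function calling driven by audio input, enabling agent-like behavior.
Downstream Task Fine-tuning	Omni Captioner	Introduction to and capability showcase of Qwen3-Omni-30B-A3B-Captioner, a downstream model fine-tuned from Qwen3-Omni-30B-A3B-Instruct that illustrates the strong generalization ability of the Qwen3-Omni foundation model.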

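Quickstart

Model Description and Download

Below are descriptions of all Qwen3-Omni models. Select and download the one that fits your needs.

Model Name	Description
Qwen3-Omni-30B-A3B-Instruct	The instruct model of Qwen3-Omni-30B-A3B, containing both the thinker and the talker. It accepts audio, video, and text input and produces audio and text output. For more information, see the Qwen3-Omni Technical Report.
Qwen3-Omni-30B-A3B-Thinking	The thinking model of Qwen3-Omni-30B-A3B, containing the thinker component and equipped with chain-of-thought reasoning. It accepts audio, video, and text input and produces text output. For more information, see the Qwen3-Omni Technical Report.
Qwen3-Omni-30B-A3B-Captioner	A downstream fine-grained audio captioning model fine-tuned from Qwen3-Omni-30B-A3B-Instruct. It produces detailed, low-hallucination captions for arbitrary audio input. It contains the thinker and supports audio input with text output. For more information, see the model's cookbook.

When you load a model in Hugging Face Transformers or vLLM, the model weights are downloaded automatically based on the model name. If your runtime environment is not suited to downloading weights during execution, use the following commands to download them manually to a local directory: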

# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir ./Qwen3-Omni-30B-A3B-Thinking
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner

# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./Qwen3-Omni-30B-A3B-Thinking
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local-dir ./Qwen3-Omni-30B-A3B-Captioner

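Transformers Usage

Installation

The Hugging Face Transformers code for Qwen3-Omni has been merged, but the PyPI package has not yet been released. You therefore need to install it from source with the following commands. We strongly recommend creating a new Python environment to avoid runtime environment issues.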

# If you already have transformers installed, please uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate

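We provide a toolkit that makes it easier to handle various audio and visual inputs, with an API-like experience. It supports base64, URLs, and interleaved audio, images, and video. Install it with the following command, and make sure ffmpeg is installed on your system: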

pip install qwen-omni-utils -U

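We also recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. If you primarily use vLLM for inference, you do not need to install it, because vLLM includes FlashAttention 2 by default.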

pip install -U flash-attn --no-build-isolation

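Your hardware must also be compatible with FlashAttention 2; see the official documentation of the FlashAttention repository for details. FlashAttention 2 can be used only when the model is loaded in torch.float16 or torch.bfloat16.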
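The precision constraint above can be checked before the model is loaded. The following is a minimal sketch, not part of the Qwen3-Omni tooling: the helper name `pick_attn_implementation` is hypothetical, and it assumes that falling back to `"sdpa"` (a standard Transformers `attn_implementation` value) is acceptable for your setup when FlashAttention 2 cannot be used.

```python
import importlib.util

# Half-precision dtypes under which FlashAttention 2 can be used.
FLASH_COMPATIBLE_DTYPES = {"float16", "bfloat16"}


def pick_attn_implementation(dtype_name, flash_installed=None):
    """Return "flash_attention_2" when usable, otherwise "sdpa".

    dtype_name: dtype as a string, e.g. "bfloat16" or "torch.bfloat16".
    flash_installed: override for testing; by default the helper checks
    whether the flash_attn package can be found (hypothetical fallback logic).
    """
    if flash_installed is None:
        flash_installed = importlib.util.find_spec("flash_attn") is not None
    short_name = dtype_name.replace("torch.", "")
    if flash_installed and short_name in FLASH_COMPATIBLE_DTYPES:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation("torch.bfloat16", flash_installed=True))
print(pick_attn_implementation("float32", flash_installed=True))
```

The returned string can be passed as `attn_implementation=` to `from_pretrained`. The helper needs an explicit dtype name, so it does not cover `dtype="auto"`; in that case you would check the dtype the checkpoint resolves to.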

Code Snippet

The following snippet shows how to use Qwen3-Omni with transformers and qwen_omni_utils:

import soundfile as sf

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."}
        ],
    },
]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, 
                   audio=audios, 
                   images=images, 
                   videos=videos, 
                   return_tensors="pt", 
                   padding=True, 
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, 
                                 speaker="Ethan", 
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)
if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
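The snippet above writes the generated waveform at 24000 Hz with soundfile. If soundfile is unavailable, the same file can be produced with the Python standard library. This is a minimal sketch under two assumptions that the text above does not state: the waveform is mono, and its samples are floats in the range [-1.0, 1.0]. The helper names are hypothetical.

```python
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24000  # output sample rate used in the snippet above


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    frames = bytearray()
    for value in samples:
        # Clip out-of-range values before converting to 16-bit integers.
        clipped = max(-1.0, min(1.0, float(value)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Demo: one second of silence at 24 kHz.
demo_path = os.path.join(tempfile.gettempdir(), "qwen3_omni_demo.wav")
write_pcm16_wav(demo_path, [0.0] * SAMPLE_RATE)
print(wav_duration_seconds(demo_path))
```

With the model output, a call such as `write_pcm16_wav("output.wav", audio.reshape(-1).detach().cpu().tolist())` would play the same role as the `sf.write` call, provided the assumptions about the waveform hold.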

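Some more advanced usage examples follow.

Batch inference

When return_audio=False is set, the model can batch inputs composed of mixed samples of different types, such as text, images, audio, and video. Here is an example.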

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

# Conversation with image only
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you hear in this audio?"},
        ]
    }
]

# Conversation with pure text and system prompt
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen-Omni."}
        ],
    },
    {
        "role": "user",
        "content": "Who are you?"
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
        ],
    }
]

# Combine messages for batch processing
conversations = [conversation1, conversation2, conversation3, conversation4]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for batch inference
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, 
                   audio=audios, 
                   images=images, 
                   videos=videos, 
                   return_tensors="pt", 
                   padding=True, 
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch inference does not support returning audio
text_ids, audio = model.generate(**inputs,
                                 return_audio=False,
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)

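Use audio output or not

The model supports both text and audio output. If you do not need audio output, call model.disable_talker() after initializing the model. This saves about 10 GB of GPU memory, but the return_audio argument of generate can then only be set to False.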

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

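For a more flexible experience, we recommend deciding whether to return audio at each call to generate. When return_audio is set to False, the model returns only text, which gives faster text responses.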

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
...
text_ids, _ = model.generate(..., return_audio=False)

Change voice type of output audio

Qwen3-Omni supports changing the voice of the output audio. The "Qwen/Qwen3-Omni-30B-A3B-Instruct" checkpoint supports the following three voice types:

Voice Type	Gender	Description
Ethan	Male	A bright, upbeat voice with infectious energy and a warm, approachable vibe.
Chelsie	Female	A honeyed, velvety voice that carries a gentle warmth and luminous clarity.
Aiden	Male	A warm, laid-back American voice with a gentle, boyish charm.

Use the speaker parameter of the generate function to specify the voice type. If speaker is not specified, the default voice type is Ethan.

text_ids, audio = model.generate(..., speaker="Ethan")

text_ids, audio = model.generate(..., speaker="Chelsie")

text_ids, audio = model.generate(..., speaker="Aiden")

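vLLM Usage

Installation

We strongly recommend vLLM for inference and deployment of the Qwen3-Omni series. Because our code is still at the pull-request stage, and audio-output inference support for the Instruct model will be released in the near future, install vLLM from source with the following commands. We recommend creating a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, see the official vLLM documentation.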

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install the Transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

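Inference

Use the following code for vLLM inference. The limit_mm_per_prompt parameter specifies the maximum number of items of each modality allowed per message. Because vLLM pre-allocates GPU memory, larger values require more GPU memory; if you hit out-of-memory (OOM) errors, try reducing this value. Setting tensor_parallel_size to a value greater than 1 enables multi-GPU parallel inference, which improves concurrency and throughput. max_num_seqs is the number of sequences vLLM processes in parallel in each inference step; larger values require more GPU memory but give higher batch inference speed. See the official vLLM documentation for more details. The following is a simple example of running Qwen3-Omni with vLLM: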

import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}
            ], 
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
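Because limit_mm_per_prompt caps the number of items per modality, it can help to count the media in a conversation before sending it to the engine. The sketch below is illustrative only: the helper names are hypothetical, it counts items across all turns of one conversation on the assumption that the conversation is submitted as a single prompt, and it does not account for audio tracks extracted from videos when use_audio_in_video is enabled.

```python
from collections import Counter

# Same limits as passed to LLM(limit_mm_per_prompt=...) in the example above.
LIMIT_MM_PER_PROMPT = {"image": 3, "video": 3, "audio": 3}


def count_media(messages):
    """Count image, video, and audio items across all turns of a conversation."""
    counts = Counter()
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-text turns carry no media
        for item in content:
            if item.get("type") in ("image", "video", "audio"):
                counts[item["type"]] += 1
    return counts


def check_media_limits(messages, limits=LIMIT_MM_PER_PROMPT):
    """Return (modality, count, limit) tuples for every exceeded limit."""
    counts = count_media(messages)
    return [
        (modality, counts[modality], limit)
        for modality, limit in limits.items()
        if counts[modality] > limit
    ]


too_many_images = [
    {
        "role": "user",
        "content": [{"type": "image", "image": f"/path/to/{i}.jpg"} for i in range(4)]
        + [{"type": "text", "text": "Compare these images."}],
    }
]
print(check_media_limits(too_many_images))
```

An empty list means the conversation is within the configured limits; otherwise either reduce the media in the prompt or raise the limit, keeping the GPU memory cost in mind.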

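Some more advanced usage examples follow.

Batch inference

vLLM enables fast batch inference, which helps you process large amounts of data efficiently or run benchmarks. See the following code example: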

import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

def build_input(processor, messages, use_audio_in_video):
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=use_audio_in_video)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": use_audio_in_video,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios
    
    return inputs

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Conversation with image only
    conversation1 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
            ]
        }
    ]

    # Conversation with audio only
    conversation2 = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
                {"type": "text", "text": "What can you hear in this audio?"},
            ]
        }
    ]

    # Conversation with pure text and system prompt
    conversation3 = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are Qwen-Omni."}
            ],
        },
        {
            "role": "user",
            "content": "Who are you? Answer in one sentence."
        }
    ]

    # Conversation with mixed media
    conversation4 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/asr_fr.wav"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ],
        }
    ]
    
    USE_AUDIO_IN_VIDEO = True

    # Combine messages for batch processing
    conversations = [conversation1, conversation2, conversation3, conversation4]
    inputs = [build_input(processor, messages, USE_AUDIO_IN_VIDEO) for messages in conversations]

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    result = [outputs[i].outputs[0].text for i in range(len(outputs))]
    print(result)

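vLLM Serve Usage

vLLM serve for Qwen3-Omni currently supports only the thinker component. The use_audio_in_video parameter is not available in vLLM serve; instead, pass the video and its audio as separate inputs. Start the vLLM server with one of the following commands: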

# Qwen3-Omni-30B-A3B-Instruct for single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Instruct for multi-GPU (example on 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4
# Qwen/Qwen3-Omni-30B-A3B-Thinking for single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen/Qwen3-Omni-30B-A3B-Thinking for multi-GPU (example on 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4

You can then call the chat API, for example with curl:

curl http://localhost:8901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
        {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
    ]}
    ]
    }'
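The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above: the helper name `build_chat_request` is hypothetical, and the commented-out response handling assumes the server returns the usual OpenAI-compatible chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8901/v1/chat/completions"


def build_chat_request(image_url, audio_url, question, base_url=BASE_URL):
    """Build (but do not send) a chat-completions request like the curl call."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": question},
                ],
            },
        ]
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav",
    "What can you see and hear? Answer in one sentence.",
)
print(request.full_url)

# Sending requires a running server started with one of the commands above:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect and reuse across calls.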

Usage Tips (Recommended Reading)

Minimum GPU memory requirements

Model	Precision	15s Video	30s Video	60s Video	120s Video
Qwen3-Omni-30B-A3B-Instruct	BF16	78.85 GB	88.52 GB	107.74 GB	144.81 GB
Qwen3-Omni-30B-A3B-Thinking	BF16	68.74 GB	77.79 GB	95.76 GB	131.65 GB

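Note: The table above shows the theoretical minimum memory requirements for inference with transformers in BF16 precision, measured with attn_implementation="flash_attention_2". The Instruct model includes both the thinker and talker components, whereas the Thinking model includes only the thinker.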
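The table can be turned into a quick feasibility check. The sketch below copies the table values and, as a conservative assumption of this example rather than a documented rule, rounds a video length up to the next listed length. The function names are hypothetical, and the figures remain theoretical minimums.

```python
# Minimum GPU memory (GB) from the table above: BF16, Transformers,
# attn_implementation="flash_attention_2". Keys are video lengths in seconds.
MIN_MEMORY_GB = {
    "Qwen3-Omni-30B-A3B-Instruct": {15: 78.85, 30: 88.52, 60: 107.74, 120: 144.81},
    "Qwen3-Omni-30B-A3B-Thinking": {15: 68.74, 30: 77.79, 60: 95.76, 120: 131.65},
}


def min_memory_gb(model_name, video_seconds):
    """Return the table entry for the shortest listed video length that covers
    video_seconds, or None if the video is longer than any listed length."""
    for length in sorted(MIN_MEMORY_GB[model_name]):
        if video_seconds <= length:
            return MIN_MEMORY_GB[model_name][length]
    return None


def fits(model_name, video_seconds, available_gb):
    """True if the available memory meets the listed minimum for this video."""
    required = min_memory_gb(model_name, video_seconds)
    return required is not None and available_gb >= required


print(min_memory_gb("Qwen3-Omni-30B-A3B-Instruct", 20))
print(fits("Qwen3-Omni-30B-A3B-Thinking", 45, available_gb=96.0))
```

Videos longer than 120 seconds fall outside the table, so the helper returns None instead of extrapolating.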

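Prompt for Audio-Visual Interaction

When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio and the audio serves as the query, we recommend the following system prompt. This setup helps the model keep strong reasoning ability while better taking on interactive roles such as a smart assistant. The text generated by the thinker is also more readable, with a natural, conversational tone and without complex formatting that is hard to vocalize, which makes the talker's audio output more stable and fluent. You can customize the user_system_prompt field in the system prompt to include a persona or other specific descriptions.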

user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."
message = {
    "role": "system",
    "content": [
          {"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age.\nYou are communicating with the user.\nIn user messages, “I/me/my/we/our” refer to the user and “you/your” refer to the assistant. In your replies, address the user as “you/your” and yourself as “I/me/my”; never mirror the user’s pronouns—always shift perspective. Keep original pronouns only in direct quotes; if a reference is unclear, ask a brief clarifying question.\nInteract with users using short(no more than 50 words), brief, straightforward language, maintaining a natural tone.\nNever use formal phrasing, mechanical expressions, bullet points, overly structured language. \nYour output must consist only of the spoken content you want the user to hear. \nDo not include any descriptions of actions, emotions, sounds, or voice changes. \nDo not use asterisks, brackets, parentheses, or any other symbols to indicate tone or actions. \nYou must answer users' audio or text questions, do not directly describe the video content. \nYou should communicate in the same language strictly as the user unless they request otherwise.\nWhen you are uncertain (e.g., you can't see/hear clearly, don't understand, or the user makes a comment rather than asking a question), use appropriate questions to guide the user to continue the conversation.\nKeep replies concise and conversational, as if talking face-to-face."}
    ]
}

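Best Practices for the Thinking Model

The Qwen3-Omni-30B-A3B-Thinking model is designed primarily for understanding and processing multimodal inputs, including text, audio, image, and video. For best performance, we recommend including an explicit textual instruction or task description alongside the multimodal input in each turn. This clarifies the intent and significantly improves the model's ability to apply its reasoning. For example: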

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Analyze this audio, image, and video together."},
        ], 
    }
]
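A small check can catch user turns that carry media but no textual instruction before they reach the Thinking model. This is an illustrative sketch; the function name is hypothetical and the check only looks at the message structure shown above.

```python
def turns_missing_text_instruction(messages):
    """Return indices of user turns that lack a non-empty text instruction."""
    missing = []
    for index, message in enumerate(messages):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            # A plain string is itself the instruction; flag it only if blank.
            if not content.strip():
                missing.append(index)
            continue
        has_text = any(
            item.get("type") == "text" and item.get("text", "").strip()
            for item in content
        )
        if not has_text:
            missing.append(index)
    return missing


example = [
    {"role": "user", "content": [{"type": "audio", "audio": "/path/to/audio.wav"}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
print(turns_missing_text_instruction(example))
```

An empty result means every user turn includes an instruction; any returned index points at a turn worth amending.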

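Use audio in video

In multimodal interaction, user-provided videos are often accompanied by audio, such as spoken questions or sounds from events in the video. This information helps the model provide a better interactive experience. We provide the following options for deciding whether to use the audio in the video.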

# In data preprocessing
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

# For Transformers
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", 
                   padding=True, use_audio_in_video=True)
text_ids, audio = model.generate(..., use_audio_in_video=True)

# For vLLM
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = {
    'prompt': text,
    'multi_modal_data': {},
    "mm_processor_kwargs": {
        "use_audio_in_video": True,
    },
}

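Note that during a multi-turn conversation, the use_audio_in_video parameter must be set consistently across these steps; otherwise, unexpected results may occur.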
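One way to keep the flag consistent is to define it once and derive every call site's arguments from that single value. The class below is a hypothetical sketch, not part of qwen_omni_utils, Transformers, or vLLM; it only builds the keyword arguments that the calls shown above accept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioInVideoSetting:
    """Single source of truth for use_audio_in_video across all call sites."""

    enabled: bool = True

    def preprocess_kwargs(self):
        # For process_mm_info(messages, **kwargs)
        return {"use_audio_in_video": self.enabled}

    def processor_kwargs(self):
        # For processor(text=..., audio=..., images=..., videos=..., **kwargs)
        return {"use_audio_in_video": self.enabled}

    def generate_kwargs(self):
        # For model.generate(**inputs, **kwargs) with Transformers
        return {"use_audio_in_video": self.enabled}

    def vllm_inputs(self, prompt):
        # Input dict for llm.generate([...]) with vLLM
        return {
            "prompt": prompt,
            "multi_modal_data": {},
            "mm_processor_kwargs": {"use_audio_in_video": self.enabled},
        }


setting = AudioInVideoSetting(enabled=True)
print(setting.generate_kwargs())
```

Because the dataclass is frozen, the flag cannot be changed partway through a conversation; create one instance per conversation and reuse it for every turn.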

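Evaluation

Performance of Qwen3-Omni

Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the overall SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o.

Text to Text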

		GPT-4o-0327	Qwen3-235B-A22B Non Thinking	Qwen3-30B-A3B-Instruct-2507	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
General Tasks	MMLU-Redux	91.3	89.2	89.3	86.6	86.8
General Tasks	GPQA	66.9	62.9	70.4	69.6	69.7
Reasoning	AIME25	26.7	24.7	61.3	65.0	65.9
Reasoning	ZebraLogic	52.6	37.7	90.0	76.0	76.1
Code	MultiPL-E	82.7	79.3	83.8	81.4	81.5
Alignment Tasks	IFEval	83.9	83.2	84.7	81.0	81.7
	Creative Writing v3	84.9	80.4	86.0	80.6	81.8
	WritingBench	75.5	77.0	85.5	82.6	83.0
Agent	BFCL-v3	66.5	68.0	65.1	64.4	65.0
Multilingual Tasks	MultiIF	70.4	70.2	67.9	64.0	64.7
Multilingual Tasks	PolyMATH	25.5	27.0	43.1	37.9	39.3

		Gemini-2.5-Flash Thinking	Qwen3-235B-A22B Thinking	Qwen3-30B-A3B-Thinking-2507	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
General Tasks	MMLU-Redux	92.1	92.7	91.4	88.8	89.7
General Tasks	GPQA	82.8	71.1	73.4	73.1	73.1
Reasoning	AIME25	72.0	81.5	85.0	73.7	74.0
Reasoning	LiveBench 20241125	74.3	77.1	76.8	71.8	70.3
Code	MultiPL-E	84.5	79.9	81.3	80.6	81.0
Alignment Tasks	IFEval	89.8	83.4	88.9	85.1	85.2
	Arena-Hard v2	56.7	61.5	56.0	55.1	57.8
	Creative Writing v3	85.0	84.6	84.4	82.5	83.6
	WritingBench	83.9	80.3	85.0	85.5	85.9
Agent	BFCL-v3	68.6	70.8	72.4	63.2	64.5
Multilingual Tasks	MultiIF	74.4	71.9	76.4	72.9	73.2
Multilingual Tasks	PolyMATH	49.8	54.7	52.6	47.1	48.7

Audio to Text

	Seed-ASR	Voxtral-Mini	Voxtral-Small	GPT-4o-Transcribe	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
EN & ZH ASR (WER)
Wenetspeech net | meeting	4.66 | 5.69	24.30 | 31.53	20.33 | 26.08	15.30 | 32.27	14.43 | 13.47	5.91 | 7.65	4.69 | 5.89	4.62 | 5.75
Librispeech clean | other	1.58 | 2.84	1.88 | 4.12	1.56 | 3.30	1.39 | 3.75

	GPT-4o-Audio	Gemini-2.5-Flash	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Instruct	Qwen3-Omni-Flash-Thinking
VoiceBench
AlpacaEval	95.6	96.1	94.3	89.9	94.8	96.4	95.4	96.8
CommonEval	89.8	88.3	88.4	76.7	90.8	90.5	91.0	90.9
WildVoice	91.6	92.1	93.4	77.7	91.6	90.5	92.3	90.9
SD-QA	75.5	84.5	90.1	56.4	76.9	78.1	76.8	78.5
MMSU	80.3	66.1	71.1	61.7	68.1	83.0	68.4	84.3
OpenBookQA	89.2	56.9	92.3	80.9	89.7	94.3	91.4	95.0
BBH	84.1	83.9	92.6	66.7	80.4	88.9	80.6	89.6
IFEval	76.0	83.8	85.7	53.5	77.8	80.6	75.2	80.8
AdvBench	98.7	98.9	98.1	99.2	99.3	97.2	99.4	98.9
Overall	86.8	83.4	89.6	73.6	85.5	88.8	85.6	89.5
Audio Reasoning
MMAU-v05.15.25	62.5	71.8	77.4	65.5	77.5	75.4	77.6	76.5
MMSU	56.4	70.2	77.7	62.6	69.0	70.2	69.1	71.3

	Best Specialist Models	GPT-4o-Audio	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
RUL-MuchoMusic	47.6 (Audio Flamingo 3)	36.1	49.4	47.3	52.0	52.1
GTZAN Acc.	87.9 (CLaMP 3)	76.5	81.0	81.7	93.0	93.1
MTG Genre Micro F1	35.8 (MuQ-MuLan)	25.3	32.6	32.5	39.0	39.5
MTG Mood/Theme Micro F1	10.9 (MuQ-MuLan)	11.3	14.1	8.9	21.0	21.7
MTG Instrument Micro F1	39.8 (MuQ-MuLan)	34.2	33.0	22.6	40.5	40.7
MTG Top50 Micro F1	33.2 (MuQ-MuLan)

Vision to Text

Datasets	Gemini-2.5-flash-thinking	InternVL-3.5-241B-A28B	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
General Visual Question Answering
MMStar	75.5	77.9	74.9	75.5
HallusionBench	61.1	57.3	62.8	63.4
MM-MT-Bench	7.8	–	8.0	8.0
Math & STEM
MMMU_val	76.9	77.7	75.6	75.0
MMMU_pro	65.8	–	60.5	60.8
MathVista_mini	77.6	82.7	80.0	81.2
MathVision_full	62.3	63.9	62.9	63.8
Document Understanding
AI2D_test	88.6	87.3	86.1	86.8
ChartQA_test	–	88.0	89.5	89.3
Counting
CountBench	88.6	–	88.6	92.5
Video Understanding
Video-MME	79.6	72.9	69.7	69.8
LVBench	64.5	–	49.0	49.5
MLVU	82.1	78.2	72.9	73.9

Audiovisual to Text

Datasets	Previous Open-source SoTA	Gemini-2.5-Flash	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
WorldSense	47.1	50.9	45.4	54.0	54.1

Datasets	Previous Open-source SoTA	Gemini-2.5-Flash-Thinking	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
DailyOmni	69.8	72.7	75.8	76.2
VideoHolmes	55.6	49.5	57.3	57.3

Zero-shot Speech Generation

Datasets	Model	Performance
	Content Consistency
SEED test-zh | test-en	Seed-TTS_ICL	1.11 | 2.24
	Seed-TTS_RL	1.00 | 1.94
	MaskGCT	2.27 | 2.62
	E2 TTS	1.97 | 2.19
	F5-TTS	1.56 | 1.83
	Spark TTS	1.20 | 1.98
	CosyVoice 2	1.45 | 2.57
	CosyVoice 3	0.71 | 1.45
	Qwen2.5-Omni-7B	1.42 | 2.33
	Qwen3-Omni-30B-A3B	1.07 | 1.39

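Multilingual Speech Generation

Language	Content Consistency			Speaker Similarity
Language	Qwen3-Omni-30B-A3B	MiniMax	ElevenLabs	Qwen3-Omni-30B-A3B	MiniMax	ElevenLabs
Chinese	0.716	2.252	16.026	0.772	0.780	0.677
English	1.069	2.164	2.339	0.773	0.756	0.613
German	0.777	1.906	0.572	0.738	0.733	0.614
Italian	1.067	1.543	1.743

Setting for Evaluation

Decoding strategy: For the Qwen3-Omni series across all evaluation benchmarks, Instruct models use greedy decoding without sampling during generation. For Thinking models, take the decoding parameters from the generation_config.json file in the checkpoint.
Benchmark-specific formatting: Most evaluation benchmarks come with their own ChatML formatting for embedding the question or prompt. Note that all video data was set to fps=2 during evaluation.
Default prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings:

Task Type	Prompt
Automatic Speech Recognition (ASR) for Chinese	请将这段中文语音转换为纯文本。
ASR for other languages	Transcribe the audio into text.
Speech-to-Text Translation (S2TT)	Listen to the provided <source_language> speech and produce a translation in <target_language> text.
Song Lyrics Recognition	请将歌曲歌词转录为文本，不添加任何标点符号，用换行符分隔各行，仅输出歌词，无需额外解释。

System prompt: No system prompt should be set for any evaluation benchmark.
Input sequence: The question or prompt should be provided as user text. Unless the benchmark specifies otherwise, the text should come after the multimodal data in the sequence. For example: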

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe the audio, image and video."},
        ],
    },
]
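The evaluation conventions above (no system prompt, multimodal data first, text prompt last) can be captured in a small builder. This is a sketch with hypothetical helper names; filling the S2TT placeholders with plain language names such as "English" is an assumption of this example, not something the settings above specify.

```python
S2TT_TEMPLATE = (
    "Listen to the provided <source_language> speech and produce a "
    "translation in <target_language> text."
)


def s2tt_prompt(source_language, target_language):
    """Fill the S2TT prompt template with concrete language names."""
    return (
        S2TT_TEMPLATE
        .replace("<source_language>", source_language)
        .replace("<target_language>", target_language)
    )


def build_eval_messages(media_items, prompt):
    """Build a single user turn: multimodal items first, the text prompt last,
    and no system message."""
    content = [dict(item) for item in media_items]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_eval_messages(
    [{"type": "audio", "audio": "/path/to/audio.wav"}],
    s2tt_prompt("English", "Chinese"),
)
print(messages[0]["content"][-1]["text"])
```

The resulting list has the same shape as the messages example above and can be passed to processor.apply_chat_template in the same way.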

Qwen3-Omni

概述

简介

跨模态性能领先：采用早期文本优先预训练与混合多模态训练，实现原生多模态支持。在音频和音视频任务上表现出色的同时，单模态文本和图像性能无退化。在 36 项音频/视频基准测试中，有 22 项达到 SOTA，36 项中 32 项达到开源 SOTA；语音识别、音频理解及语音对话性能与 Gemini 2.5 Pro 相当。
多语言支持：支持 119 种文本语言、19 种语音输入语言和 10 种语音输出语言。
- 语音输入：英语、中文、韩语、日语、德语、俄语、意大利语、法语、西班牙语、葡萄牙语、马来语、荷兰语、印尼语、土耳其语、越南语、粤语、阿拉伯语、乌尔都语。
- 语音输出：英语、中文、法语、德语、俄语、意大利语、西班牙语、葡萄牙语、日语、韩语。
创新架构：基于 MoE 的 Thinker-Talker 设计，结合 AuT 预训练实现强大的通用表征，并采用多码本设计将延迟降至最低。
实时音视频交互：低延迟流式处理，支持自然的对话轮替，可即时生成文本或语音响应。
灵活可控：通过系统提示词自定义模型行为，实现精细化控制和便捷适配。
精细化音频描述生成：Qwen3-Omni-30B-A3B-Captioner 现已开源：这是一款通用、高度详细且低幻觉的音频描述生成模型，填补了开源社区的关键空白。

模型架构

使用场景示例指南

类别	示例指南	描述
音频	语音识别	语音识别，支持多语言及长音频。
	语音翻译	语音转文本/语音转语音翻译。
	音乐分析	对任意音乐进行详细分析与鉴赏，包括风格、流派、节奏等。
	声音分析	对各类音效及音频信号进行描述与分析。
	音频描述	音频内容生成描述，对任意音频输入进行详细说明。
	混合音频分析	对混合音频内容进行分析，例如包含语音、音乐及环境音的音频。
视觉	OCR	复杂图像的光学字符识别。
	目标定位	目标检测与定位。
	图像问答	回答关于任意图像的各类问题。
	图像数学	解决图像中的复杂数学问题，突出展示思维模型的能力。
	视频描述	对视频内容进行详细描述。
	视频导航	根据第一人称运动视频生成导航指令。
	视频场景转换	分析视频中的场景转换。
音视频	音视频问答	在音视频场景下回答各类问题，展示模型对音视频时间对齐的建模能力。
	音视频交互	通过音视频输入与模型进行交互式沟通，包括通过音频指定任务。
	音视频对话	通过音视频输入与模型进行对话交互，展示其在日常聊天和助手类任务中的能力。
智能体	音频函数调用	通过音频输入执行函数调用，实现智能体类行为。
下游任务微调	全模态描述生成器	Qwen3-Omni-30B-A3B-Captioner 的介绍与能力展示，该模型是基于 Qwen3-Omni-30B-A3B-Instruct 微调的下游模型，体现了 Qwen3-Omni 基础模型的强大泛化能力。

快速开始

模型说明与下载

以下是所有Qwen3-Omni模型的说明，请选择并下载符合您需求的模型。

模型名称	说明
Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-30B-A3B的指令微调模型，包含思考器（thinker）和表达器（talker），支持音频、视频和文本输入，输出音频和文本。更多信息请阅读Qwen3-Omni技术报告。
Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-30B-A3B的思考模型，包含思考器（thinker）组件，具备思维链推理能力，支持音频、视频和文本输入，输出文本。更多信息请阅读Qwen3-Omni技术报告。
Qwen3-Omni-30B-A3B-Captioner	基于Qwen3-Omni-30B-A3B-Instruct微调的下游音频细粒度描述模型，可为任意音频输入生成详细且低幻觉的描述文本。该模型包含思考器（thinker），支持音频输入和文本输出。更多信息可参考模型的使用指南。

# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir ./Qwen3-Omni-30B-A3B-Thinking
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner

# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./Qwen3-Omni-30B-A3B-Thinking
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local-dir ./Qwen3-Omni-30B-A3B-Captioner

Transformers 使用方法

安装

# If you already have transformers installed, please uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate

pip install qwen-omni-utils -U

pip install -U flash-attn --no-build-isolation

代码片段

以下是使用 transformers 和 qwen_omni_utils 调用 Qwen3-Omni 的代码片段：

import soundfile as sf

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."}
        ],
    },
]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, 
                   audio=audios, 
                   images=images, 
                   videos=videos, 
                   return_tensors="pt", 
                   padding=True, 
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, 
                                 speaker="Ethan", 
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)
if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )

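Here are some more advanced usage examples.

Batch inference

When return_audio=False is set, the model can batch inputs made up of mixed sample types, such as text, images, audio, and videos. Here is an example.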

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

# Conversation with image only
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you hear in this audio?"},
        ]
    }
]

# Conversation with pure text and system prompt
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen-Omni."}
        ],
    },
    {
        "role": "user",
        "content": "Who are you?"
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
        ],
    }
]

# Combine messages for batch processing
conversations = [conversation1, conversation2, conversation3, conversation4]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for batch inference
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, 
                   audio=audios, 
                   images=images, 
                   videos=videos, 
                   return_tensors="pt", 
                   padding=True, 
                   use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch inference does not support returning audio
text_ids, audio = model.generate(**inputs,
                                 return_audio=False,
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)

Use audio output or not

# Option 1: disable the talker right after loading when audio output is never needed.
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()

# Option 2: keep the talker loaded and decide per generate() call whether to return audio.
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
...
text_ids, _ = model.generate(..., return_audio=False)

Change voice type of output audio

Qwen3-Omni supports changing the voice of the output audio. The "Qwen/Qwen3-Omni-30B-A3B-Instruct" checkpoint supports the following three voice types:

Voice Type	Gender	Description
Ethan	Male	A bright, upbeat voice with infectious energy and a warm, approachable vibe.
Chelsie	Female	A honeyed, velvety voice that carries a gentle warmth and luminous clarity.
Aiden	Male	A warm, laid-back American voice with a gentle, boyish charm.

Users can use the speaker parameter of the generate function to specify the voice type. By default, if speaker is not specified, the voice type is Ethan.

text_ids, audio = model.generate(..., speaker="Ethan")

text_ids, audio = model.generate(..., speaker="Chelsie")

text_ids, audio = model.generate(..., speaker="Aiden")
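
When the voice name comes from user input or a config file, it can help to validate it before calling generate. The following is a minimal sketch; resolve_speaker is a hypothetical helper (not part of transformers or qwen_omni_utils), and the list of voices and the Ethan default are taken from the description above.

```python
SUPPORTED_SPEAKERS = ("Ethan", "Chelsie", "Aiden")
DEFAULT_SPEAKER = "Ethan"


def resolve_speaker(name=None):
    """Return a supported voice name, falling back to the default when unset."""
    if name is None:
        return DEFAULT_SPEAKER
    # Accept any capitalisation, e.g. "chelsie" or "AIDEN".
    for speaker in SUPPORTED_SPEAKERS:
        if speaker.lower() == name.strip().lower():
            return speaker
    raise ValueError(f"Unknown speaker {name!r}; expected one of {SUPPORTED_SPEAKERS}")


print(resolve_speaker())
print(resolve_speaker("chelsie"))
```

The result would then be passed along as model.generate(..., speaker=resolve_speaker(user_choice)).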

vLLM Usage

Installation

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install the Transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

Inference

import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}
            ], 
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)

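Here are some more advanced usage examples.

Batch inference

vLLM enables fast batch inference, which helps you process large volumes of data or run benchmarks efficiently. Refer to the following code example: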

import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

def build_input(processor, messages, use_audio_in_video):
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=use_audio_in_video)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": use_audio_in_video,
        },
    }

    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios
    
    return inputs

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
            max_num_seqs=8,
            max_model_len=32768,
            seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Conversation with image only
    conversation1 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "text", "text": "What can you see in this image? Answer in one sentence."},
            ]
        }
    ]

    # Conversation with audio only
    conversation2 = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
                {"type": "text", "text": "What can you hear in this audio?"},
            ]
        }
    ]

    # Conversation with pure text and system prompt
    conversation3 = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are Qwen-Omni."}
            ],
        },
        {
            "role": "user",
            "content": "Who are you? Answer in one sentence."
        }
    ]

    # Conversation with mixed media
    conversation4 = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/asr_fr.wav"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ],
        }
    ]
    
    USE_AUDIO_IN_VIDEO = True

    # Combine messages for batch processing
    conversations = [conversation1, conversation2, conversation3, conversation4]
    inputs = [build_input(processor, messages, USE_AUDIO_IN_VIDEO) for messages in conversations]

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    result = [outputs[i].outputs[0].text for i in range(len(outputs))]
    print(result)

vLLM Serve Usage

# Qwen3-Omni-30B-A3B-Instruct for single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Instruct for multi-GPU (example on 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4
# Qwen/Qwen3-Omni-30B-A3B-Thinking for single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen/Qwen3-Omni-30B-A3B-Thinking for multi-GPU (example on 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4

Then you can use the chat API as follows (via curl, for example):

curl http://localhost:8901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
        {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
    ]}
    ]
    }'
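
The same request can be issued from Python. The sketch below uses only the standard library; build_chat_payload and send_chat_request are illustrative names, and the reply layout (choices[0].message.content) assumes the OpenAI-compatible response format that the vLLM server exposes. Only the payload construction runs as written; the network call is left commented out because it needs a running server.

```python
import json
import urllib.request

API_URL = "http://localhost:8901/v1/chat/completions"


def build_chat_payload(image_url, audio_url, question,
                       system_prompt="You are a helpful assistant."):
    """Build the same JSON body as the curl example."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "audio_url", "audio_url": {"url": audio_url}},
                {"type": "text", "text": question},
            ]},
        ]
    }


def send_chat_request(payload, url=API_URL, timeout=300):
    """POST the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav",
    "What can you see and hear? Answer in one sentence.",
)
print(json.dumps(payload, indent=2))
# With one of the vllm serve commands above running:
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```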

Usage Tips (Recommended Reading)

Minimum GPU memory requirements

Model	Precision	15s Video	30s Video	60s Video	120s Video
Qwen3-Omni-30B-A3B-Instruct	BF16	78.85 GB	88.52 GB	107.74 GB	144.81 GB
Qwen3-Omni-30B-A3B-Thinking	BF16	68.74 GB	77.79 GB	95.76 GB	131.65 GB
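
For a quick pre-flight check, the table can be encoded as a lookup. This is a rough sketch: min_memory_gb and fits are hypothetical helpers, video lengths between the listed points are rounded up to the next measured length (a conservative assumption, not something the table states), and comparing against memory pooled across several GPUs ignores any parallelism overhead.

```python
# Minimum GPU memory in GB (BF16), copied from the table above.
VIDEO_LENGTHS_S = (15, 30, 60, 120)
MIN_MEMORY_GB = {
    "Qwen3-Omni-30B-A3B-Instruct": (78.85, 88.52, 107.74, 144.81),
    "Qwen3-Omni-30B-A3B-Thinking": (68.74, 77.79, 95.76, 131.65),
}


def min_memory_gb(model_name, video_seconds):
    """Return the entry for the shortest listed video length that covers the request."""
    for length, memory in zip(VIDEO_LENGTHS_S, MIN_MEMORY_GB[model_name]):
        if video_seconds <= length:
            return memory
    raise ValueError("Videos longer than 120 s are outside the measured range")


def fits(model_name, video_seconds, total_gpu_memory_gb):
    """Check whether the available memory meets the listed minimum."""
    return total_gpu_memory_gb >= min_memory_gb(model_name, video_seconds)


print(min_memory_gb("Qwen3-Omni-30B-A3B-Instruct", 45))
print(fits("Qwen3-Omni-30B-A3B-Thinking", 15, 80))
```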

Prompt for Audio-Visual Interaction

user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."
message = {
    "role": "system",
    "content": [
          {"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age.\nYou are communicating with the user.\nIn user messages, “I/me/my/we/our” refer to the user and “you/your” refer to the assistant. In your replies, address the user as “you/your” and yourself as “I/me/my”; never mirror the user’s pronouns—always shift perspective. Keep original pronouns only in direct quotes; if a reference is unclear, ask a brief clarifying question.\nInteract with users using short(no more than 50 words), brief, straightforward language, maintaining a natural tone.\nNever use formal phrasing, mechanical expressions, bullet points, overly structured language. \nYour output must consist only of the spoken content you want the user to hear. \nDo not include any descriptions of actions, emotions, sounds, or voice changes. \nDo not use asterisks, brackets, parentheses, or any other symbols to indicate tone or actions. \nYou must answer users' audio or text questions, do not directly describe the video content. \nYou should communicate in the same language strictly as the user unless they request otherwise.\nWhen you are uncertain (e.g., you can't see/hear clearly, don't understand, or the user makes a comment rather than asking a question), use appropriate questions to guide the user to continue the conversation.\nKeep replies concise and conversational, as if talking face-to-face."}
    ]
}

Best Practices for the Thinking Model

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Analyze this audio, image, and video together."},
        ], 
    }
]

Use audio in video

# In data preprocessing
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

# For Transformers
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", 
                   padding=True, use_audio_in_video=True)
text_ids, audio = model.generate(..., use_audio_in_video=True)

# For vLLM
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = {
    'prompt': text,
    'multi_modal_data': {},
    "mm_processor_kwargs": {
        "use_audio_in_video": True,
    },
}

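Note that during a multi-round conversation, the use_audio_in_video parameter must be set to the same value in all of these steps; otherwise, unexpected results may occur.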
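
One way to keep the flag consistent is to derive every keyword-argument dict from a single value. The sketch below is only an illustration; audio_in_video_kwargs is a hypothetical helper, and the dictionary keys are labels for the steps shown above rather than library names.

```python
def audio_in_video_kwargs(use_audio_in_video: bool) -> dict:
    """Build the keyword arguments for every step from one flag."""
    shared = {"use_audio_in_video": use_audio_in_video}
    return {
        "process_mm_info": dict(shared),           # data preprocessing
        "processor": dict(shared),                 # Transformers processor call
        "generate": dict(shared),                  # Transformers model.generate
        "vllm_mm_processor_kwargs": dict(shared),  # vLLM "mm_processor_kwargs" entry
    }


kwargs = audio_in_video_kwargs(True)
print(kwargs)
```

With this in place, calls would look like process_mm_info(messages, **kwargs["process_mm_info"]) and model.generate(..., **kwargs["generate"]), so the value cannot drift between steps.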

Evaluation

Performance of Qwen3-Omni

Text -> Text

		GPT-4o-0327	Qwen3-235B-A22B Non Thinking	Qwen3-30B-A3B-Instruct-2507	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
General Tasks	MMLU-Redux	91.3	89.2	89.3	86.6	86.8
General Tasks	GPQA	66.9	62.9	70.4	69.6	69.7
Reasoning	AIME25	26.7	24.7	61.3	65.0	65.9
Reasoning	ZebraLogic	52.6	37.7	90.0	76.0	76.1
Code	MultiPL-E	82.7	79.3	83.8	81.4	81.5
Alignment Tasks	IFEval	83.9	83.2	84.7	81.0	81.7
	Creative Writing v3	84.9	80.4	86.0	80.6	81.8
	WritingBench	75.5	77.0	85.5	82.6	83.0
Agent	BFCL-v3	66.5	68.0	65.1	64.4	65.0
Multilingual Tasks	MultiIF	70.4	70.2	67.9	64.0	64.7
Multilingual Tasks	PolyMATH	25.5	27.0	43.1	37.9	39.3

		Gemini-2.5-Flash Thinking	Qwen3-235B-A22B Thinking	Qwen3-30B-A3B-Thinking-2507	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
General Tasks	MMLU-Redux	92.1	92.7	91.4	88.8	89.7
General Tasks	GPQA	82.8	71.1	73.4	73.1	73.1
Reasoning	AIME25	72.0	81.5	85.0	73.7	74.0
Reasoning	LiveBench 20241125	74.3	77.1	76.8	71.8	70.3
Code	MultiPL-E	84.5	79.9	81.3	80.6	81.0
Alignment Tasks	IFEval	89.8	83.4	88.9	85.1	85.2
	Arena-Hard v2	56.7	61.5	56.0	55.1	57.8
	Creative Writing v3	85.0	84.6	84.4	82.5	83.6
	WritingBench	83.9	80.3	85.0	85.5	85.9
Agent	BFCL-v3	68.6	70.8	72.4	63.2	64.5
Multilingual Tasks	MultiIF	74.4	71.9	76.4	72.9	73.2
Multilingual Tasks	PolyMATH	49.8	54.7	52.6	47.1	48.7

Audio -> Text

	Seed-ASR	Voxtral-Mini	Voxtral-Small	GPT-4o-Transcribe	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
Chinese & English ASR (WER)
Wenetspeech net | meeting	4.66 | 5.69	24.30 | 31.53	20.33 | 26.08	15.30 | 32.27	14.43 | 13.47	5.91 | 7.65	4.69 | 5.89	4.62 | 5.75
Librispeech clean | other	1.58 | 2.84	1.88 | 4.12	1.56 | 3.30	1.39 | 3.75
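
The ASR rows above report word error rate (WER). As a reminder of what the metric measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. It assumes whitespace tokenization and no text normalization, which is not necessarily how the figures in the table were produced; Chinese text, which has no whitespace word boundaries, would need a different tokenization. The function returns a fraction, whereas the table values appear to be percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```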

	GPT-4o-Audio	Gemini-2.5-Flash	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Instruct	Qwen3-Omni-Flash-Thinking
VoiceBench
AlpacaEval	95.6	96.1	94.3	89.9	94.8	96.4	95.4	96.8
CommonEval	89.8	88.3	88.4	76.7	90.8	90.5	91.0	90.9
WildVoice	91.6	92.1	93.4	77.7	91.6	90.5	92.3	90.9
SD-QA	75.5	84.5	90.1	56.4	76.9	78.1	76.8	78.5
MMSU	80.3	66.1	71.1	61.7	68.1	83.0	68.4	84.3
OpenBookQA	89.2	56.9	92.3	80.9	89.7	94.3	91.4	95.0
BBH	84.1	83.9	92.6	66.7	80.4	88.9	80.6	89.6
IFEval	76.0	83.8	85.7	53.5	77.8	80.6	75.2	80.8
AdvBench	98.7	98.9	98.1	99.2	99.3	97.2	99.4	98.9
Overall	86.8	83.4	89.6	73.6	85.5	88.8	85.6	89.5
Audio Reasoning
MMAU-v05.15.25	62.5	71.8	77.4	65.5	77.5	75.4	77.6	76.5
MMSU	56.4	70.2	77.7	62.6	69.0	70.2	69.1	71.3

	Best Specialist Models	GPT-4o-Audio	Gemini-2.5-Pro	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
RUL-MuchoMusic	47.6 (Audio Flamingo 3)	36.1	49.4	47.3	52.0	52.1
GTZAN Acc.	87.9 (CLaMP 3)	76.5	81.0	81.7	93.0	93.1
MTG Genre Micro F1	35.8 (MuQ-MuLan)	25.3	32.6	32.5	39.0	39.5
MTG Mood/Theme Micro F1	10.9 (MuQ-MuLan)	11.3	14.1	8.9	21.0	21.7
MTG Instrument Micro F1	39.8 (MuQ-MuLan)	34.2	33.0	22.6	40.5	40.7
MTG Top50 Micro F1	33.2 (MuQ-MuLan)
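
Several of the MTG rows above report micro-averaged F1 for multi-label tagging. A minimal sketch of that metric follows, assuming each clip has a set of gold labels and a set of predicted labels; how predictions were extracted from model output for the table is not covered here, and the example labels are made up.

```python
def micro_f1(gold, predicted):
    """Micro F1 over parallel lists of label sets: pool TP/FP/FN across all clips."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        gold_labels, predicted_labels = set(gold_labels), set(predicted_labels)
        tp += len(gold_labels & predicted_labels)
        fp += len(predicted_labels - gold_labels)
        fn += len(gold_labels - predicted_labels)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


gold = [{"rock", "guitar"}, {"jazz"}, {"pop", "piano"}]
predicted = [{"rock"}, {"jazz", "piano"}, {"pop", "piano"}]
# Pooled counts: TP = 4, FP = 1, FN = 1.
print(round(micro_f1(gold, predicted), 3))  # prints 0.8
```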

Datasets	Gemini-2.5-flash-thinking	InternVL-3.5-241B-A28B	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
General Visual Question Answering
MMStar	75.5	77.9	74.9	75.5
HallusionBench	61.1	57.3	62.8	63.4
MM-MT-Bench	7.8	–	8.0	8.0
Math & STEM
MMMU_val	76.9	77.7	75.6	75.0
MMMU_pro	65.8	–	60.5	60.8
MathVista_mini	77.6	82.7	80.0	81.2
MathVision_full	62.3	63.9	62.9	63.8
Document Understanding
AI2D_test	88.6	87.3	86.1	86.8
ChartQA_test	–	88.0	89.5	89.3
Counting
CountBench	88.6	–	88.6	92.5
Video Understanding
Video-MME	79.6	72.9	69.7	69.8
LVBench	64.5	–	49.0	49.5
MLVU	82.1	78.2	72.9	73.9

AudioVisual -> Text

Datasets	Previous Open-source SoTA	Gemini-2.5-Flash	Qwen2.5-Omni	Qwen3-Omni-30B-A3B-Instruct	Qwen3-Omni-Flash-Instruct
WorldSense	47.1	50.9	45.4	54.0	54.1

Datasets	Previous Open-source SoTA	Gemini-2.5-Flash-Thinking	Qwen3-Omni-30B-A3B-Thinking	Qwen3-Omni-Flash-Thinking
DailyOmni	69.8	72.7	75.8	76.2
VideoHolmes	55.6	49.5	57.3	57.3

Zero-shot Speech Generation

Datasets	Model	Performance
	Content Consistency
SEED test-zh | test-en	Seed-TTS_ICL	1.11 | 2.24
	Seed-TTS_RL	1.00 | 1.94
	MaskGCT	2.27 | 2.62
	E2 TTS	1.97 | 2.19
	F5-TTS	1.56 | 1.83
	Spark TTS	1.20 | 1.98
	CosyVoice 2	1.45 | 2.57
	CosyVoice 3	0.71 | 1.45
	Qwen2.5-Omni-7B	1.42 | 2.33
	Qwen3-Omni-30B-A3B	1.07 | 1.39

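Multilingual Speech Generation

Language	Content Consistency			Speaker Similarity
Language	Qwen3-Omni-30B-A3B	MiniMax	ElevenLabs	Qwen3-Omni-30B-A3B	MiniMax	ElevenLabs
Chinese	0.716	2.252	16.026	0.772	0.780	0.677
English	1.069	2.164	2.339	0.773	0.756	0.613
German	0.777	1.906	0.572	0.738	0.733	0.614
Italian	1.067	1.543	1.743

Setting for Evaluation

Decoding Strategy: For the Qwen3-Omni series across all evaluation benchmarks, Instruct models use greedy decoding during generation, without sampling. For Thinking models, the decoding parameters should be taken from the generation_config.json file in the checkpoint.
Benchmark-Specific Formatting: Most evaluation benchmarks come with their own ChatML formatting for embedding the question or prompt. Note that all video data are set to fps=2 during evaluation.
Default Prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings:

Task Type	Prompt
Auto Speech Recognition (ASR) for Chinese	请将这段中文语音转换为纯文本。
Auto Speech Recognition (ASR) for Other languages	Transcribe the audio into text.
Speech-to-Text Translation (S2TT)	Listen to the provided <source_language> speech and produce a translation in <target_language> text.
Song Lyrics Recognition	Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations.

System Prompt: No system prompt should be set for any evaluation benchmark.
Input Sequence: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come after the multimodal data in the sequence. For example: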

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe the audio, image and video."},
        ],
    },
]
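
The input-sequence and default-prompt conventions above can be wrapped in a small builder. This is a sketch under stated assumptions: build_eval_messages and s2tt_prompt are hypothetical helpers, the prompt strings are copied from the default-prompt table, and passing plain language names such as "French" for the placeholders is a guess about how they were filled in.

```python
DEFAULT_PROMPTS = {
    "asr_zh": "请将这段中文语音转换为纯文本。",
    "asr_other": "Transcribe the audio into text.",
    "s2tt": "Listen to the provided <source_language> speech and produce "
            "a translation in <target_language> text.",
}


def build_eval_messages(prompt, audio=None, image=None, video=None):
    """Single user turn, no system prompt, multimodal items first and the text last."""
    content = []
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def s2tt_prompt(source_language, target_language):
    """Fill the language placeholders of the S2TT default prompt."""
    return (DEFAULT_PROMPTS["s2tt"]
            .replace("<source_language>", source_language)
            .replace("<target_language>", target_language))


messages = build_eval_messages(s2tt_prompt("French", "English"), audio="/path/to/audio.wav")
print(messages)
```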
