Introduction

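On September 12, 2025, the Qwen3-Omni-30B model was officially open-sourced.
Ascend supports the Qwen3-Omni-30B model through the vLLM Ascend plugin. The vLLM Ascend plugin (vllm-ascend) is a community-maintained hardware plugin for running vLLM on Ascend NPUs.
With the vLLM Ascend plugin, popular open-source models, including Transformer-like, mixture-of-experts, embedding, and multimodal models, can run seamlessly on Ascend NPUs. The rest of this document is a deployment guide for the Qwen3-Omni-30B model.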

I. Environment Setup

Table 1. Version compatibility

Component	Version	Setup guide
Python	3.10.12	-
torch	2.8.0	-
torch_npu	2.8.0rc1	-

1 Install CANN

1.1 Obtain the CANN installation packages

CANN Community Edition download link

1.2 Install CANN

# Make the packages executable. {version} is the software version, {arch} is the CPU architecture, and {soc} is the Ascend AI processor model.
chmod +x ./Ascend-cann-toolkit_{version}_linux-{arch}.run
chmod +x ./Ascend-cann-kernels-{soc}_{version}_linux.run
chmod +x ./Ascend-cann-nnal_{version}_linux-{arch}.run
# Verify the consistency and integrity of the installation packages
./Ascend-cann-toolkit_{version}_linux-{arch}.run --check
./Ascend-cann-kernels-{soc}_{version}_linux.run --check
./Ascend-cann-nnal_{version}_linux-{arch}.run --check
# Install
./Ascend-cann-toolkit_{version}_linux-{arch}.run --install
./Ascend-cann-kernels-{soc}_{version}_linux.run --install
./Ascend-cann-nnal_{version}_linux-{arch}.run --torch_atb --install

# Set environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
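The three installer names above share the same placeholders, so it is easy to mistype one of them. The following is a minimal sketch, not part of the official tooling, that expands the placeholders and prints the commands in the order used above. The helper names and the sample values in the usage note (version, arch, soc) are assumptions for illustration only; substitute the values of the packages you actually downloaded.

```python
def cann_package_names(version: str, arch: str, soc: str) -> dict:
    """Build the three CANN installer file names from the placeholders."""
    return {
        "toolkit": f"Ascend-cann-toolkit_{version}_linux-{arch}.run",
        "kernels": f"Ascend-cann-kernels-{soc}_{version}_linux.run",
        "nnal": f"Ascend-cann-nnal_{version}_linux-{arch}.run",
    }


def install_commands(version: str, arch: str, soc: str) -> list:
    """Return the chmod, --check and install commands in the order shown above."""
    names = cann_package_names(version, arch, soc)
    order = ["toolkit", "kernels", "nnal"]
    # Only the nnal package takes the extra --torch_atb flag in this guide.
    install_flags = {
        "toolkit": "--install",
        "kernels": "--install",
        "nnal": "--torch_atb --install",
    }
    cmds = [f"chmod +x ./{names[k]}" for k in order]
    cmds += [f"./{names[k]} --check" for k in order]
    cmds += [f"./{names[k]} {install_flags[k]}" for k in order]
    return cmds
```

For example, `print("\n".join(install_commands("8.2.RC1", "aarch64", "910b")))` prints nine commands for those hypothetical values; review them before running anything.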

2 Install apt packages

apt-get update
apt-get install libnuma-dev
apt-get install ffmpeg

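3 Install vllm

3.1 Download

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm

3.2 Modify the build scripts

3.2.1 Modify setup.py: 131

Change: add the line num_jobs = 8 just before the return statement. This caps build parallelism and avoids build failures caused by insufficient resources. Mind the indentation when adding the line.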

    def compute_num_jobs(self):
        # `num_jobs` is either the value of the MAX_JOBS environment variable
        # (if defined) or the number of CPUs available.
        num_jobs = envs.MAX_JOBS
        if num_jobs is not None:
            num_jobs = int(num_jobs)
            logger.info("Using MAX_JOBS=%d as the number of jobs.", num_jobs)
        else:
            try:
                # os.sched_getaffinity() isn't universally available, so fall
                #  back to os.cpu_count() if we get an error here.
                num_jobs = len(os.sched_getaffinity(0))
            except AttributeError:
                num_jobs = os.cpu_count()

        nvcc_threads = None
        if _is_cuda() and get_nvcc_cuda_version() >= Version("11.2"):
            # `nvcc_threads` is either the value of the NVCC_THREADS
            # environment variable (if defined) or 1.
            # when it is set, we reduce `num_jobs` to avoid
            # overloading the system.
            nvcc_threads = envs.NVCC_THREADS
            if nvcc_threads is not None:
                nvcc_threads = int(nvcc_threads)
                logger.info(
                    "Using NVCC_THREADS=%d as the number of nvcc threads.",
                    nvcc_threads)
            else:
                nvcc_threads = 1
            num_jobs = max(1, num_jobs // nvcc_threads)

        num_jobs = 8  # added: cap the number of parallel build jobs
        return num_jobs, nvcc_threads
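According to the comment in the function above, the job count is taken from the MAX_JOBS environment variable when it is defined. The sketch below mirrors that logic in a standalone function to show how an environment value and a hard cap interact. `resolve_num_jobs` is a hypothetical name, the CUDA/nvcc branch is omitted, and the cap uses `min()` rather than overwriting the value as the patch does, so treat it as an illustration rather than a replacement for the edit.

```python
import os


def resolve_num_jobs(environ=None, cap=None):
    """Mirror the job-count logic of compute_num_jobs (CUDA branch omitted).

    environ: mapping to read MAX_JOBS from (defaults to os.environ).
    cap:     optional upper bound, similar in spirit to the hard-coded 8.
    """
    environ = os.environ if environ is None else environ
    max_jobs = environ.get("MAX_JOBS")
    if max_jobs is not None:
        num_jobs = int(max_jobs)
    else:
        try:
            # Not available on every platform, hence the fallback.
            num_jobs = len(os.sched_getaffinity(0))
        except AttributeError:
            num_jobs = os.cpu_count() or 1
    if cap is not None:
        num_jobs = min(num_jobs, cap)
    return max(1, num_jobs)
```

Under this reading, exporting MAX_JOBS=8 before the build would be expected to give the same job count on a non-CUDA build without editing setup.py; this has not been verified against the branch used here.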

3.2.2 Modify cmake/cpu_extension.cmake: 171

Change: replace the GitHub URL in the two GIT_REPOSITORY entries with a mirror, to avoid download failures from GitHub.

if (AVX512_FOUND AND NOT AVX512_DISABLED)
    FetchContent_Declare(
        oneDNN
        GIT_REPOSITORY https://githubfast.com/oneapi-src/oneDNN.git
        GIT_TAG  v3.7.1
        GIT_PROGRESS TRUE
        GIT_SHALLOW TRUE
    )

    set(ONEDNN_LIBRARY_TYPE "STATIC")
    set(ONEDNN_BUILD_DOC "OFF")
    set(ONEDNN_BUILD_EXAMPLES "OFF")
    set(ONEDNN_BUILD_TESTS "OFF")
    set(ONEDNN_ENABLE_WORKLOAD "INFERENCE")
    set(ONEDNN_ENABLE_PRIMITIVE "MATMUL;REORDER")
    set(ONEDNN_BUILD_GRAPH "OFF")
    set(ONEDNN_ENABLE_JIT_PROFILING "OFF")
    set(ONEDNN_ENABLE_ITT_TASKS "OFF")
    set(ONEDNN_ENABLE_MAX_CPU_ISA "OFF")
    set(ONEDNN_ENABLE_CPU_ISA_HINTS "OFF")
    set(CMAKE_POLICY_DEFAULT_CMP0077 NEW)

    FetchContent_MakeAvailable(oneDNN)

    list(APPEND LIBS dnnl)
elseif(POWER10_FOUND)
    FetchContent_Declare(
        oneDNN
        GIT_REPOSITORY https://githubfast.com/oneapi-src/oneDNN.git
        GIT_TAG v3.7.2
        GIT_PROGRESS TRUE
        GIT_SHALLOW TRUE
    )
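The two URL edits can also be applied with a small script instead of by hand. The sketch below assumes the unmodified file points its GIT_REPOSITORY entries at github.com; `use_mirror` is a hypothetical helper, and the default mirror host is simply the one used in the excerpt above.

```python
import re

# Matches GIT_REPOSITORY lines whose URL host is github.com.
_GIT_REPO = re.compile(r"(GIT_REPOSITORY\s+https://)github\.com/")


def use_mirror(cmake_text: str, mirror_host: str = "githubfast.com"):
    """Replace github.com with mirror_host in GIT_REPOSITORY lines.

    Returns a (rewritten_text, replacement_count) tuple.
    """
    return _GIT_REPO.subn(lambda m: m.group(1) + mirror_host + "/", cmake_text)
```

To use it, read cmake/cpu_extension.cmake, pass the text through `use_mirror`, confirm that the returned count is 2, and write the result back.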

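3.3 Run the installation

pip install -e .

If the build exits without reporting a reason, increase the available memory to 500 GB or more.

3.4 Check the version after installation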

[root:transformers]$ pip show vllm
Name: vllm
Version: 0.9.3.dev6+gca66cbff0.d20251101
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License-Expression: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Editable project location: /workspace/vllm

4 Install the latest Transformers

4.1 Download

git clone https://github.com/huggingface/transformers

4.2 Uninstall the old version

pip uninstall transformers

4.3 Install

cd transformers/
git reset --hard a0bf5a82ee
pip install -e .

4.4 Check the version after installation

[root:transformers]$ pip show transformers
Name: transformers
Version: 5.0.0.dev0
Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages

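5 Install dependencies

5.1 Install dependencies directly

Run the following commands:

pip install accelerate
pip install qwen-omni-utils -U

5.2 Install dependencies with pinned versions

5.2.1 Create the requirements file with pinned versions

Run:

vi omni_requirements.txt

Enter the following content, then save and exit: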

torch==2.8.0
torch_npu==2.8.0rc1
torchaudio==2.8.0
torchvision==0.23.0
vllm-ascend==0.9.2rc1

5.2.2 Run

pip install -r omni_requirements.txt --no-deps

5.3 Installation on Intel hosts

If the server uses an Intel CPU, run the following command:

pip install intel_extension_for_pytorch==2.8.0

5.4 Modify the vllm-ascend code

/usr/local/lib/python3.10/dist-packages/vllm_ascend/worker/model_runner_v1.py: 1085

Replace "inputs_embeds)" with "inputs_embeds[0] if isinstance(inputs_embeds, tuple) else inputs_embeds) #inputs_embeds", so that only the first element is copied when get_input_embeddings returns a tuple:

        if self.is_multimodal_model:
            # NOTE(woosuk): To unify token ids and soft tokens (vision
            # embeddings), we always use embeddings (rather than token ids)
            # as input to the multimodal model, even when the input is text.
            input_ids = self.input_ids[:total_num_scheduled_tokens]
            if mm_embeds:
                inputs_embeds = self.model.get_input_embeddings(
                    input_ids, mm_embeds)
            else:
                inputs_embeds = self.model.get_input_embeddings(input_ids)
            # TODO(woosuk): Avoid the copy. Optimize.
            self.inputs_embeds[:total_num_scheduled_tokens].copy_(
                inputs_embeds[0] if isinstance(inputs_embeds, tuple) else inputs_embeds) #inputs_embeds    
            inputs_embeds = self.inputs_embeds[:num_input_tokens]
            input_ids = None
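The one-line patch is a plain "unwrap if tuple" guard. The sketch below isolates that expression so its behavior can be checked without an NPU; `first_if_tuple` is a hypothetical name, and plain lists stand in for tensors. Why get_input_embeddings may return a tuple for this model is not stated in this guide, so no claim is made about what the remaining tuple elements contain.

```python
def first_if_tuple(value):
    """Return value[0] when value is a tuple, otherwise return value unchanged."""
    return value[0] if isinstance(value, tuple) else value


# Plain lists stand in for embedding tensors in this illustration.
embeds_only = [0.1, 0.2, 0.3]
embeds_with_extras = ([0.1, 0.2, 0.3], "extra outputs")

unwrapped_a = first_if_tuple(embeds_only)
unwrapped_b = first_if_tuple(embeds_with_extras)
```

Both calls yield the same list, which is what lets the subsequent copy_ call work for either return shape.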

5.5 Check versions after installation

Run pip list. The versions of the key dependencies are as follows:

Package	Version
intel_extension_for_pytorch	2.8.0
torch	2.8.0+cpu
torch_npu	2.8.0rc1
torchaudio	2.8.0+cpu
torchvision	0.23.0+cpu
vllm	0.9.3.dev6+gca66cbff0.d20251101
vllm-ascend	0.9.2rc1
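Comparing the pip list output against the pinned versions by eye is error-prone, so a scripted check may help. This is a minimal sketch using the standard-library importlib.metadata; the EXPECTED pins are copied from omni_requirements.txt above, `check_versions` is a hypothetical helper, and local-version suffixes such as "+cpu" are ignored on the assumption that they do not matter for this comparison.

```python
from importlib import metadata

# Pins taken from omni_requirements.txt in this guide.
EXPECTED = {
    "torch": "2.8.0",
    "torch_npu": "2.8.0rc1",
    "torchaudio": "2.8.0",
    "torchvision": "0.23.0",
    "vllm-ascend": "0.9.2rc1",
}


def check_versions(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # Strip a local-version suffix such as "+cpu" before comparing.
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems
```

Calling `check_versions(EXPECTED)` in the target environment returns an empty dict when every pinned package matches.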

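II. Download the Weights

1 Choose a model

1.1 Go to the ModelScope community

https://modelscope.cn/home

1.2 Search for "Qwen3-Omni-30B"

Model Name	Description
Qwen3-Omni-30B-A3B-Instruct	The instruct model of Qwen3-Omni-30B-A3B. It contains both the thinker and the talker, accepts audio, video, and text input, and produces audio and text output.
Qwen3-Omni-30B-A3B-Thinking	The thinking model of Qwen3-Omni-30B-A3B. It contains the thinker component, supports chain-of-thought reasoning, accepts audio, video, and text input, and produces text output.
Qwen3-Omni-30B-A3B-Captioner	A downstream fine-grained audio captioning model fine-tuned from Qwen3-Omni-30B-A3B-Instruct. It generates detailed, low-hallucination captions for arbitrary audio input. It contains the thinker, accepts audio input, and produces text output.

Choose the model that fits your needs. The rest of this guide uses Qwen3-Omni-30B-A3B-Instruct as the example.

1.3 Copy the download command; it is modified in a later step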

modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct

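2 Install the tool and download on the Ascend server

2.1 Install modelscope

pip install modelscope

2.2 Change to the root directory where models are stored

2.3 Run the command to download the model

Append the download-directory option --local_dir to the command copied from ModelScope, then run it: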

modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct
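The same download can be driven from Python, which is convenient inside provisioning scripts. This sketch assumes an installed modelscope version whose `snapshot_download` accepts a `local_dir` argument; `default_local_dir` and `download` are hypothetical helper names, and the directory naming simply follows the convention of the command above.

```python
from pathlib import Path


def default_local_dir(model_id: str, root: str = ".") -> str:
    """Map 'Qwen/Qwen3-Omni-30B-A3B-Instruct' to a directory named after the model under root."""
    return str(Path(root) / model_id.split("/")[-1])


def download(model_id: str, root: str = ".") -> str:
    """Download model_id into a directory named after the model under root."""
    # Imported lazily so the helper above works without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id, local_dir=default_local_dir(model_id, root))
```

Calling `download("Qwen/Qwen3-Omni-30B-A3B-Instruct")` from the model root directory should place the weights in ./Qwen3-Omni-30B-A3B-Instruct, matching the path passed to vllm serve later.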

III. Running the Model

1 Start the service

Run the command:

vllm serve ./Qwen3-Omni-30B-A3B-Instruct/ \
--served-model-name qwen3-Omni \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--enforce-eager \
--port 8106

Note that tensor-parallel-size must not be less than 2.
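Loading a 30B model takes a while, so test requests sent right after launching the command can fail simply because the server is not listening yet. The following sketch polls the OpenAI-compatible /v1/models endpoint until the served model name appears. The function names, timeouts, and the injectable `opener` and `sleep` parameters are assumptions made for illustration and testability.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url, opener=urllib.request.urlopen):
    """Return the model ids reported by GET {base_url}/v1/models."""
    with opener(f"{base_url}/v1/models", timeout=5) as resp:
        body = json.load(resp)
    return [m["id"] for m in body.get("data", [])]


def wait_until_ready(base_url, name, timeout_s=600, interval_s=5,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll until `name` is listed by /v1/models or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if name in list_models(base_url, opener):
                return True
        except (urllib.error.URLError, ConnectionError):
            pass  # server is not accepting connections yet
        sleep(interval_s)
    return False
```

With the command above, `wait_until_ready("http://127.0.0.1:8106", "qwen3-Omni")` returns True once the service is up, or False after the timeout.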

2 Basic functional tests

2.1 Local: text

curl http://127.0.0.1:8106/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-Omni",
"messages": [
{"role": "system", "content": "You are a helpful language and speech assistant."},
{"role": "user", "content": "你是谁."}
],
"temperature": 0.7,
"max_tokens": 2048
}'

Sample model output: {"id":"chatcmpl-7df562f798ee469d81b25bab1c845ff8","object":"chat.completion","created":1761979861,"model":"qwen3-Omni","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"我是阿里云研发的多模态超大规模语言模型，我叫通义千问Omni。有什么我可以帮助你的吗？","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":54,"completion_tokens":29,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
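The same text request can be sent from Python using only the standard library, which avoids shell quoting problems with non-ASCII prompts. This is a minimal sketch equivalent to the curl call above; `build_chat_payload` and `post_chat` are hypothetical helper names, and the defaults mirror the values used in the curl example.

```python
import json
import urllib.request


def build_chat_payload(model, user_text,
                       system_text="You are a helpful language and speech assistant.",
                       temperature=0.7, max_tokens=2048):
    """Build the JSON body for a text-only /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def post_chat(base_url, payload, timeout=120):
    """POST payload to {base_url}/v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `post_chat("http://127.0.0.1:8106", build_chat_payload("qwen3-Omni", "你是谁."))` sends the request shown above and returns the response as a dict.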

2.2 Local: audio

curl http://127.0.0.1:8106/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-Omni",
"messages": [
{"role": "system", "content": "You are a helpful language and speech assistant."},
{"role": "user", "content": [   
    {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"}},
    {"type": "text", "text": "Caption the audio."}
]}    
],
"temperature": 0.7,
"max_tokens": 2048
}'

Sample model output: {"id":"chatcmpl-bdbcc4e24cab48989ab25983dfb92c94","object":"chat.completion","created":1761980287,"model":"qwen3-Omni","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"一个人正在咳嗽。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":66,"total_tokens":73,"completion_tokens":7,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
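The audio example above fetches the file from a public URL, which does not work on servers without internet access. Upstream vLLM documents base64 data URLs in the audio_url field; whether the branch used in this guide accepts them has not been verified here, so treat the following as a sketch under that assumption. The helper names and the default MIME type are illustrative.

```python
import base64


def audio_part_from_bytes(raw: bytes, mime: str = "audio/wav") -> dict:
    """Wrap raw audio bytes as an audio_url content part using a data URL."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {"type": "audio_url",
            "audio_url": {"url": f"data:{mime};base64,{encoded}"}}


def audio_part_from_file(path: str, mime: str = "audio/wav") -> dict:
    """Read a local audio file and wrap it as an audio_url content part."""
    with open(path, "rb") as f:
        return audio_part_from_bytes(f.read(), mime)
```

The returned dict can replace the audio_url entry in the user message's content list, next to the {"type": "text", ...} part.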

2.3 Local: image + audio

curl http://127.0.0.1:8106/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "qwen3-Omni",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
        {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
    ]}
    ]
    }'

Sample model output: {"id":"chatcmpl-c99b7a9be6f0452fbc1b49c0168f381e","object":"chat.completion","created":1761980331,"model":"qwen3-Omni","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"图片中展示了四辆豪华和运动型汽车——一辆白色劳斯莱斯、一辆灰色奔驰GLE SUV、一辆红色法拉利Portofino M和一辆白色保时捷911——同时伴有一个人咳嗽的可听见声音。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":6115,"total_tokens":6167,"completion_tokens":52,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
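The raw responses above are hard to read at a glance. The sketch below pulls out the fields that matter for a smoke test: the reply text, the finish reason, and the token counts. `summarize_response` is a hypothetical helper, and the field names are the ones visible in the sample outputs.

```python
def summarize_response(resp: dict) -> dict:
    """Extract reply text, finish reason and token usage from a chat completion."""
    choice = resp["choices"][0]
    usage = resp.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }
```

A finish_reason of "stop", as in all three samples, means the model ended its answer on its own; "length" would indicate that max_tokens cut it off.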

3 vLLM support scope

vLLM serve for Qwen3-Omni currently supports only the thinker model.
