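On September 12, 2025, the Qwen3-Omni-30B model was officially open-sourced.

Ascend supports the Qwen3-Omni-30B model through the vLLM Ascend plugin. The vLLM Ascend plugin (vllm-ascend) is a community-maintained hardware plugin for running vLLM on Ascend NPUs.

With the vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Experts, embedding, and multimodal large models, can run seamlessly on Ascend NPUs. The deployment guide for the Qwen3-Omni-30B model follows.

The following software versions are used:

| Component | Version | Environment setup guide |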
|---|---|---|
| Python | 3.10.12 | - |
| torch | 2.8.0 | - |
| torch_npu | 2.8.0rc1 | - |
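
Install the CANN toolkit, kernels, and NNAL packages:

    # Make the packages executable. {version} is the software version,
    # {arch} is the CPU architecture, and {soc} is the Ascend AI processor model.
    chmod +x ./Ascend-cann-toolkit_{version}_linux-{arch}.run
    chmod +x ./Ascend-cann-kernels-{soc}_{version}_linux.run
    chmod +x ./Ascend-cann-nnal_{version}_linux-{arch}.run
    # Verify the consistency and integrity of the installation packages
    ./Ascend-cann-toolkit_{version}_linux-{arch}.run --check
    ./Ascend-cann-kernels-{soc}_{version}_linux.run --check
    ./Ascend-cann-nnal_{version}_linux-{arch}.run --check
    # Install
    ./Ascend-cann-toolkit_{version}_linux-{arch}.run --install
    ./Ascend-cann-kernels-{soc}_{version}_linux.run --install
    ./Ascend-cann-nnal_{version}_linux-{arch}.run --torch_atb --install
    # Set environment variables
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

Install the system dependencies:

    apt-get update
    apt-get install libnuma-dev
    apt-get install ffmpeg

Clone the vLLM branch that carries Qwen3-Omni support:

    git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
    cd vllm

Modification note: in `setup.py`, add the line `num_jobs = 8` just before the `return` statement of `compute_num_jobs`. This avoids build failures caused by insufficient resources. Take care to match the indentation of the surrounding code when adding it.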

    def compute_num_jobs(self):
        # `num_jobs` is either the value of the MAX_JOBS environment variable
        # (if defined) or the number of CPUs available.
        num_jobs = envs.MAX_JOBS
        if num_jobs is not None:
            num_jobs = int(num_jobs)
            logger.info("Using MAX_JOBS=%d as the number of jobs.", num_jobs)
        else:
            try:
                # os.sched_getaffinity() isn't universally available, so fall
                # back to os.cpu_count() if we get an error here.
                num_jobs = len(os.sched_getaffinity(0))
            except AttributeError:
                num_jobs = os.cpu_count()

        nvcc_threads = None
        if _is_cuda() and get_nvcc_cuda_version() >= Version("11.2"):
            # `nvcc_threads` is either the value of the NVCC_THREADS
            # environment variable (if defined) or 1.
            # when it is set, we reduce `num_jobs` to avoid
            # overloading the system.
            nvcc_threads = envs.NVCC_THREADS
            if nvcc_threads is not None:
                nvcc_threads = int(nvcc_threads)
                logger.info(
                    "Using NVCC_THREADS=%d as the number of nvcc threads.",
                    nvcc_threads)
            else:
                nvcc_threads = 1
            num_jobs = max(1, num_jobs // nvcc_threads)

        num_jobs = 8  # added line: cap the number of parallel build jobs
        return num_jobs, nvcc_threads

Modification note: in `cmake/cpu_extension.cmake`, change the GitHub URL in two places (github.com becomes githubfast.com) to avoid download failures from GitHub.

    if (AVX512_FOUND AND NOT AVX512_DISABLED)
        FetchContent_Declare(
            oneDNN
            GIT_REPOSITORY https://githubfast.com/oneapi-src/oneDNN.git
            GIT_TAG v3.7.1
            GIT_PROGRESS TRUE
            GIT_SHALLOW TRUE
        )
        set(ONEDNN_LIBRARY_TYPE "STATIC")
        set(ONEDNN_BUILD_DOC "OFF")
        set(ONEDNN_BUILD_EXAMPLES "OFF")
        set(ONEDNN_BUILD_TESTS "OFF")
        set(ONEDNN_ENABLE_WORKLOAD "INFERENCE")
        set(ONEDNN_ENABLE_PRIMITIVE "MATMUL;REORDER")
        set(ONEDNN_BUILD_GRAPH "OFF")
        set(ONEDNN_ENABLE_JIT_PROFILING "OFF")
        set(ONEDNN_ENABLE_ITT_TASKS "OFF")
        set(ONEDNN_ENABLE_MAX_CPU_ISA "OFF")
        set(ONEDNN_ENABLE_CPU_ISA_HINTS "OFF")
        set(CMAKE_POLICY_DEFAULT_CMP0077 NEW)
        FetchContent_MakeAvailable(oneDNN)
        list(APPEND LIBS dnnl)
    elseif(POWER10_FOUND)
        FetchContent_Declare(
            oneDNN
            GIT_REPOSITORY https://githubfast.com/oneapi-src/oneDNN.git
            GIT_TAG v3.7.2
            GIT_PROGRESS TRUE
            GIT_SHALLOW TRUE
        )

Then build and install vLLM from the source tree:

    pip install -e .

If the build exits without reporting a reason, increase the available memory to 500 GB or more.
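
The `compute_num_jobs` listing above reads the `MAX_JOBS` environment variable before it falls back to the CPU count, so the job limit can likely also be set from the environment instead of being hard-coded. The following is a minimal sketch under that assumption. `build_env` and `install_editable` are hypothetical helper names, and the value 8 only mirrors the `num_jobs = 8` line shown earlier.

```python
import os
import subprocess
import sys


def build_env(max_jobs=8, base_env=None):
    """Return a copy of the environment with MAX_JOBS set for the build."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["MAX_JOBS"] = str(max_jobs)
    return env


def install_editable(source_dir, max_jobs=8):
    """Run `pip install -e .` inside source_dir with a capped job count."""
    cmd = [sys.executable, "-m", "pip", "install", "-e", "."]
    result = subprocess.run(cmd, cwd=source_dir, env=build_env(max_jobs))
    return result.returncode


# Example (not executed here): install_editable("./vllm", max_jobs=8)
```

A non-zero return code from `install_editable` means the build failed. Lowering `max_jobs` further trades build time for a smaller peak memory footprint.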

Check the installed vLLM package:

    [root:transformers]$ pip show vllm
    Name: vllm
    Version: 0.9.3.dev6+gca66cbff0.d20251101
    Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
    Home-page: https://github.com/vllm-project/vllm
    Author: vLLM Team
    Author-email:
    License-Expression: Apache-2.0
    Location: /usr/local/lib/python3.10/dist-packages
    Editable project location: /workspace/vllm

Install transformers from source at the pinned commit:

    git clone https://github.com/huggingface/transformers
    pip uninstall transformers
    cd transformers/
    git reset --hard a0bf5a82ee
    pip install -e .

Check the installed transformers package:

    [root:transformers]$ pip show transformers
    Name: transformers
    Version: 5.0.0.dev0
    Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
    Home-page: https://github.com/huggingface/transformers
    Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
    Author-email: transformers@huggingface.co
    License: Apache 2.0 License
    Location: /usr/local/lib/python3.10/dist-packages

Run the following commands:

    pip install accelerate
    pip install qwen-omni-utils -U

Create the requirements file:

    vi omni_requirements.txt

Enter the following content, then save and exit:

    torch==2.8.0
    torch_npu==2.8.0rc1
    torchaudio==2.8.0
    torchvision==0.23.0
    vllm-ascend==0.9.2rc1

Install the pinned packages without pulling in their dependencies:

    pip install -r omni_requirements.txt --no-deps

If the server uses an Intel CPU, also run:

    pip install intel_extension_for_pytorch==2.8.0

In /usr/local/lib/python3.10/dist-packages/vllm_ascend/worker/model_runner_v1.py, at line 1085, replace `inputs_embeds)` with `inputs_embeds[0] if isinstance(inputs_embeds, tuple) else inputs_embeds) #inputs_embeds`. After the change, the surrounding code reads:

    if self.is_multimodal_model:
        # NOTE(woosuk): To unify token ids and soft tokens (vision
        # embeddings), we always use embeddings (rather than token ids)
        # as input to the multimodal model, even when the input is text.
        input_ids = self.input_ids[:total_num_scheduled_tokens]
        if mm_embeds:
            inputs_embeds = self.model.get_input_embeddings(
                input_ids, mm_embeds)
        else:
            inputs_embeds = self.model.get_input_embeddings(input_ids)
        # TODO(woosuk): Avoid the copy. Optimize.
        self.inputs_embeds[:total_num_scheduled_tokens].copy_(
            inputs_embeds[0] if isinstance(inputs_embeds, tuple) else inputs_embeds) #inputs_embeds
        inputs_embeds = self.inputs_embeds[:num_input_tokens]
        input_ids = None

Run `pip list` to check the versions of the key dependencies. They should be as follows:

| Package | Version |
|---|---|
| intel_extension_for_pytorch | 2.8.0 |
| torch | 2.8.0+cpu |
| torch_npu | 2.8.0rc1 |
| torchaudio | 2.8.0+cpu |
| torchvision | 0.23.0+cpu |
| vllm | 0.9.3.dev6+gca66cbff0.d20251101 |
| vllm-ascend | 0.9.2rc1 |
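
The comparison against the table above can be automated. The sketch below is one possible way to do it with the standard-library `importlib.metadata`. The names `EXPECTED`, `public_version`, and `find_mismatches` are hypothetical, and ignoring local version labels such as `+cpu` is an assumption made so that `2.8.0+cpu` counts as matching the pinned `2.8.0`.

```python
from importlib import metadata

# Pinned versions from omni_requirements.txt; local labels such as "+cpu"
# are ignored when comparing (an assumption of this sketch).
EXPECTED = {
    "torch": "2.8.0",
    "torch_npu": "2.8.0rc1",
    "torchaudio": "2.8.0",
    "torchvision": "0.23.0",
    "vllm-ascend": "0.9.2rc1",
}


def public_version(version):
    """Drop the local version label, e.g. '2.8.0+cpu' -> '2.8.0'."""
    return version.split("+", 1)[0]


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for each mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            mismatches[name] = (wanted, None)
            continue
        if public_version(installed) != public_version(wanted):
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every pinned package is installed at the expected version. `intel_extension_for_pytorch` is left out of `EXPECTED` because it is only installed on Intel servers.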
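
The following Qwen3-Omni-30B-A3B models are available:

| Model Name | Description |
|---|---|
| Qwen3-Omni-30B-A3B-Instruct | The instruct model of Qwen3-Omni-30B-A3B. It contains both the thinker and the talker, accepts audio, video, and text input, and produces audio and text output. |
| Qwen3-Omni-30B-A3B-Thinking | The thinking model of Qwen3-Omni-30B-A3B. It contains the thinker component, has chain-of-thought reasoning capability, accepts audio, video, and text input, and produces text output. |
| Qwen3-Omni-30B-A3B-Captioner | A downstream fine-grained audio captioning model fine-tuned from Qwen3-Omni-30B-A3B-Instruct. It generates detailed, low-hallucination captions for arbitrary audio input. It contains the thinker, accepts audio input, and produces text output. |

Choose a model according to your needs. The rest of this guide uses Qwen3-Omni-30B-A3B-Instruct as the example.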
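
The download command copied from the ModelScope model page is:

    modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct

Install the modelscope package:

    pip install modelscope

Add the download directory parameter `--local_dir` to the command copied from ModelScope, then run it:

    modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct --local_dir ./Qwen3-Omni-30B-A3B-Instruct

Start the service:

    vllm serve ./Qwen3-Omni-30B-A3B-Instruct/ \
        --served-model-name qwen3-Omni \
        --tensor-parallel-size 2 \
        --dtype bfloat16 \
        --enforce-eager \
        --port 8106

Note that `tensor-parallel-size` must not be less than 2.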
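
When the service is launched from a script, the constraint on `tensor-parallel-size` can be checked before the process starts. The sketch below only assembles the same command line as above. `build_serve_command` is a hypothetical helper, not part of vLLM, and it uses only the flags already shown in this guide.

```python
import shlex


def build_serve_command(model_path, served_name="qwen3-Omni",
                        tensor_parallel_size=2, port=8106):
    """Assemble the `vllm serve` argument list used in this guide."""
    if tensor_parallel_size < 2:
        # The guide states that tensor-parallel-size must not be less than 2.
        raise ValueError("tensor-parallel-size must not be less than 2")
    return [
        "vllm", "serve", model_path,
        "--served-model-name", served_name,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--dtype", "bfloat16",
        "--enforce-eager",
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_serve_command("./Qwen3-Omni-30B-A3B-Instruct/")
    # Print a copy-pasteable shell command; pass `cmd` to subprocess to run it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.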

Send a text-only request (the user prompt "你是谁." means "Who are you?"):

    curl http://127.0.0.1:8106/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "qwen3-Omni",
            "messages": [
                {"role": "system", "content": " You are a helpful language and speech assistant."},
                {"role": "user", "content": "你是谁."}
            ],
            "temperature": 0.7,
            "max_tokens": 2048
        }'

Sample model output:

    {"id":"chatcmpl-7df562f798ee469d81b25bab1c845ff8","object":"chat.completion","created":1761979861,"model":"qwen3-Omni","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"我是阿里云研发的多模态超大规模语言模型,我叫通义千问Omni。有什么我可以帮助你的吗?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":54,"completion_tokens":29,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
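
The same text request can be sent from Python using only the standard library. This is a minimal sketch rather than an official client. `build_payload`, `extract_reply`, and `chat` are hypothetical helper names, and the URL and model name are the ones used in the curl example above.

```python
import json
import urllib.request

CHAT_URL = "http://127.0.0.1:8106/v1/chat/completions"


def build_payload(user_text, model="qwen3-Omni",
                  system_text="You are a helpful language and speech assistant.",
                  temperature=0.7, max_tokens=2048):
    """Build the JSON body for a text-only chat completion request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Return the assistant text from a chat.completion response dict."""
    return response["choices"][0]["message"]["content"]


def chat(user_text, url=CHAT_URL, timeout=120):
    """POST the request to the running server and return the reply text."""
    body = json.dumps(build_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.loads(resp.read().decode("utf-8")))


# Example (requires the server started with `vllm serve`): chat("你是谁.")
```

`extract_reply` reads the same `choices[0].message.content` field that appears in the sample output above.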

Send a request with audio input:

    curl http://127.0.0.1:8106/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "qwen3-Omni",
            "messages": [
                {"role": "system", "content": "You are a helpful language and speech assistant."},
                {"role": "user", "content": [
                    {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"}},
                    {"type": "text", "text": "Caption the audio."}
                ]}
            ],
            "temperature": 0.7,
            "max_tokens": 2048
        }'

Sample model output:

    {"id":"chatcmpl-bdbcc4e24cab48989ab25983dfb92c94","object":"chat.completion","created":1761980287,"model":"qwen3-Omni","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"一个人正在咳嗽。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":66,"total_tokens":73,"completion_tokens":7,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
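
The request above points `audio_url` at a public HTTPS file. For a local recording, one option may be to embed the audio as a base64 `data:` URL. This sketch assumes the server accepts such URLs in `audio_url`, as upstream vLLM documents for its OpenAI-compatible server; that has not been verified on the branch used in this guide. `bytes_to_data_url`, `file_to_data_url`, and `audio_message` are hypothetical helper names.

```python
import base64


def bytes_to_data_url(raw, mime="audio/wav"):
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(path, mime="audio/wav"):
    """Read a local audio file and return it as a data URL."""
    with open(path, "rb") as handle:
        return bytes_to_data_url(handle.read(), mime)


def audio_message(audio_url, prompt):
    """Build a user message with one audio part followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": audio_url}},
            {"type": "text", "text": prompt},
        ],
    }


# Example (hypothetical local file):
# message = audio_message(file_to_data_url("cough.wav"), "Caption the audio.")
```

Base64 encoding grows the payload by roughly one third, so long recordings are usually better served from a URL the server can reach.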

Send a request that combines image and audio input:

    curl http://127.0.0.1:8106/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "qwen3-Omni",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": [
                    {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
                    {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
                    {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
                ]}
            ]
        }'

Sample model output:

    {"id":"chatcmpl-c99b7a9be6f0452fbc1b49c0168f381e","object":"chat.completion","created":1761980331,"model":"qwen3-Omni","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"图片中展示了四辆豪华和运动型汽车——一辆白色劳斯莱斯、一辆灰色奔驰GLE SUV、一辆红色法拉利Portofino M和一辆白色保时捷911——同时伴有一个人咳嗽的可听见声音。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":6115,"total_tokens":6167,"completion_tokens":52,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
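
The mixed-modality request body above follows a regular pattern: media parts first, then the text prompt. A small builder makes that pattern reusable. This is an illustrative sketch. `build_user_content` and `build_request` are hypothetical names, and placing the media parts before the text simply copies the order used in the curl example.

```python
import json


def build_user_content(text, image_urls=(), audio_urls=()):
    """Return the content parts list: images, then audio, then the text."""
    parts = [{"type": "image_url", "image_url": {"url": url}}
             for url in image_urls]
    parts += [{"type": "audio_url", "audio_url": {"url": url}}
              for url in audio_urls]
    parts.append({"type": "text", "text": text})
    return parts


def build_request(text, image_urls=(), audio_urls=(), model="qwen3-Omni",
                  system_text="You are a helpful assistant."):
    """Build the full JSON-serializable request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user",
             "content": build_user_content(text, image_urls, audio_urls)},
        ],
    }


if __name__ == "__main__":
    base = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo"
    body = build_request(
        "What can you see and hear? Answer in one sentence.",
        image_urls=[f"{base}/cars.jpg"],
        audio_urls=[f"{base}/cough.wav"],
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be passed to `curl -d` or posted with any HTTP client.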

Note: for Qwen3-Omni, `vllm serve` currently supports only the thinker model.