1. Prepare the runtime environment

1.1 Environment

Item	Value
Hardware	Atlas A2 910B4 (64G)
Driver version	25.2.3
CANN version	8.5.1
Inference framework	vllm-ascend
Inference image	quay.io/ascend/vllm-ascend:v0.17.0rc1
Deployment	2 cards (tensor parallel size 2)

1.2 Obtaining the image

This guide uses the officially provided image, pulled with docker pull.

Example commands (v0.17.0rc1 is the image tag; change it as needed):

# Option 1:
docker pull quay.io/ascend/vllm-ascend:v0.17.0rc1

# Option 2:
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.17.0rc1

# Option 3:
docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.17.0rc1
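
The three options fetch the same image through different registries, so they can be tried in order. The following is a minimal sketch, assuming a hypothetical helper named pull_with_fallback (it is not part of docker or vllm-ascend) and the three registry prefixes listed above:

```shell
# Hypothetical helper: try each registry in turn until one pull succeeds.
# The registry prefixes are the three listed in this guide.
IMAGE_PATH="ascend/vllm-ascend:v0.17.0rc1"
REGISTRIES="quay.io m.daocloud.io/quay.io quay.nju.edu.cn"

pull_with_fallback() {
    for registry in $REGISTRIES; do
        echo "Trying ${registry}/${IMAGE_PATH}"
        if docker pull "${registry}/${IMAGE_PATH}"; then
            # Report the reference that worked so it can be reused later.
            echo "Pulled ${registry}/${IMAGE_PATH}"
            return 0
        fi
    done
    echo "All registries failed" >&2
    return 1
}

# Usage: pull_with_fallback
```

An image pulled through a mirror is stored locally under the mirror's name, so either set IMAGE in section 3.1 to the reference that was actually pulled or retag it with docker tag.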

To pull the image for a specific architecture, use a command like the following:

docker pull --platform arm64 quay.io/ascend/vllm-ascend:v0.17.0rc1
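
If the host architecture is not known in advance, the platform value can be derived from uname -m. This is a sketch only: docker_platform_for is a hypothetical helper, it emits the os/arch form (linux/arm64) rather than the bare arm64 used above, and whether the tag publishes an image for a given platform is not stated in this guide.

```shell
# Hypothetical helper: map `uname -m` output to a docker --platform value.
docker_platform_for() {
    case "$1" in
        aarch64|arm64) echo "linux/arm64" ;;
        x86_64|amd64)  echo "linux/amd64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   docker pull --platform "$(docker_platform_for "$(uname -m)")" \
#       quay.io/ascend/vllm-ascend:v0.17.0rc1
```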

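If this approach does not work for you, refer to the official documentation for manual installation.

1.3 Dependency versions

Component	Version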
python	3.11.14
torch	2.9.0+cpu
torch_npu	2.9.0
vllm	0.17.0
vllm-ascend	0.17.0rc1
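
To confirm that a container matches the table, the installed packages can be compared against it. The sketch below assumes a hypothetical helper named check_versions that reads the name==version lines printed by pip list --format=freeze; the Python version is not covered, and a reported mismatch (for example a local build suffix) does not by itself mean the image is broken.

```shell
# Hypothetical helper: compare `name==version` lines on stdin against the
# versions in the table. Names are lowercased and "_" is treated as "-"
# because pip may report either spelling.
EXPECTED="torch==2.9.0+cpu torch-npu==2.9.0 vllm==0.17.0 vllm-ascend==0.17.0rc1"

check_versions() {
    installed=$(tr 'A-Z_' 'a-z-')
    status=0
    for want in $EXPECTED; do
        if printf '%s\n' "$installed" | grep -qxF "$want"; then
            echo "OK       $want"
        else
            echo "MISMATCH $want"
            status=1
        fi
    done
    return $status
}

# Usage inside the container: pip list --format=freeze | check_versions
```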

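2. Prepare the model weights

The weights can be downloaded from the following source:
ModelScope community:

Qwen3-Omni-30B-A3B-Thinking

Example download commands:

# If modelscope is not installed in the environment, install it first
pip install modelscope

# --local_dir: change to the desired storage path, e.g. /root/.cache/models/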

modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking  --local_dir /root/.cache/models/Qwen3-Omni-30B-A3B-Thinking
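
After the download, a quick check of the target directory can catch an interrupted transfer before the server is started. This is a minimal sketch: check_model_dir is a hypothetical helper, and it assumes the checkpoint uses a Hugging Face style layout (config.json plus *.safetensors shards). It only checks that the files exist, not that every shard is present or intact.

```shell
# Hypothetical helper: sanity-check a downloaded checkpoint directory.
# Assumes a Hugging Face style layout: config.json plus *.safetensors shards.
check_model_dir() {
    dir=$1
    if [ ! -f "$dir/config.json" ]; then
        echo "missing $dir/config.json" >&2
        return 1
    fi
    # Count weight shards in the top-level directory only.
    shards=$(find "$dir" -maxdepth 1 -name '*.safetensors' | wc -l | tr -d ' ')
    if [ "$shards" -eq 0 ]; then
        echo "no .safetensors files in $dir" >&2
        return 1
    fi
    echo "found config.json and $shards weight shard(s)"
}

# Usage: check_model_dir /root/.cache/models/Qwen3-Omni-30B-A3B-Thinking
```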

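3. Deployment

3.1 Start the inference container

# Mount devices as needed. This example mounts cards 0 and 1 (Atlas A2: /dev/davinci[0-7]; Atlas A3: /dev/davinci[0-15])
# The mounted directory must contain the weights path; this example uses /root/.cache, adjust it to your setup

# Set the container name
export CONTAINER_NAME=Qwen3-omni
# Select the image
export IMAGE=quay.io/ascend/vllm-ascend:v0.17.0rc1
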
docker run --rm \
    --name $CONTAINER_NAME \
    --shm-size=50g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
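
When the set of cards changes, the --device flags for the NPUs can be generated instead of edited by hand. The sketch below uses a hypothetical helper named davinci_device_args; the number of cards passed to it should match ASCEND_RT_VISIBLE_DEVICES and --tensor-parallel-size in section 3.2.

```shell
# Hypothetical helper: print one `--device /dev/davinciN` flag per card index.
davinci_device_args() {
    args=""
    for idx in "$@"; do
        args="$args --device /dev/davinci$idx"
    done
    # Strip the leading space added by the loop.
    echo "${args# }"
}

# Usage (unquoted on purpose so the flags split into separate words):
#   docker run --rm $(davinci_device_args 0 1) --device /dev/davinci_manager ...
```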

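3.2 Start the inference service

This guide passes --prefix-caching-hash-algo xxhash to vllm serve, so install xxhash inside the inference container first (this step can be skipped if you drop that option):

pip install xxhash

Still inside the inference container, run the following to start the inference service: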

export ASCEND_RT_VISIBLE_DEVICES=0,1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=16
export VLLM_USE_V1=1
export CPU_AFFINITY_CONF=1
export TASK_QUEUE_ENABLE=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export MIES_USE_MB_SWAPPER=1
#export VLLM_TORCH_PROFILER_DIR="./vllm_profile"
#export VLLM_TORCH_PROFILER_WITH_STACK=0
#export VLLM_ASCEND_ENABLE_NZ=2

MODEL_PATH="/root/.cache/models/Qwen3-Omni-30B-A3B-Thinking"

vllm serve $MODEL_PATH \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 2 \
        --gpu-memory-utilization 0.9  \
        --max-model-len 32768 \
        --block-size 128 \
        --allowed-local-media-path / \
        --async-scheduling \
        --enable-prefix-caching \
        --prefix-caching-hash-algo xxhash \
        --served-model-name qw3-omni \
        --mm_processor_cache_type="shm" \
        --compilation-config '{"cudagraph_mode": "FULL"}' \
        > qw3-omni-serve.log 2>&1 &
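
Because the server is started in the background, requests sent before the model has finished loading will fail. A minimal sketch of a readiness wait follows, assuming a hypothetical helper named wait_for_server that polls the /health endpoint of the vLLM OpenAI-compatible server; the 600 second default timeout and 5 second interval are assumptions, not measured values.

```shell
# Hypothetical helper: poll the server until it answers or a timeout expires.
wait_for_server() {
    url=${1:-http://localhost:8000/health}
    timeout=${2:-600}   # seconds; assumed default, model loading can be slow
    interval=${3:-5}    # seconds between attempts
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # -f makes curl return non-zero on HTTP errors, -s silences progress.
        if curl -sf -o /dev/null "$url"; then
            echo "server ready after ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "server not ready after ${timeout}s, check qw3-omni-serve.log" >&2
    return 1
}

# Usage: wait_for_server && echo "safe to send requests"
```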

3.3 Run an inference test

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
        {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
    ]}
    ]
    }'
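
Since the server is started with --allowed-local-media-path /, a request can also point at an image on the server's file system through a file:// URL. The sketch below uses a hypothetical helper named build_image_request and a placeholder image path; it does no JSON escaping, so it assumes the arguments contain no double quotes or backslashes.

```shell
# Hypothetical helper: build a chat request body for one local image.
# No JSON escaping is done, so arguments must not contain quotes or backslashes.
build_image_request() {
    model=$1
    image_path=$2   # absolute path as seen by the server process
    prompt=$3
    cat <<EOF
{
  "model": "$model",
  "messages": [
    {"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "file://$image_path"}},
      {"type": "text", "text": "$prompt"}
    ]}
  ]
}
EOF
}

# Usage (the image path is a placeholder):
#   build_image_request qw3-omni /root/.cache/demo/cars.jpg "Describe the image." |
#       curl -s http://localhost:8000/v1/chat/completions \
#           -H "Content-Type: application/json" -d @-
```

The path must be valid inside the container, which in this guide means somewhere under the mounted /root/.cache directory.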

Test result (the captured output is truncated):

{"id":"chatcmpl-8b5311c6d0234ca2","object":"chat.completion","created":1774073875,"model":"qw3-omni","choices":[{"index":0,"message":{"role":"assistant","content":"
