| Environment | Configuration |
|---|---|
| Hardware | Atlas A2 910B4 (64 GB) |
| Driver version | 25.2.3 |
| CANN version | 8.5.1 |
| Inference framework | vllm-ascend |
| Inference image | quay.io/ascend/vllm-ascend:v0.17.0rc1 |
| Deployment | 2 cards (tensor parallel size 2) |

Example commands follow (v0.17.0rc1 is the image tag; change it as needed):
# Option 1:
docker pull quay.io/ascend/vllm-ascend:v0.17.0rc1
# Option 2:
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.17.0rc1
# Option 3:
docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.17.0rc1
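
If one registry is slow or unreachable, the three sources above can be tried in order. The following is a minimal sketch rather than part of the original procedure: the function names and the PULL_CMD override are hypothetical, and it assumes docker is available on the PATH.

```shell
# Print the three image references listed above for a given tag.
image_candidates() {
  local tag="$1" registry
  for registry in quay.io m.daocloud.io/quay.io quay.nju.edu.cn; do
    printf '%s/ascend/vllm-ascend:%s\n' "$registry" "$tag"
  done
}

# Try each candidate in order and stop at the first successful pull.
# PULL_CMD is a hypothetical hook (defaults to "docker pull") that allows a dry run.
pull_with_fallback() {
  local tag="$1" ref
  for ref in $(image_candidates "$tag"); do
    if ${PULL_CMD:-docker pull} "$ref"; then
      echo "pulled: $ref"
      return 0
    fi
  done
  echo "no registry could serve tag $tag" >&2
  return 1
}
```

Calling `pull_with_fallback v0.17.0rc1` attempts quay.io first and then the two mirrors. Note that an image pulled from a mirror keeps the mirror prefix in its name, so IMAGE must be set to the reference that was actually pulled.
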
To pull the image for a specific architecture, use the following command:
docker pull --platform arm64 quay.io/ascend/vllm-ascend:v0.17.0rc1

| Component | Version |
|---|---|
| python | 3.11.14 |
| torch | 2.9.0+cpu |
| torch_npu | 2.9.0 |
| vllm | 0.17.0 |
| vllm-ascend | 0.17.0rc1 |
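
To confirm that a running container matches the table above, the installed package versions can be compared against it. This is a sketch under assumptions: expected_versions and check_versions are hypothetical helper names, the input is the output of pip list --format=freeze, and package names are assumed to match the table up to letter case and "-" versus "_". The Python version is not covered.

```shell
# Expected versions, copied from the table above (names normalized to lower case with "-").
expected_versions() {
  cat <<'EOF'
torch 2.9.0+cpu
torch-npu 2.9.0
vllm 0.17.0
vllm-ascend 0.17.0rc1
EOF
}

# Read "name==version" lines on stdin and report each expected package.
# Returns non-zero if any package is missing or has a different version.
check_versions() {
  local installed status name want have
  # Normalize the input: lower-case everything and treat "_" as "-".
  installed="$(tr 'A-Z_' 'a-z-')"
  status=0
  while read -r name want; do
    have="$(printf '%s\n' "$installed" | awk -F'==' -v n="$name" '$1 == n { print $2; exit }')"
    if [ "$have" = "$want" ]; then
      echo "ok       $name $have"
    else
      echo "MISMATCH $name expected=$want found=${have:-none}"
      status=1
    fi
  done <<EOF
$(expected_versions)
EOF
  return "$status"
}
```

Inside the container it could be run as `pip list --format=freeze | check_versions`. The comparison is an exact string match, so a build that reports an extra local version suffix would show up as a mismatch and should be inspected by hand.
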
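
The model weights can be downloaded from the following source:
ModelScope community:
Example download command:
# If modelscope is not installed in the environment, install it first
pip install modelscope
# --local_dir: change to the desired storage path, e.g. /root/.cache/models/
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking --local_dir /root/.cache/models/Qwen3-Omni-30B-A3B-Thinking

# Mount devices as needed; this example mounts cards 0 and 1 (Atlas A2: /dev/davinci[0-7], Atlas A3: /dev/davinci[0-15])
# The mounted directory must contain the path where the weights are stored; this example uses /root/.cache, adjust as needed
# Set the container name
export CONTAINER_NAME=Qwen3-omni
# Select the image
export IMAGE=quay.io/ascend/vllm-ascend:v0.17.0rc1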
docker run --rm \
--name $CONTAINER_NAME \
--shm-size=50g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
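-it $IMAGE bash

This guide sets --prefix-caching-hash-algo xxhash, so after entering the inference container, install xxhash (optional):
pip install xxhash

After entering the inference container, run the following to start the inference service: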
export ASCEND_RT_VISIBLE_DEVICES=0,1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=16
export VLLM_USE_V1=1
export CPU_AFFINITY_CONF=1
export TASK_QUEUE_ENABLE=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MIES_USE_MB_SWAPPER=1
#export VLLM_TORCH_PROFILER_DIR="./vllm_profile"
#export VLLM_TORCH_PROFILER_WITH_STACK=0
#export VLLM_ASCEND_ENABLE_NZ=2
MODEL_PATH="/root/.cache/models/Qwen3-Omni-30B-A3B-Thinking"
vllm serve $MODEL_PATH \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--block-size 128 \
--allowed-local-media-path / \
--async-scheduling \
--enable-prefix-caching \
--prefix-caching-hash-algo xxhash \
--served-model-name qw3-omni \
--mm_processor_cache_type="shm" \
--compilation-config '{"cudagraph_mode": "FULL"}' \
> qw3-omni-serve.log 2>&1 &
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
{"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
{"type": "text", "text": "What can you see and hear? Answer in one sentence."}
]}
]
}'
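
Because vllm serve is started in the background, a request sent before the model has finished loading fails with a connection error. The sketch below polls the server's /health endpoint before sending anything, and builds the same request body from parameters. wait_for_server and build_payload are hypothetical helper names, the timeout values are arbitrary, and build_payload assumes its arguments contain no characters that need JSON escaping (quotes or backslashes).

```shell
# Poll a URL until it answers with a success status or the timeout (seconds) expires.
wait_for_server() {
  local url="$1" timeout="${2:-600}" interval="${3:-5}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "not ready after ${timeout}s" >&2
  return 1
}

# Print a chat-completions body with one image, one audio clip, and a text prompt.
# Arguments: served model name, image URL, audio URL, prompt text.
build_payload() {
  local model="$1" image_url="$2" audio_url="$3" prompt="$4"
  printf '{"model":"%s","messages":[' "$model"
  printf '{"role":"system","content":"You are a helpful assistant."},'
  printf '{"role":"user","content":['
  printf '{"type":"image_url","image_url":{"url":"%s"}},' "$image_url"
  printf '{"type":"audio_url","audio_url":{"url":"%s"}},' "$audio_url"
  printf '{"type":"text","text":"%s"}]}]}\n' "$prompt"
}

# Example (IMG_URL and AUDIO_URL are placeholders for the two media URLs):
# wait_for_server http://localhost:8000/health 900 &&
#   build_payload qw3-omni "$IMG_URL" "$AUDIO_URL" "What can you see and hear? Answer in one sentence." |
#   curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-
```

The model field is set to the name passed to --served-model-name. If the wait times out, qw3-omni-serve.log is the first place to look for the startup error.
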
Result returned by the curl request:
{"id":"chatcmpl-8b5311c6d0234ca2","object":"chat.completion","created":1774073875,"model":"qw3-omni","choices":[{"index":0,"message":{"role":"assistant","content":"