| Environment | Configuration |
|---|---|
| Hardware | Atlas 800T A2 910B2 (64G) |
| Driver version | 25.2.3 |
| CANN version | 8.3.RC2 |
| Inference framework | vllm-ascend |
| Inference image | quay.io/ascend/vllm-ascend:v0.12.0rc1 |
| Deployment | Single node, 4 NPUs |

Pull the image as shown below (`v0.12.0rc1` is the image tag; change it as needed):

    # Option 1:
    docker pull quay.io/ascend/vllm-ascend:v0.12.0rc1
    # Option 2:
    docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.12.0rc1
    # Option 3:
    docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.12.0rc1

To pull the image for a specific architecture:

    docker pull --platform arm64 quay.io/ascend/vllm-ascend:v0.12.0rc1

| Component | Version |
|---|---|
| python | 3.11.13 |
| torch | 2.8.0 |
| torch_npu | 2.8.0 |
| vllm | 0.12.0 |
| vllm-ascend | 0.12.0rc1 |
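
To check that a running container matches the versions listed above, the installed package versions can be read from `pip show`. The snippet below is a minimal sketch and not part of the original guide: `version_from_pip_show` is a hypothetical helper name, and it assumes the `Version:` line that `pip show` prints in its standard output.

```shell
# Hypothetical helper: read `pip show <pkg>` output on stdin and print
# the value of its "Version:" line (nothing if the line is absent).
version_from_pip_show() {
  awk '/^Version:/ { print $2; exit }'
}

# Example with canned input, so the sketch runs without pip installed.
printf 'Name: torch\nVersion: 2.8.0\nSummary: example\n' | version_from_pip_show
```

Inside the container, a call such as `pip show torch_npu | version_from_pip_show` can then be compared by eye with the corresponding row of the table.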

The model weights can be downloaded from the ModelScope community.
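
The container launch command that follows passes one `--device /dev/davinciN` flag per card. As a minimal sketch for cases where a different contiguous card range is needed, the flags can be generated instead of typed by hand. `davinci_device_flags` is a hypothetical helper, not part of Docker or vllm-ascend.

```shell
# Hypothetical helper: print one "--device /dev/davinciN" flag for each
# card index from $1 to $2 (inclusive), separated by spaces.
davinci_device_flags() {
  local i="$1"
  local last="$2"
  local flags=""
  while [ "$i" -le "$last" ]; do
    flags="$flags --device /dev/davinci$i"
    i=$((i + 1))
  done
  # Strip the leading space left by the first append.
  printf '%s\n' "${flags# }"
}

# Print the flags for cards 0-3, the range used in this guide.
davinci_device_flags 0 3
```

The output is meant to be expanded unquoted so that the shell splits it into separate arguments, for example `docker run $(davinci_device_flags 0 3) ...`.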

    # Set the container name
    export CONTAINER_NAME=Qwen3-Next-80B-A3B-Instruct
    # Select the image
    export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1
    # Mount devices as needed; this example uses cards 0-3
    # The mounted directories must include the path that holds the weights, e.g. /root/.cache
    docker run --rm \
      --name $CONTAINER_NAME \
      --shm-size=256g \
      --net=host \
      --device /dev/davinci0 \
      --device /dev/davinci1 \
      --device /dev/davinci2 \
      --device /dev/davinci3 \
      --device /dev/davinci_manager \
      --device /dev/devmm_svm \
      --device /dev/hisi_hdc \
      -v /usr/local/dcmi:/usr/local/dcmi \
      -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
      -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
      -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
      -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
      -v /etc/ascend_install.info:/etc/ascend_install.info \
      -v /root/.cache:/root/.cache \
      -it $IMAGE bash

After entering the inference container, run the following:

    # bisheng is already included in the image
    source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
    # Download triton_ascend
    wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl
    # Install triton_ascend
    pip install triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl
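
The wheel filename encodes the interpreter and platform it targets: `cp311` is CPython 3.11 and `aarch64` is 64-bit Arm, which lines up with the python 3.11.13 entry in the version table. The following sketch extracts those tags so they can be compared with the local environment before installing. `wheel_tags` is a hypothetical helper, and it assumes the standard wheel naming scheme in which the last three dash-separated fields are the python, ABI and platform tags.

```shell
# Hypothetical helper: print the python, ABI and platform tags of a wheel
# filename, i.e. its last three dash-separated fields.
wheel_tags() {
  local name="${1##*/}"   # drop any leading directory
  name="${name%.whl}"     # drop the extension
  local platform="${name##*-}"
  name="${name%-*}"
  local abi="${name##*-}"
  name="${name%-*}"
  local python="${name##*-}"
  printf '%s %s %s\n' "$python" "$abi" "$platform"
}

wheel_tags triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl
```

The printed tags can be checked against `python3 -c 'import sys; print("cp%d%d" % sys.version_info[:2])'` and `uname -m` inside the container.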

After entering the inference container, run the following to start the inference service:

    source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
    export VLLM_USE_V1=1
    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
    export HCCL_OP_EXPANSION_MODE="AIV"
    LOCAL_MODELS_DIR=/root/.cache/models/Qwen3-Next-80B-A3B-Instruct
    vllm serve "$LOCAL_MODELS_DIR" \
      --served-model-name Qwen3-Next-80B-A3B-Instruct \
      --max-model-len 40960 \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.7 \
      --host 0.0.0.0 \
      --port 8888 \
      --async-scheduling \
      --distributed-executor-backend "mp" \
      --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'

Once the service is up, send a test request:

    curl http://localhost:8888/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3-Next-80B-A3B-Instruct",
        "messages": [
          {"role": "user", "content": "Who are you?"}
        ],
        "temperature": 0.5,
        "max_tokens": 50,
        "stream": false
      }'
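
A request like the one above only succeeds once the server is accepting connections. As a minimal sketch, the generic retry loop below can be used to poll the server before sending the first request. `wait_until` is a hypothetical helper, and the usage note assumes the `/health` endpoint of vLLM's OpenAI-compatible server on port 8888, the port used in this guide.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 as soon as the command succeeds, 1 otherwise.
wait_until() {
  local attempts="$1"
  local delay="$2"
  local i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until 60 5 curl -sf http://localhost:8888/health && echo "server is ready"` retries for up to about five minutes; the attempt count and delay are illustrative values, not figures from this guide.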