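| Environment | Configuration |
|---|---|
| Hardware | Atlas 800T A2 910B2 (64G) |
| Driver version | 25.2.3 |
| CANN version | 8.3.RC2 |
| Inference framework | vllm-ascend |
| Inference image | quay.io/ascend/vllm-ascend:v0.11.0rc2 |
| Deployment | 4 nodes, 32 NPUs, 2+2 PD-disaggregated deployment |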
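
See the "Verify Multi-Node Communication" section of the "Installation Guide" in the official documentation.
Run the following commands on every node to check the environment: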
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
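With eight devices per node, a single DOWN port is easy to miss in the link-status output. The following is a minimal sketch of a filter that counts such ports. It assumes that `hccn_tool -i N -link -g` reports each port on a line containing `UP` or `DOWN`, as the comment above suggests; the exact line format is not shown in this guide, and `count_down_links` is a hypothetical helper name.

```shell
# Hypothetical helper: count ports reported as DOWN.
# Reads link-status output on stdin and prints the number of lines that
# contain the word DOWN (assumed output format, see the note above).
count_down_links() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c -w 'DOWN' || true
}

# Usage on a node (prints 0 when every port is UP):
#   for i in 0 1 2 3 4 5 6 7; do hccn_tool -i $i -link -g; done | count_down_links
```

A non-zero result means at least one port needs attention before continuing with the multi-node setup.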
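The TLS setting must be identical on every machine; all zeros is recommended:
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
If the values differ, use the following command to set the NPU's low-level TLS verification to 0, then rerun the check above to confirm that the change took effect:
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done
To check connectivity between machines, ping the NPU IP addresses of the other hosts from each local NPU. A successful ping means the link is working.
Run the following command on each machine to get the IP addresses of its NPUs:
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
The card-to-card ping command looks like this:
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address x.x.x.x
Example commands for pulling the image (v0.11.0rc2 is the image tag; change it as needed):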
# Option 1:
docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2
# Option 2:
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.11.0rc2
# Option 3:
docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
To pull the image for a specific architecture:
docker pull --platform arm64 quay.io/ascend/vllm-ascend:v0.11.0rc2

| Component | Version |
|---|---|
| python | 3.11.13 |
| torch | 2.7.1 |
| torch_npu | 2.7.1 |
| vllm | 0.11.0 |
| vllm-ascend | 0.11.0rc2 |
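
This guide uses the W8A8 weights already uploaded to the ModelScope community.
To quantize the model yourself, see the official documentation:
Qwen3-235B-A22B W8A8 mixed quantization, W4A8 mixed quantization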
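# Set the container name
export CONTAINER_NAME=Qwen3-235B-A22B-W8A8
# Choose the image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2
# Mount devices as needed; this example mounts cards 0-7
# The mounted directories must include the path that holds the weights, e.g. /root/.cache
# After reading the later sections, decide whether to mount /etc/hccn.conf for your scenario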
docker run --rm \
--name $CONTAINER_NAME \
--shm-size=256g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
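-it $IMAGE bash
llmdatadist: a ranktable.json file must be created before the service is started; it is used for multi-node communication.
Reference example: llmdatadist
If you use mooncake, you can skip this section.
**1. Check the hccn_tool file:** The ranktable.json generation script, gen_ranktable.sh, runs hccn_tool to collect NPU information. By default it checks whether hccn_tool exists at /usr/local/Ascend/driver/tools/hccn_tool. If the file is missing, install the relevant dependency package or copy the file from another environment.
**2. Generate ranktable.json:** Run the following commands in the container on each node.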
cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1
# bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip> <decoder_node1_local_ip> <decoder_node2_local_ip> \
#   --npus-per-node <npu_clips> --network-card-name <nic_name> --prefill-device-cnt <prefiller_npu_clips> --decode-device-cnt <decode_npu_clips>
bash gen_ranktable.sh --ips xxx.xxx.xxx.109 xxx.xxx.xxx.39 xxx.xxx.xxx.95 xxx.xxx.xxx.170 --npus-per-node 8 --network-card-name eth0 --prefill-device-cnt 16 --decode-device-cnt 16
With mooncake, no ranktable file is needed, but mooncake currently depends on the hccn.conf file located at /etc/hccn.conf. Mount this file when starting the Docker container. If it was not mounted, copy /etc/hccn.conf from the host into the container's /etc/ directory.
mooncake must be compiled manually. Reference examples: v0.11.0-dev:mooncake or main:mooncake
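Because mooncake depends on /etc/hccn.conf, it can help to confirm that the file is present inside the container before launching the service. A minimal sketch follows; `check_hccn_conf` is a hypothetical helper name, and the check only tests that the file exists and is non-empty, not that its contents are valid.

```shell
# Hypothetical helper: succeed only if hccn.conf exists and is non-empty.
# The path defaults to /etc/hccn.conf, the location used in this guide.
check_hccn_conf() {
    conf=${1:-/etc/hccn.conf}
    if [ -s "$conf" ]; then
        echo "found $conf"
    else
        echo "missing or empty: $conf (mount it or copy it from the host)" >&2
        return 1
    fi
}

# Usage inside the container:
#   check_hccn_conf || exit 1
```

If the check fails, one way to copy the file in is to run `docker cp /etc/hccn.conf $CONTAINER_NAME:/etc/hccn.conf` on the host.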
#!/bin/bash
# Prefill (P) node script: kv_producer, engine_id 0
nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8
vllm serve "$LOCAL_CKPT_DIR" \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--tensor-parallel-size 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name Qwen3-235B-A22B-W8A8 \
--max-model-len 40960 \
--max-num-batched-tokens 40960 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--quantization ascend \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--no-enable-prefix-caching \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_role": "kv_producer",
"kv_buffer_device": "npu",
"kv_parallel_size": 1,
"kv_port": "21000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}'
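The llmdatadist kv-transfer-config blocks in the four launch scripts of this guide differ only in `kv_role`, `kv_port` and `engine_id`. As a rough sketch, a small function can print the JSON for one node so the values are set in a single place. `kv_transfer_config` is a hypothetical helper name, not part of vllm-ascend, and it covers only the LLMDataDistCMgrConnector variant shown above.

```shell
# Hypothetical helper: print an LLMDataDistCMgrConnector kv-transfer-config.
# Usage: kv_transfer_config <kv_role> <kv_port> <engine_id>
kv_transfer_config() {
    role=$1 port=$2 engine=$3
    # Only the two roles used in this guide are accepted.
    case "$role" in
        kv_producer|kv_consumer) ;;
        *) echo "unknown kv_role: $role" >&2; return 1 ;;
    esac
    printf '{"kv_connector": "LLMDataDistCMgrConnector", "kv_role": "%s", "kv_buffer_device": "npu", "kv_parallel_size": 1, "kv_port": "%s", "engine_id": "%s", "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"}\n' \
        "$role" "$port" "$engine"
}

# Usage in a launch script (values for the first prefill node):
#   --kv-transfer-config "$(kv_transfer_config kv_producer 21000 0)"
```

The values used in this guide are 21000/0 and 21100/1 for the two producers, and 21200/2 and 21300/3 for the two consumers.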
The kv-transfer-config in the script above uses llmdatadist. To use mooncake, replace it with the following:
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "21000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 8
},
"decode": {
"dp_size": 1,
"tp_size": 8
}
}
}'
#!/bin/bash
# Prefill (P) node script: kv_producer, engine_id 1
nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8
vllm serve "$LOCAL_CKPT_DIR" \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--tensor-parallel-size 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--seed 1024 \
--enforce-eager \
--distributed-executor-backend mp \
--served-model-name Qwen3-235B-A22B-W8A8 \
--max-model-len 40960 \
--max-num-batched-tokens 40960 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--quantization ascend \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--no-enable-prefix-caching \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_role": "kv_producer",
"kv_buffer_device": "npu",
"kv_parallel_size": 1,
"kv_port": "21100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}'
The kv-transfer-config in the script above uses llmdatadist. To use mooncake, replace it with the following:
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "21100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 8
},
"decode": {
"dp_size": 1,
"tp_size": 8
}
}
}'
#!/bin/bash
# Decode (D) node script: kv_consumer, engine_id 2
nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8
vllm serve "$LOCAL_CKPT_DIR" \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--tensor-parallel-size 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name Qwen3-235B-A22B-W8A8 \
--max-model-len 40960 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--quantization ascend \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--async-scheduling \
--additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_role": "kv_consumer",
"kv_buffer_device": "npu",
"kv_parallel_size": 1,
"kv_port": "21200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}'
The kv-transfer-config in the script above uses llmdatadist. To use mooncake, replace it with the following:
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "21200",
"engine_id": "2",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 8
},
"decode": {
"dp_size": 1,
"tp_size": 8
}
}
}'
#!/bin/bash
# Decode (D) node script: kv_consumer, engine_id 3
nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8
vllm serve "$LOCAL_CKPT_DIR" \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--tensor-parallel-size 8 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5964 \
--seed 1024 \
--distributed-executor-backend mp \
--served-model-name Qwen3-235B-A22B-W8A8 \
--max-model-len 40960 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--quantization ascend \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--async-scheduling \
--additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_role": "kv_consumer",
"kv_buffer_device": "npu",
"kv_parallel_size": 1,
"kv_port": "21300",
"engine_id": "3",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}'
The kv-transfer-config in the script above uses llmdatadist. To use mooncake, replace it with the following:
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "21300",
"engine_id": "3",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 8
},
"decode": {
"dp_size": 1,
"tp_size": 8
}
}
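}'
Notes:
1) rope-scaling:
The model's config.json currently sets a maximum context length of 40960 tokens. To handle longer contexts, you can use YaRN, a technique that improves the model's length extrapolation and keeps performance strong on long texts.
vLLM currently supports only static YaRN, which means the scaling factor stays the same regardless of input length. This may affect accuracy and performance on shorter texts, so enable it only when needed.
To extend the context to 128K (a factor of 4.0 applied to 32768 original positions gives 131072 tokens), use a configuration like this:
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
2) To use function calling, add the following options:
--enable-auto-tool-choice
--tool-call-parser hermes
After the P nodes and D nodes have been started, start the PD proxy separately on one of the P nodes: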
cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1
python load_balance_proxy_server_example.py --host 0.0.0.0 --port 9000 --prefiller-hosts xxx.xxx.xxx.109 xxx.xxx.xxx.39 --prefiller-ports 8004 8004 --decoder-hosts xxx.xxx.xxx.95 xxx.xxx.xxx.170 --decoder-ports 8004 8004
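The proxy command pairs each host with a port by position, so the host and port lists must have the same length. The sketch below builds the command line from two host lists so the lists stay in step. It assumes every vLLM instance listens on the same port, as in this guide, and that the script accepts the flag names `--prefiller-ports` and `--decoder-ports`. `proxy_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: print the proxy launch command for the given hosts.
# Usage: proxy_cmd <vllm_port> "<prefill hosts>" "<decode hosts>"
# Assumes all vLLM instances share one port; the proxy listens on 9000.
proxy_cmd() {
    port=$1 prefill_hosts=$2 decode_hosts=$3
    p_ports="" d_ports=""
    # Emit one port entry per host (word splitting on the lists is intended).
    for h in $prefill_hosts; do p_ports="$p_ports $port"; done
    for h in $decode_hosts; do d_ports="$d_ports $port"; done
    echo "python load_balance_proxy_server_example.py --host 0.0.0.0 --port 9000" \
         "--prefiller-hosts $prefill_hosts --prefiller-ports${p_ports}" \
         "--decoder-hosts $decode_hosts --decoder-ports${d_ports}"
}

# Usage: print the command, review it, then run it from the examples directory:
#   proxy_cmd 8004 "xxx.xxx.xxx.109 xxx.xxx.xxx.39" "xxx.xxx.xxx.95 xxx.xxx.xxx.170"
```

Printing the command instead of running it keeps the helper safe to try on any machine.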
Send a test request through the proxy:
curl http://localhost:9000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen3-235B-A22B-W8A8",
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"max_tokens": 100,
"stream": false,
"temperature": 0.8,
"top_p": 0.8}'
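For scripted smoke tests, the response can be checked without reading it by hand. The sketch below assumes an OpenAI-compatible response body, which carries a `choices` field on success. It is a crude string match rather than real JSON parsing, and `is_chat_response` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the response body on stdin contains a
# "choices" field, as a successful chat completion does.
is_chat_response() {
    grep -q '"choices"'
}

# Usage: pipe the curl output from the request above into the check:
#   curl -s http://localhost:9000/v1/chat/completions ... | is_chat_response && echo "proxy OK"
```

An error body, which has no `choices` field, makes the function return a non-zero status.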