Ascend-SACT/Qwen3-235B-A22B-W8A8

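1. Prepare the runtime environment

1.1 Environment setup

| Item | Configuration |
| --- | --- |
| Hardware | Atlas 800T A2 910B2 (64G) |
| Driver version | 25.2.3 |
| CANN version | 8.3.RC2 |
| Inference framework | vllm-ascend |
| Inference image | quay.io/ascend/vllm-ascend:v0.11.0rc2 |
| Deployment | 4 nodes, 32 NPUs, 2+2 PD-disaggregated deployment |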

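1.2 Multi-node communication check

See the "Verify Multi-Node Communication" section of the "Installation Guide" in the official documentation.

1.2.1 Check the network status of each machine

Run the following commands on every node: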

# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done 
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
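
With eight ports per node and four nodes, the link-status output is easy to misread. The helper below is a minimal sketch that counts ports not reporting UP. It assumes each `hccn_tool -i N -link -g` call prints a line containing `link status: UP` or `link status: DOWN`; adjust the pattern if your driver version prints something different. The function name `count_links_not_up` is illustrative only.

```shell
# Count link-status lines that do not report UP.
# Assumption: hccn_tool prints one "link status: <STATE>" line per port.
count_links_not_up() {
    # Reads hccn_tool output on stdin and prints the number of non-UP ports.
    awk 'BEGIN { n = 0 } /link status/ && !/UP/ { n++ } END { print n }'
}

# Usage on a node (prints 0 when every port is UP):
#   for i in {0..7}; do hccn_tool -i $i -link -g; done | count_links_not_up
```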

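1.2.2 Check that the NPU low-level TLS setting is consistent

The value must be identical on every machine; setting all of them to 0 is recommended.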

for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch

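If the values differ, the following command sets the NPU low-level TLS verification switch to 0. After running it, rerun the command above to confirm that the change took effect.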

for i in {0..7};do hccn_tool -i $i -tls -s enable 0;done

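1.2.3 Check connectivity between machines

To check connectivity between machines, ping the NPU IP addresses of the other hosts from each NPU on the local machine. If the ping succeeds, the link is working.

Run the following command on each machine to list the IP addresses of its NPUs: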

for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done

Use a command like the following to ping from one NPU to another:

# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address x.x.x.x
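
Checking every local NPU against every remote NPU by hand takes many commands. The sketch below wraps the ping command above in a loop over local devices 0-7 and a list of remote NPU IPs passed as arguments. The function name `ping_remote_npus` is hypothetical, and the IPs are the `ipaddr` values you collected on the other nodes.

```shell
# Ping each remote NPU IP from every local NPU (devices 0-7).
# Arguments: one or more remote NPU IP addresses.
ping_remote_npus() {
    local ip dev
    for ip in "$@"; do
        for dev in 0 1 2 3 4 5 6 7; do
            echo "== NPU ${dev} -> ${ip}"
            hccn_tool -i "${dev}" -ping -g address "${ip}"
        done
    done
}

# Usage (replace with actual remote NPU IPs):
#   ping_remote_npus x.x.x.x y.y.y.y
```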

1.3 Image preparation and installation

  1. This guide uses the official image, pulled with docker pull.

Example (v0.11.0rc2 is the image tag; change it as needed):

# Source 1:
docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2

# Source 2:
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.11.0rc2

# Source 3:
docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2

To pull a specific architecture:

docker pull --platform arm64 quay.io/ascend/vllm-ascend:v0.11.0rc2

  2. If this approach does not work for you, follow the official documentation to install manually.
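
The three pull sources listed above can be tried in order until one succeeds. The following is a minimal sketch of that fallback; the function name `pull_vllm_ascend` is hypothetical, and the registry list is taken from the commands in this section.

```shell
# Try each registry in turn and print the name of the image that was pulled.
# Argument: image tag (defaults to v0.11.0rc2).
pull_vllm_ascend() {
    local tag="${1:-v0.11.0rc2}" registry image
    for registry in quay.io m.daocloud.io/quay.io quay.nju.edu.cn; do
        image="${registry}/ascend/vllm-ascend:${tag}"
        # Send docker's progress output to stderr so stdout holds only the image name.
        if docker pull "${image}" >&2; then
            echo "${image}"
            return 0
        fi
    done
    echo "all registries failed for tag ${tag}" >&2
    return 1
}
```

If the image comes from a mirror, its local name differs from quay.io/ascend/vllm-ascend. Either retag it with `docker tag` or use the printed name wherever the image is referenced later.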

1.4 Dependency versions

| Component | Version |
| --- | --- |
| python | 3.11.13 |
| torch | 2.7.1 |
| torch_npu | 2.7.1 |
| vllm | 0.11.0 |
| vllm-ascend | 0.11.0rc2 |

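2. Prepare the model weights

2.1 Obtain the weights

This guide uses the W8A8 weights uploaded to the ModelScope community directly.

2.2 Other weights

To quantize the model yourself, see the official documentation:

Qwen3-235B-A22B W8A8 mixed quantization, W4A8 mixed quantization

3. Multi-node PD-disaggregated deployment in practice

3.1 Start the inference container on each node

# Set the container name
export CONTAINER_NAME=Qwen3-235B-A22B-W8A8
# Select the image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2
# Mount devices as needed; this example mounts NPUs 0-7
# The mounted directories must include the path that holds the weights, e.g. /root/.cache
# After reading section 3.2, decide whether to mount /etc/hccn.conf for your scenario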

docker run --rm \
    --name $CONTAINER_NAME \
    --shm-size=256g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

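3.2 Notes on the two kv_connector deployment options

3.2.1 llmdatadist

llmdatadist: a ranktable.json file must be created before the service starts; it is used for multi-node communication.
Reference example: llmdatadist

3.2.1.1 Create the ranktable (not needed for mooncake)

If you use mooncake, skip this section.

1. Check the hccn_tool file: the ranktable.json generation script gen_ranktable.sh runs hccn_tool to obtain NPU information. By default it looks for hccn_tool at /usr/local/Ascend/driver/tools/hccn_tool. If the file does not exist, install the relevant dependency package or copy it from another environment.

**2. Generate ranktable.json:** run the following commands in the container on each node.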

cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1

# bash gen_ranktable.sh --ips <prefiller_node1_local_ip> <prefiller_node2_local_ip> <decoder_node1_local_ip> <decoder_node2_local_ip> \
#   --npus-per-node <npu_clips> --network-card-name <nic_name> --prefill-device-cnt <prefiller_npu_clips> --decode-device-cnt <decode_npu_clips>


bash gen_ranktable.sh --ips xxx.xxx.xxx.109    xxx.xxx.xxx.39    xxx.xxx.xxx.95   xxx.xxx.xxx.170 --npus-per-node  8 --network-card-name eth0 --prefill-device-cnt 16 --decode-device-cnt 16
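
Before starting the services, it can help to confirm that the script actually produced a usable file. The sketch below checks that ranktable.json exists, is non-empty, and parses as JSON when python3 is available. The function name `check_ranktable` is hypothetical, and the check says nothing about whether the contents are correct for your cluster.

```shell
# Basic sanity check for a generated ranktable file.
# Argument: path to the file (defaults to ranktable.json in the current directory).
check_ranktable() {
    local path="${1:-ranktable.json}"
    if [ ! -s "$path" ]; then
        echo "missing or empty: $path" >&2
        return 1
    fi
    # Validate JSON syntax only when python3 is installed.
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$path" >/dev/null || return 1
    fi
    echo "ok: $path"
}

# Usage, from the disaggregated_prefill_v1 directory:
#   check_ranktable ranktable.json
```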

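3.2.2 mooncake

  1. With mooncake, no ranktable file is needed, but mooncake currently depends on the hccn.conf file located at /etc/hccn.conf. Mount it when starting the docker container; if it was not mounted, copy /etc/hccn.conf from the host into the container's /etc/ directory.

  2. mooncake must be compiled manually. Reference examples: v0.11.0-dev:mooncake or main:mooncake

3.3 Enter the container on each node and start the corresponding inference service

3.3.1 Prefill instance 1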

#!/bin/bash
nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json

LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8

vllm serve "$LOCAL_CKPT_DIR" \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --tensor-parallel-size 8 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 5964  \
  --seed 1024 \
  --enforce-eager \
  --distributed-executor-backend mp \
  --served-model-name Qwen3-235B-A22B-W8A8 \
  --max-model-len 40960 \
  --max-num-batched-tokens 40960 \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --quantization ascend \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --no-enable-prefix-caching \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_role": "kv_producer",
  "kv_buffer_device": "npu",
  "kv_parallel_size": 1,
  "kv_port": "21000",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'

The kv-transfer-config in the example above uses llmdatadist. To use mooncake instead, replace it with the following:

--kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_producer",
  "kv_port": "21000",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 1,
                    "tp_size": 8
             }
      }
  }'

3.3.2 Prefill instance 2

#!/bin/bash

nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json

LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8

vllm serve "$LOCAL_CKPT_DIR" \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --tensor-parallel-size 8 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 5964  \
  --seed 1024 \
  --enforce-eager \
  --distributed-executor-backend mp \
  --served-model-name Qwen3-235B-A22B-W8A8 \
  --max-model-len 40960 \
  --max-num-batched-tokens 40960 \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --quantization ascend \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --no-enable-prefix-caching \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_role": "kv_producer",
  "kv_buffer_device": "npu",
  "kv_parallel_size": 1,
  "kv_port": "21100",
  "engine_id": "1",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'

The kv-transfer-config in the example above uses llmdatadist. To use mooncake instead, replace it with the following:

 --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_producer",
  "kv_port": "21100",
  "engine_id": "1",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 1,
                    "tp_size": 8
             }
      }
  }'

3.3.3 Decode instance 1

#!/bin/bash
nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json

LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8

vllm serve "$LOCAL_CKPT_DIR" \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --tensor-parallel-size 8 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 5964  \
  --seed 1024 \
  --distributed-executor-backend mp \
  --served-model-name Qwen3-235B-A22B-W8A8 \
  --max-model-len 40960 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --quantization ascend \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --async-scheduling \
  --additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_role": "kv_consumer",
  "kv_buffer_device": "npu",
  "kv_parallel_size": 1,
  "kv_port": "21200",
  "engine_id": "2",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'

The kv-transfer-config in the example above uses llmdatadist. To use mooncake instead, replace it with the following:

  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_consumer",
  "kv_port": "21200",
  "engine_id": "2",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 1,
                    "tp_size": 8
             }
      }
  }'

3.3.4 Decode instance 2

#!/bin/bash

nic_name="eth0"
local_ip=`hostname -I|awk -F" " '{print $1}'`

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# Required when using mooncake
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH

# Required when using llmdatadist
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json

LOCAL_CKPT_DIR=/root/.cache/models/Qwen3-235B-A22B-W8A8

vllm serve "$LOCAL_CKPT_DIR" \
  --host 0.0.0.0 \
  --port 8004 \
  --api-server-count 1 \
  --tensor-parallel-size 8 \
  --data-parallel-address $local_ip \
  --data-parallel-rpc-port 5964  \
  --seed 1024 \
  --distributed-executor-backend mp \
  --served-model-name Qwen3-235B-A22B-W8A8 \
  --max-model-len 40960 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --quantization ascend \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --async-scheduling \
  --additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_role": "kv_consumer",
  "kv_buffer_device": "npu",
  "kv_parallel_size": 1,
  "kv_port": "21300",
  "engine_id": "3",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'

The kv-transfer-config in the example above uses llmdatadist. To use mooncake instead, replace it with the following:

  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnector",
  "kv_role": "kv_consumer",
  "kv_port": "21300",
  "engine_id": "3",
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 8
             },
             "decode": {
                    "dp_size": 1,
                    "tp_size": 8
             }
      }
  }'
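
Across the four instance scripts, the kv-transfer-config differs only in kv_role, kv_port, and engine_id: the prefill instances use engine_id 0 and 1 with ports 21000 and 21100, and the decode instances use engine_id 2 and 3 with ports 21200 and 21300. The sketch below generates the llmdatadist variant from an instance index, following that pattern. The function name `kv_transfer_config` is hypothetical; it is a convenience for avoiding copy-paste mistakes, not part of vllm-ascend.

```shell
# Print the llmdatadist kv-transfer-config JSON for one instance.
# Arguments: instance index (0-3) and role (kv_producer or kv_consumer).
# Pattern taken from the scripts above: engine_id = index, kv_port = 21000 + 100 * index.
kv_transfer_config() {
    local idx="$1" role="$2"
    local port=$((21000 + 100 * idx))
    printf '{"kv_connector": "LLMDataDistCMgrConnector", "kv_role": "%s", "kv_buffer_device": "npu", "kv_parallel_size": 1, "kv_port": "%s", "engine_id": "%s", "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"}\n' \
        "$role" "$port" "$idx"
}

# Usage inside a launch script, e.g. for Decode instance 1:
#   vllm serve "$LOCAL_CKPT_DIR" ... --kv-transfer-config "$(kv_transfer_config 2 kv_consumer)"
```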

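Notes:
1) rope-scaling: The model's config.json sets the maximum context length to 40960 tokens. To handle longer contexts, you can use YaRN, a technique that improves length extrapolation and helps preserve performance on long texts. vLLM currently supports only static YaRN, meaning the scaling factor stays constant regardless of input length, which may affect accuracy and performance on shorter texts. Enable it only when you need it.

To extend the context to 128K, use a configuration like the following: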

--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
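
The 128K figure follows from the two numbers in the rope-scaling setting: original_max_position_embeddings multiplied by factor. The sketch below does that arithmetic; the function name `yarn_max_len` is hypothetical. Raising --max-model-len to the computed value is an assumption about how you would use the extended window, not something the scripts above do.

```shell
# Compute the extended context length for a static YaRN configuration.
# Arguments: original_max_position_embeddings and an integer scaling factor.
yarn_max_len() {
    local original="$1" factor="$2"
    echo $((original * factor))
}

# 32768 * 4 = 131072 tokens, i.e. 128K.
yarn_max_len 32768 4
```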

2) To use function calling, add the following options:

--enable-auto-tool-choice
--tool-call-parser hermes
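
With these options enabled, a request can offer tools to the model through the OpenAI-compatible chat API. The sketch below sends such a request to the proxy described in section 3.3.5. `get_weather` is a made-up tool for illustration, and the function name `send_tool_call_request` is hypothetical.

```shell
# Send a chat request that offers one tool to the model.
# Argument: base URL of the proxy (defaults to http://localhost:9000).
# get_weather is a hypothetical tool used only for illustration.
send_tool_call_request() {
    local base_url="${1:-http://localhost:9000}"
    curl -s "${base_url}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{
  "model": "Qwen3-235B-A22B-W8A8",
  "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto",
  "max_tokens": 200
}'
}
```

If the model decides to call the tool, the response message should carry a tool_calls entry instead of plain text content.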

3.3.5 Proxy instance for request distribution

After the P nodes and D nodes have started, start the PD proxy separately on one of the P nodes:

cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1

python load_balance_proxy_server_example.py --host 0.0.0.0 --port 9000 --prefiller-hosts xxx.xxx.xxx.109 xxx.xxx.xxx.39 --prefiller-ports 8004 8004 --decoder-hosts xxx.xxx.xxx.95 xxx.xxx.xxx.170 --decoder-ports 8004 8004
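
Loading a model of this size takes a while, so the proxy is only useful once all four instances are serving. The sketch below polls an instance until it responds. It assumes the vLLM OpenAI-compatible server exposes a /health endpoint on the serving port (8004 in the scripts above); the function name `wait_for_health` and the retry defaults are hypothetical.

```shell
# Poll a URL until it returns a successful HTTP status.
# Arguments: URL, number of attempts (default 60), delay in seconds (default 10).
wait_for_health() {
    local url="$1" retries="${2:-60}" delay="${3:-10}" i=0
    while [ "$i" -lt "$retries" ]; do
        if curl -sf -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not healthy after ${retries} attempts: ${url}" >&2
    return 1
}

# Usage (replace the hosts with the real P and D node IPs):
#   for host in xxx.xxx.xxx.109 xxx.xxx.xxx.39 xxx.xxx.xxx.95 xxx.xxx.xxx.170; do
#       wait_for_health "http://${host}:8004/health" || exit 1
#   done
```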

3.4 Run an inference test

curl http://localhost:9000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen3-235B-A22B-W8A8",
"messages": [{"role": "user", "content": "你好, 你是谁"}],
"max_tokens": 100,
"stream": false,
"temperature":0.8,
"top_p":0.8}'