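Overview

This guide describes a practical approach to deploying the DeepSeek-V3.2-Exp model with vLLM in an A+X environment.

Environment Dependencies

When installing the key software and dependencies, the component versions must match one another.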

OS: Ubuntu 22.04
CANN: 8.3.RC1
Chip: 910B
Arch: x86_64
Torch: 2.7.1
Torch_npu: 2.7.1
vllm: 0.11.0
vllm_ascend: 0.11.0rc2
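
Because the versions must match, a quick check of the Python packages before continuing can save time. The following is a minimal sketch, assuming the packages were installed with pip under the names torch, torch_npu, vllm, and vllm_ascend. The helper names installed_version and check_version are hypothetical, and a local build suffix such as +cpu is ignored when comparing.

```shell
# Expected versions, copied from the list above.
expected="torch=2.7.1 torch_npu=2.7.1 vllm=0.11.0 vllm_ascend=0.11.0rc2"

# Print the installed version of a pip package (empty if it is not installed).
installed_version() {
    pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Compare an expected version ($2) with an actual one ($3) for package $1.
check_version() {
    name=$1
    want=$2
    have=${3%%+*}   # drop a local build suffix such as "+cpu"
    if [ "$want" = "$have" ]; then
        echo "OK       $name $have"
    else
        echo "MISMATCH $name: expected $want, found ${have:-none}"
        return 1
    fi
}

status=0
for item in $expected; do
    name=${item%%=*}
    want=${item#*=}
    check_version "$name" "$want" "$(installed_version "$name")" || status=1
done

if [ "$status" -eq 0 ]; then
    echo "All package versions match"
else
    echo "Some package versions differ from the expected list"
fi
```

The CANN version is not a pip package, so it is not covered by this check.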

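Number of cards: at least 32 910B cards are required. This guide uses a two-node deployment as an example, with 16 NPUs per node.

Download the Original Model Files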

https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16

Custom Operator Compilation (custom-ops)

1. On x86 platforms, compile the custom operators by following the guide at https://gitcode.com/cann/cann-recipes-infer/blob/master/ops/ascendc/README.md and make the following changes:

bash build.sh --disable-check-compatible -c ascend910b

bash build_and_install.sh -c ascend910b

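Add the --disable-check-compatible option to turn off the strict version check; it is enough to make sure the CANN version is 8.3.RC1.

Add the -c ascend910b argument to specify the chip model explicitly; change it to match the chip you actually use.

Finally, make sure all 5 test cases pass.

For 910B chips, the commands above can be used as-is; for other chip models, replace the -c argument. The complete build, install, and test sequence is: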

# Clone the recipes repository
mkdir -p /home/code; cd /home/code/
git clone https://gitcode.com/cann/cann-recipes-infer.git
cd cann-recipes-infer
# Load the CANN toolkit environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Build the custom operator package
cd /home/code/cann-recipes-infer/ops/ascendc
bash build.sh --disable-check-compatible -c ascend910b
# Install the package into the CANN opp directory and load its environment
cd /home/code/cann-recipes-infer/ops/ascendc/output
chmod +x CANN-custom_ops--linux.x86_64.run
./CANN-custom_ops--linux.x86_64.run --quiet --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
# List the vendors directory to confirm the installation
ls /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/
# Build and install the PyTorch extension wheel
cd /home/code/cann-recipes-infer/ops/ascendc/torch_ops_extension
bash build_and_install.sh -c ascend910b
cd dist
pip install custom_ops-1.0-cp311-cp311-linux_x86_64.whl --force-reinstall
pip install expecttest
# Run the 5 test cases
cd /home/code/cann-recipes-infer/ops/ascendc/examples
python3 test_npu_lightning_indexer.py
python3 test_npu_mla_prolog_v3.py
python3 test_npu_swiglu_clip_quant.py
python3 test_npu_sparse_flash_attention_antiquant.py
python3 test_npu_sparse_flash_attention.py
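
Since all 5 test cases must pass, it can help to run them in a loop and get a single pass/fail summary instead of reading five separate outputs. The following is a minimal sketch; the function name run_op_tests is hypothetical, and it assumes each test script exits with a non-zero status when a test fails.

```shell
# Run each test script with the given runner and count failures.
# Usage: run_op_tests <runner> <script>...
run_op_tests() {
    runner=$1
    shift
    failed=0
    for script in "$@"; do
        if "$runner" "$script"; then
            echo "PASS $script"
        else
            echo "FAIL $script"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $# test scripts failed"
    [ "$failed" -eq 0 ]
}

# On the build machine (paths as in the steps above):
# cd /home/code/cann-recipes-infer/ops/ascendc/examples
# run_op_tests python3 test_npu_lightning_indexer.py test_npu_mla_prolog_v3.py \
#     test_npu_swiglu_clip_quant.py test_npu_sparse_flash_attention_antiquant.py \
#     test_npu_sparse_flash_attention.py
```

The function returns a non-zero status if any script fails, so it can also gate a later deployment step.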

2. On Arm platforms, compile the operators by following this guide:

https://github.com/vllm-project/vllm-ascend/issues/3278

See also:

https://docs.vllm.ai/projects/ascend/zh-cn/latest/tutorials/DeepSeek-V3.2-Exp.html

Model Deployment Configuration

Primary Node Configuration

#!/bin/sh
# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="10.244.81.33"

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export VLLM_ENGINE_INIT_TIMEOUT=3600
export ENGINE_INIT_TIMEOUT=3600  
export HCCL_EXEC_TIMEOUT=7200
export HCCL_CONNECT_TIMEOUT=3600


vllm serve /path/DeepSeek-V3.2-Exp-BF16 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name deepseek_v32 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'

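Environment variables:
nic_name="xxxx" and local_ip="xxxx": the network interface name and IP address used by the current node. Interfaces such as eth or enp and their IP addresses are recommended. Make sure all distributed communication goes through the specified high-speed interface rather than a slow network.
export HCCL_IF_IP=$local_ip: the IP address used by HCCL.
export GLOO_SOCKET_IFNAME=$nic_name: the network interface name used by the Gloo library.
export TP_SOCKET_IFNAME=$nic_name: the network interface name used by TP.
export HCCL_SOCKET_IFNAME=$nic_name: the network interface name used by HCCL.
export OMP_PROC_BIND=false: disables OpenMP processor binding.
export OMP_NUM_THREADS=1: sets the number of OpenMP threads to 1.
export HCCL_BUFFSIZE=1024: sets the size of the HCCL communication buffer.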
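
The interface name has to correspond to the node's IP address, and a mismatch is easy to introduce when copying the script between nodes. As a minimal sketch, the lookup can be automated with the ip tool from iproute2 in place of reading ifconfig output by hand; the function name nic_for_ip is hypothetical, and the IP address is the example value from the script above.

```shell
# Print the name of the interface whose IPv4 address equals $1.
# Reads the output of `ip -o -4 addr show` on stdin.
nic_for_ip() {
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

local_ip="10.244.81.33"
if command -v ip >/dev/null 2>&1; then
    nic_name=$(ip -o -4 addr show | nic_for_ip "$local_ip")
    echo "nic_name=${nic_name:-<no interface has $local_ip>}"
fi
```

If no interface is reported, local_ip does not belong to this node and the exported variables would point HCCL and Gloo at the wrong network.
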
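Parameters:
vllm serve /path/DeepSeek-V3.2-Exp-BF16: the path to the model.
--host 0.0.0.0 --port 8000: the address and port the service listens on.
--data-parallel-size 2: the two nodes together run 2 data-parallel (DP) groups. This value is tightly coupled to --data-parallel-size-local and equals --data-parallel-size-local multiplied by the number of nodes.
--data-parallel-size-local 1: each node runs 1 data-parallel group. This value is tightly coupled to --data-parallel-size.
--data-parallel-address $local_ip: the IP address used for communication between the primary node and the DP groups on the other nodes.
--data-parallel-rpc-port 13389: the RPC port used for communication between DP groups.
--seed 1024: the random seed, for reproducibility.
--tensor-parallel-size 16: sets the tensor-parallel (TP) size to 16.
--enable-expert-parallel: enables expert parallelism (EP).
--max-num-seqs 16: the maximum number of sequences processed at the same time, that is, the concurrency.
--max-model-len 16384: the maximum context length the model handles.
--trust-remote-code: trusts remote code and allows it to be executed.
--gpu-memory-utilization 0.9: the maximum fraction of each NPU's device memory used for inference.
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}': enables the Ascend scheduler and the torchair graph mode.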
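
The relationship between the parallelism flags and the card count can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each data-parallel group occupies exactly --tensor-parallel-size NPUs (no pipeline parallelism is configured in this guide); the variable names are illustrative.

```shell
nodes=2          # number of nodes in this guide
dp_local=1       # --data-parallel-size-local
tp=16            # --tensor-parallel-size

dp_total=$((dp_local * nodes))      # value for --data-parallel-size
npus_per_node=$((dp_local * tp))    # NPUs used on each node
total_npus=$((dp_total * tp))       # NPUs used across the cluster

echo "data-parallel-size=$dp_total npus_per_node=$npus_per_node total_npus=$total_npus"
# prints: data-parallel-size=2 npus_per_node=16 total_npus=32
```

The result matches the hardware stated earlier: 16 NPUs per node and 32 cards in total.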

Secondary Node Configuration

#!/bin/sh
# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="eth0"
local_ip="10.244.55.39"
node0_ip="10.244.81.33"

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/bin/set_env.bash
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export VLLM_ENGINE_INIT_TIMEOUT=3600
export ENGINE_INIT_TIMEOUT=3600  
export HCCL_EXEC_TIMEOUT=7200
export HCCL_CONNECT_TIMEOUT=3600


vllm serve /path/DeepSeek-V3.2-Exp-BF16 \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name deepseek_v32 \
--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}' \
2>&1 | tee ./infer_$(date +%Y%m%d_%H%M%S).log

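Key configuration notes:
node0_ip="xxxx": key variable. The IP address of the primary node, which the secondary node uses to establish communication with it.
--headless: key parameter. Marks this node as headless: it does not start an API server or handle external requests, and acts only as a worker in the cluster, scheduled by the primary node.
--data-parallel-address $node0_ip: key parameter. The IP address of the primary node; all secondary nodes communicate with the primary node through this address.
Run this script on the secondary node. It starts vLLM in --headless mode and connects to the primary node through --data-parallel-address $node0_ip and --data-parallel-rpc-port 13389; the primary node is responsible for dispatching requests.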
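
Once both nodes have finished starting, sending one request to the primary node is a quick way to confirm the deployment. The following is a minimal sketch, assuming the primary node's OpenAI-compatible completions endpoint is reachable on port 8000 as configured above; the prompt and max_tokens values are arbitrary illustrative choices.

```shell
primary_ip="10.244.81.33"   # primary node IP from the script above
model="deepseek_v32"        # must match --served-model-name

# Build the JSON request body.
payload=$(printf '{"model": "%s", "prompt": "Hello", "max_tokens": 32}' "$model")

# Send the request; print a hint instead of failing silently.
curl -sS --connect-timeout 5 --max-time 120 \
    "http://${primary_ip}:8000/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    || echo "Request failed: check that both nodes have finished starting"
```

Only the primary node accepts requests; the secondary node runs with --headless and exposes no API.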

Overview