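GLM-4.5 Migration and Deployment Guide

Introduction

GLM-4.5 uses a Mixture-of-Experts (MoE) architecture and is a foundation model designed for agent applications. This document walks through the main steps for validating the model on two Atlas 800 A2 nodes: supported features, feature configuration, environment preparation, multi-node deployment, and performance evaluation.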

Environment Preparation

Environment Information

|Component|Version|
|---|---|
|Driver and firmware|25.5.0|
|CANN|8.5.0|
|Python|3.11.14|
|torch|2.9.0|
|torch_npu|2.9.0|

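Download the Model Weights

GLM-4.5 requires at least one Atlas 800 A3 node (64G×16) or two Atlas 800 A2 nodes (64G×8). This guide covers only the two-node A2 setup and assumes a master node, Node0, with IP xx.xx.xx.0 and a worker node, Node1, with IP xx.xx.xx.1. The model weights must be available on both nodes: either download them on each node, or download them on the master node only and mount them on the worker node through an NFS share. The steps below download the weights on each node separately:

Node0 host:

Create share_weight under /tmp and grant read, write, and execute permissions to all users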

mkdir /tmp/share_weight
chmod 777 /tmp/share_weight
cd /tmp/share_weight

Make sure huggingface-hub and tqdm are installed

pip install huggingface-hub tqdm

Download the weights into the current directory (/tmp/share_weight)

HF_ENDPOINT=https://hf-mirror.com python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='zai-org/GLM-4.5',
    local_dir='./GLM-4.5',
    resume_download=True,
    local_dir_use_symlinks=False
)
"

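Note: If another node in your environment has already downloaded these weights, you can copy them over with scp instead of downloading them again.

Node1 host: follow the same steps as on the Node0 host.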
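
After a download or an scp copy, it can be worth confirming that no weight shard is missing before starting the service. The following is a minimal sketch, assuming the checkpoint follows the usual Hugging Face sharded layout with a model.safetensors.index.json file whose weight_map lists the shard files. The function name missing_shards is hypothetical.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard files named in the safetensors index but absent on disk."""
    root = Path(model_dir)
    # Assumption: the checkpoint ships a Hugging Face style shard index.
    index_path = root / "model.safetensors.index.json"
    if not index_path.is_file():
        # Without the index nothing can be cross-checked; report the index itself.
        return [index_path.name]
    with index_path.open(encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    # weight_map maps tensor names to shard file names; deduplicate the shards.
    shards = sorted(set(weight_map.values()))
    return [name for name in shards if not (root / name).is_file()]


if __name__ == "__main__":
    # Path taken from this guide; adjust it to your actual weight directory.
    print(missing_shards("/tmp/share_weight/GLM-4.5"))
```

An empty list means every shard referenced by the index is present. The check does not verify file contents, so a truncated shard would still pass.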

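Verify Multi-Node Communication

Per-Node Checks

Run the following commands on both Node0 and Node1. The output of each command indicates whether communication between the two nodes is healthy.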

 # Check the remote switch ports
 for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done 
 # Get the link status of the Ethernet ports (UP or DOWN)
 for i in {0..7}; do hccn_tool -i $i -link -g ; done
 # Check the network health status
 for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
 # View the network detected IP configuration
 for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
 # View gateway configuration
 for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
 # View NPU network configuration
 cat /etc/hccn.conf
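
The contents of /etc/hccn.conf can also be checked programmatically, for example to confirm that every NPU has an address configured. The sketch below assumes the file consists of key_index=value lines such as address_0=... and netmask_0=...; this layout is an assumption about the file format, and both function names are hypothetical.

```python
def parse_hccn_conf(text: str) -> dict[int, dict[str, str]]:
    """Group key_index=value lines by NPU index, e.g. address_0=10.0.0.1."""
    devices: dict[int, dict[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Split "address_0" into the field name and the trailing device index.
        name, sep, index = key.strip().rpartition("_")
        if not sep or not index.isdigit():
            continue
        devices.setdefault(int(index), {})[name] = value.strip()
    return devices


def devices_missing(devices: dict[int, dict[str, str]], field: str, count: int = 8) -> list[int]:
    """List device indices in 0..count-1 that lack the given field."""
    return [i for i in range(count) if field not in devices.get(i, {})]
```

For example, devices_missing(parse_hccn_conf(open("/etc/hccn.conf").read()), "address") would list the NPUs without a configured address on an 8-NPU A2 node.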

Interconnect Verification

Get the NPU IP addresses

for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done

Cross-node ping test (replace x.x.x.x with an NPU IP address of the other node)

hccn_tool -i 0 -ping -g address x.x.x.x
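
To test more than one device and address, the ping command above can be generated for every local NPU and every remote NPU address. This is a minimal sketch that only builds the command strings in the same form as the command shown here; the helper name build_ping_commands is hypothetical and the addresses in the example are placeholders.

```python
import ipaddress


def build_ping_commands(remote_ips: list[str], device_count: int = 8) -> list[str]:
    """Build one hccn_tool ping command per (local device, remote IP) pair."""
    commands = []
    for ip in remote_ips:
        # Raises ValueError early if an address is malformed.
        ipaddress.ip_address(ip)
    for device_id in range(device_count):
        for ip in remote_ips:
            commands.append(f"hccn_tool -i {device_id} -ping -g address {ip}")
    return commands


if __name__ == "__main__":
    # Placeholder addresses; use the NPU IPs reported by the other node.
    for cmd in build_ping_commands(["192.0.2.10", "192.0.2.11"], device_count=2):
        print(cmd)
```

The generated lines can be pasted into a shell on the node under test. With 8 local devices and 8 remote addresses this produces 64 commands.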

Prepare the Image

Make sure the vllm-ascend image is present on both the Node0 and Node1 hosts. If it is not, pull it with the following command:

docker pull quay.io/ascend/vllm-ascend:v0.14.0rc1

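After the pull completes, run docker images to confirm that the image is present.

Disable the Firewall

In a multi-node deployment, an active firewall can cause Gloo's direct cross-node connections on random ports to fail inside the Docker containers. Check the firewall status on each host with the following command: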

systemctl status firewalld

If the firewall is still running, stop it:

systemctl stop firewalld
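
Whether a TCP port on the other node is actually reachable can be checked with a short connection test. The sketch below uses only the Python standard library; the function name port_reachable is hypothetical. It tests a single fixed port, so it is a sanity check rather than proof that Gloo's randomly chosen ports will work.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

The call only succeeds when something is listening on the target port. For example, once the service on Node0 is running, port_reachable("<node0_ip>", 13389) from Node1 would probe the data-parallel RPC port used later in this guide.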

Start the Container

Node0 and Node1 use the same startup command:

export IMAGE=quay.io/ascend/vllm-ascend:v0.14.0rc1

docker run \
--name vllm-ascend-glm \
--net=host \
--shm-size=500g \
--privileged=true \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /tmp/:/tmp/ \
-it $IMAGE bash

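The /tmp directory holds the model weights used in this guide; replace this mount with the directory where you actually downloaded the weights. The container name in --name vllm-ascend-glm can also be changed, for example to a name that reflects the model being served.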
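
On a node with a different NPU count, such as an Atlas 800 A3 node with 16 NPUs, the list of --device flags changes accordingly. The following sketch generates the flags for a given count, assuming the devices are numbered /dev/davinci0 upward as in the command above; the helper name device_flags is hypothetical.

```python
def device_flags(npu_count: int) -> list[str]:
    """Build the docker --device flags for npu_count NPUs plus the shared devices."""
    if npu_count < 1:
        raise ValueError("npu_count must be at least 1")
    flags = [f"--device /dev/davinci{i}" for i in range(npu_count)]
    # Management devices taken from the docker run command in this guide.
    flags += [
        "--device /dev/davinci_manager",
        "--device /dev/devmm_svm",
        "--device /dev/hisi_hdc",
    ]
    return flags


if __name__ == "__main__":
    # Print the flags with line continuations, ready to paste into docker run.
    print(" \\\n".join(device_flags(8)))
```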

Deployment

In the Node0 container: create a startNode0.sh script in any directory and edit it with vim startNode0.sh

#!/bin/sh
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1

vllm serve /tmp/share_weight/GLM-4.5  \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--api-server-count 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name glm4.5 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9

**Note:** local_ip and nic_name can be looked up with ifconfig; replace /tmp/share_weight/GLM-4.5 with your actual model weight path.
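
The interface name that belongs to a given local_ip can also be found by script. The sketch below parses the one-line-per-address output of ip -o -4 addr show, used here as an alternative to ifconfig; this assumes the ip command is available in the container, and both function names are hypothetical.

```python
import subprocess
from typing import Optional


def nic_for_ip(ip_addr_output: str, local_ip: str) -> Optional[str]:
    """Find the interface whose IPv4 address equals local_ip."""
    for line in ip_addr_output.splitlines():
        # Expected fields: "2:", "eth0", "inet", "192.168.1.10/24", ...
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            if fields[3].split("/")[0] == local_ip:
                return fields[1]
    return None


def detect_nic(local_ip: str) -> Optional[str]:
    """Run `ip -o -4 addr show` and look up the interface for local_ip."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return nic_for_ip(result.stdout, local_ip)
```

Calling detect_nic with the node's own IP returns the value to use for nic_name, or None if no interface carries that address.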

In the Node1 container: create a startNode1.sh script in any directory and edit it with vim startNode1.sh

#!/bin/sh
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxxx"
local_ip="xxxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1

vllm serve /tmp/share_weight/GLM-4.5 \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--tensor-parallel-size 8 \
--served-model-name glm4.5 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9

**Note:** local_ip and nic_name can be looked up with ifconfig; node0_ip must be set to the IP address of Node0 (xx.xx.xx.0). Replace /tmp/share_weight/GLM-4.5 with your actual model weight path.
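
The parallelism flags in the two scripts have to agree with the hardware: with --data-parallel-size 2 and --tensor-parallel-size 8, the deployment spans 2 × 8 = 16 NPUs, 8 per node. The sketch below reproduces this arithmetic and derives each node's start rank, assuming every node hosts the same number of data-parallel ranks and that expert parallelism reuses the same devices. The function name plan_layout is hypothetical.

```python
def plan_layout(dp_size: int, dp_size_local: int, tp_size: int, npus_per_node: int) -> dict:
    """Derive node count, NPU usage, and per-node start ranks from the flags."""
    if dp_size % dp_size_local != 0:
        raise ValueError("dp_size must be a multiple of dp_size_local")
    nodes = dp_size // dp_size_local
    # Each local data-parallel rank occupies tp_size NPUs.
    npus_used_per_node = dp_size_local * tp_size
    if npus_used_per_node > npus_per_node:
        raise ValueError("a node does not have enough NPUs for this layout")
    return {
        "nodes": nodes,
        "npus_used_per_node": npus_used_per_node,
        "total_npus": dp_size * tp_size,
        # Value for --data-parallel-start-rank on node 0, node 1, ...
        "start_ranks": [n * dp_size_local for n in range(nodes)],
    }


if __name__ == "__main__":
    # Values used in this guide: DP=2, one DP rank per node, TP=8, 8 NPUs per A2 node.
    print(plan_layout(dp_size=2, dp_size_local=1, tp_size=8, npus_per_node=8))
```

For the values in this guide the start ranks come out as 0 and 1, which matches --data-parallel-start-rank 1 in startNode1.sh.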

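Run startNode0.sh and startNode1.sh in two separate shell windows, one for each node's container. The service has started successfully when the Node0 (master) container prints the following output: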

INFO:     Started server process [44610]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [44611]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
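
Instead of watching the log, readiness can be polled over HTTP. The sketch below repeatedly requests the /health route of the vLLM OpenAI-compatible server until it answers with status 200 or a deadline passes. The function name and the default timeouts are hypothetical choices, not values from this guide.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll base_url/health until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused, timeout, or error status).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For this guide's setup the call would be wait_until_healthy("http://<node0_ip>:8000"). Loading a model of this size can take a long time, so a generous timeout is advisable.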

Functional Verification

Open a new shell window, enter the Node0 container, and run the following command:

curl http://<node0_ip>:<port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "glm4.5",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'

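Replace node0_ip with the IP address of the master node Node0 and port with the port set when starting the service (8000 in this guide). If a result is returned and the server logs an INFO line with 200 OK, GLM-4.5 is running correctly on A2.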
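
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the OpenAI-style completions response in which the generated text is found under choices[0].text; it uses max_tokens, the length field of the completions endpoint. Both function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(host: str, port: int, model: str, prompt: str,
                             max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(host: str, port: int, model: str, prompt: str) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(host, port, model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the service from this guide running, complete("<node0_ip>", 8000, "glm4.5", "The future of AI is") would return the continuation text.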

Accuracy Testing

Install AISBench

Run the following commands in any directory:

git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517

Install the additional AISBench dependencies:

pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt

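Run ais_bench -h to check that the installation succeeded.

Download the Dataset

Using the C-Eval dataset as an example, download the dataset and unpack it into the expected path: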

cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip

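Edit the Configuration File

Parameter description:
- attr: identifier of the inference backend type, fixed to service (service-based inference) or local (local model).
- type: selects the backend API type.
- abbr: unique identifier of the local task, used to tell multiple tasks apart.
- path: set this to your model weight path.
- model: set this to the model name used in vLLM.
- host_ip and host_port: set these to the IP and port of your vLLM server.
- max_out_len: max_out_len plus the LLM input length should be less than the max-model-len configured on the vLLM server; 32768 suits most datasets.
- batch_size: set this according to your dataset.
- temperature: adjusts the inference parameters.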

from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = xxx,
        batch_size = xxx,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]

The parameters to modify are path, model, host_ip, max_out_len, and batch_size. This guide uses max_out_len = 16384 and batch_size = 32.
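
The rule that max_out_len plus the input length should stay below max-model-len leaves a fixed budget for the prompt. The sketch below computes that budget, reading "less than" as a strict inequality; the function name max_input_tokens is hypothetical.

```python
def max_input_tokens(max_model_len: int, max_out_len: int) -> int:
    """Largest prompt length that keeps input + output strictly below max_model_len."""
    budget = max_model_len - max_out_len - 1
    if budget < 1:
        raise ValueError("max_out_len leaves no room for any input tokens")
    return budget


if __name__ == "__main__":
    # Values from this guide: --max-model-len 32768 and max_out_len = 16384.
    print(max_input_tokens(32768, 16384))
```

With the values used in this guide the prompt budget is 32768 - 16384 - 1 = 16383 tokens. Setting max_out_len to 32768 against the same server limit would leave no room for input, so the function raises an error in that case.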

Run the Command

# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

Performance Testing

Run the following command inside the container:

vllm bench serve \
--backend vllm \
--dataset-name prefix_repetition \
--prefix-repetition-prefix-len 11200 \
--prefix-repetition-suffix-len 9600 \
--prefix-repetition-output-len 1024 \
--num-prompts 1 \
--prefix-repetition-num-prefixes 1 \
--ignore-eos \
--model glm4.5 \
--tokenizer /tmp/share_weight/GLM-4.5 \
--seed 1000 \
--host 0.0.0.0 \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 1 \
--request-rate 1

If the command runs successfully, the benchmark results are printed.