GLM-5.1 是我们面向智能体工程(agentic engineering)的下一代旗舰模型,其代码能力相比前代显著增强。
本模型是通过msmodelslim量化后的模型,模型地址:https://www.modelscope.cn/models/Eco-Tech/GLM-5.1-w8a8
资源配置:2台16卡910B X86服务器,采用DP=2,TP=16进行部署
| 组件 | 版本 |
|---|---|
| 硬件环境 | 910B(16卡)* 2 |
| cann 驱动 | 25.0.rc1.1 |
本环境采用镜像安装,镜像位置 swr.cn-north-4.myhuaweicloud.com/ascend-sact/ascend-910b-ubuntu:v3.5,对应的软件版本由镜像决定,当前镜像版本中各组件版本如下:
| 组件 | 版本 |
|---|---|
| OS | Ubuntu 24.04 x86_64 |
| Python | 3.11.15 |
| cann | 8.5.1 |
| torch_npu | 2.9.0 |
| vllm_ascend | 0.18.0rc1 |
| triton-ascend | 3.2.0 |
命令:
docker pull swr.cn-north-4.myhuaweicloud.com/ascend-sact/ascend-910b-ubuntu:v3.5 具内容可参考: https://gitcode.com/Ascend-SACT/ascend-docker
命令:
conda activate ascend-infer命令:
modelscope download --model Eco-Tech/GLM-5.1-w8a8 --local_dir .命令:
export PORT=${SERVE_PORT} # 服务端口,可根据情况修改
export LWS_LEADER_ADDRESS="10.244.235.4" # Master节点IP地址,修改为主节点IP地址
export LWS_WORKER_INDEX=0 # 0表示Master节点,1表示Worker节点命令:
export PORT=${SERVE_PORT} # 服务端口,可根据情况修改
export LWS_LEADER_ADDRESS="10.244.235.4" # Master节点IP地址,修改为主节点IP地址
export LWS_WORKER_INDEX=1 # 0表示Master节点,1表示Worker节点命令:
#!/bin/bash
# 自动检测本机IP和网络接口
LOCAL_IP=$(hostname -I | awk '{print $1}')
NIC_NAME=$(route -n 2>/dev/null | grep '^0.0.0.0' | awk '{print $8}' | head -1)
# 如果route命令失败,尝试从/sys/class/net获取
if [ -z "$NIC_NAME" ]; then
NIC_NAME=$(ls /sys/class/net/ | grep -v lo | head -1)
fi
export HCCL_IF_IP=${LOCAL_IP}
export GLOO_SOCKET_IFNAME=${NIC_NAME}
export TP_SOCKET_IFNAME=${NIC_NAME}
export HCCL_SOCKET_IFNAME=${NIC_NAME}
MASTER_IP="${LWS_LEADER_ADDRESS}"
MASTER_PORT=12890
SERVE_PORT=${PORT}
SERVE_NAME="glm-5.1-w8a8"
export LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_MLAPO=1
export ASCEND_TRANSPORT_PRINT=1
# 根据节点角色选择不同的启动命令
if [ "${LWS_WORKER_INDEX}" -eq 0 ]; then
# Master节点(节点0)
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port ${SERVE_PORT} \
--served-model-name ${SERVE_NAME} \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $MASTER_IP \
--data-parallel-rpc-port $MASTER_PORT \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--max-num-seqs 100 \
--max-model-len 202750 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--quantization ascend \
--enable-chunked-prefill \
--enable-prefix-caching \
--async-scheduling \
--enable-auto-tool-choice \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--additional-config '{"rot_path": "rot-path", "enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert": true}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
else
# Worker节点(节点1+)
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port ${SERVE_PORT} \
--served-model-name ${SERVE_NAME} \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank ${LWS_WORKER_INDEX} \
--data-parallel-address ${MASTER_IP} \
--data-parallel-rpc-port ${MASTER_PORT} \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--max-num-seqs 100 \
--max-model-len 202750 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--quantization ascend \
--enable-chunked-prefill \
--enable-prefix-caching \
--async-scheduling \
--enable-auto-tool-choice \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--additional-config '{"rot_path": "rot-path", "enable_npugraph_ex": true, "fuse_muls_add":true,"multistream_overlap_shared_expert": true}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[4, 8, 12, 16,20,24,28, 32]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
fiMaster节点提示如下,表明服务启动成功
(APIServer pid=128059) INFO 02-04 16:48:44 [launcher.py:46] 路由:/v2/rerank,方法:POST
(APIServer pid=128059) INFO 02-04 16:48:44 [launcher.py:46] 路由:/pooling,方法:POST
(APIServer pid=128059) INFO: 服务器进程 [128059] 已启动
(APIServer pid=128059) INFO: 等待应用程序启动。
(APIServer pid=128059) INFO: 应用程序启动完成。
在Master节点上执行如下命令
curl http://localhost:${PORT}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.1-w8a8",
"prompt": "请简要介绍一下人工智能的发展前景",
"max_tokens": 200,
"temperature": 0.7,
"echo": false
}'