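GLM-4.6 is Zhipu's latest flagship model. GLM-4.5 has 355 billion total parameters, of which 32 billion are active. This project provides a deployment guide for 32K-sequence inference with this model on x86 + Ascend NPU servers, based on vllm-ascend.
Quantization reference for GLM4.5/4.6: https://ai.gitcode.com/Ascend-SACT/GLM-4.6-w8a8/blob/main/GLM4.6%E9%87%8F%E5%8C%96%E6%8C%87%E5%AF%BC.md
Notes:
GLM4.5/4.6 has 160 MoE experts. CANN 8.2RC1 does not support this configuration, so the latest CANN 8.3 alpha release is required.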
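The version requirement in the note above can be checked mechanically before an image is pulled. The following is a minimal sketch, not part of CANN or vllm-ascend: `cann_at_least` is a hypothetical helper that compares only the major and minor components of a version string such as the one embedded in the image tag, and ignores suffixes like `rc1` or `alpha003`.

```shell
#!/bin/bash
# cann_at_least VERSION MAJOR MINOR   (hypothetical helper for illustration)
# Returns 0 if VERSION is at least MAJOR.MINOR, 1 if it is older,
# and 2 if VERSION cannot be parsed.
cann_at_least() {
  local version="$1" want_major="$2" want_minor="$3"
  local major minor rest

  # Split "8.3.rc1.alpha003" into major=8, minor=3, rest=rc1.alpha003.
  IFS=. read -r major minor rest <<< "$version"

  # Accept forms such as "8.2RC1" by dropping everything after the leading digits.
  minor="${minor%%[!0-9]*}"

  [[ "$major" =~ ^[0-9]+$ && "$minor" =~ ^[0-9]+$ ]] || return 2

  if (( major != want_major )); then
    (( major > want_major ))
    return
  fi
  (( minor >= want_minor ))
}

# Example (version string taken from the image tag used in this guide):
#   cann_at_least "8.3.rc1.alpha003" 8 3 && echo "CANN version is new enough"
```

Only the major and minor numbers are compared because the ordering of pre-release suffixes is not specified in this guide.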
docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
docker run -itd -u 0 --ipc=host --privileged \
--shm-size=128g \
--name glm4.5 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/model/step3:/model \
-v /opt:/opt \
-p 1025:1025 \
-it quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11 bash
docker exec -it glm4.5 bash
Run the following commands inside the container:
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
##for x86-64
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
apt-get install -y libnet-ifconfig-wrapper-perl
Run the following commands inside the container:
cd /workspace/
git clone https://github.com/vllm-project/vllm.git
cd vllm/
git checkout v0.11.0
If GitHub is not reachable, try downloading through the https://gitproxy.click proxy instead:
cd /workspace/
git clone https://gitproxy.click/https://github.com/vllm-project/vllm.git
cd vllm/
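git checkout v0.11.0
Support for GLM4.6 long sequences has been merged into the vLLM-ascend main branch but is not yet in a tagged release, so check out commit "d0086d432ac0f40ce6ea37e802c2fb364cc43481". Run the following commands inside the container: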
cd /workspace/
git clone https://gitproxy.click/https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend/
git checkout d0086d432ac0f40ce6ea37e802c2fb364cc43481
Run the following commands inside the container:
export VLLM_VERSION=0.11.0
cd /workspace/vllm
pip install -r requirements/build.txt
VLLM_TARGET_DEVICE=empty pip install -v -e .
Run the following commands inside the container:
cd /workspace/vllm-ascend/
pip install -r requirements.txt
pip install -e .
1) Create the inference service startup script /root/infer.sh with the following content:
#!/bin/bash
MODEL_PATH=
SERVICE_PORT=1205
MODEL_NAME=
TENSOR_PARALLEL=8
DATA_PARALLEL=2
MAX_SEQ_LEN=32768
EXTENSION_ARGS=
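while getopts ":p:m:n:t:d:l:e:" opt; do
case $opt in
d)
# -d is accepted for command-line compatibility; DATA_PARALLEL is derived from TENSOR_PARALLEL further down.
echo "Data Parallel requested : $OPTARG (the effective value is derived from tensor parallel)"
;;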
p)
SERVICE_PORT="$OPTARG"
echo "service port set with : $SERVICE_PORT"
;;
m)
MODEL_PATH="$OPTARG"
echo "local model path set with : $MODEL_PATH"
;;
n)
MODEL_NAME="$OPTARG"
echo "service name set with : $MODEL_NAME"
;;
t)
TENSOR_PARALLEL="$OPTARG"
echo "Tensor Parallel set with : $TENSOR_PARALLEL"
;;
e)
EXTENSION_ARGS="$OPTARG"
echo "extension args set with : $EXTENSION_ARGS"
;;
l)
MAX_SEQ_LEN=$OPTARG
echo "MAX SEQ LEN set with : $MAX_SEQ_LEN"
;;
\?)
echo "Error: invalid arg -$OPTARG" >&2
exit 1
;;
:)
echo "Error: arg -$OPTARG needs a value" >&2
exit 1
;;
esac
done
args_error=
if [ "$MODEL_PATH" = "" ]; then
echo "Error: missing required arg '-m <model path>'" >&2
args_error=1
fi
if [ "$SERVICE_PORT" = "" ]; then
echo "Error: missing required arg '-p <service port>'" >&2
args_error=1
fi
if [ "$args_error" != "" ]; then
exit 1
fi
if [ "$MODEL_NAME" = "" ]; then
MODEL_NAME=$(basename "${MODEL_PATH}")
fi
unset HTTP_PROXY
unset HTTPS_PROXY
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver/:$LD_LIBRARY_PATH
# MC2 hierarchical communication
export HCCL_INTRA_PCIE_ENABLE=0
export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_OP_EXPANSION_MODE=AIV
#export ASCEND_RT_VISIBLE_DEVICES=0
export HCCL_BUFFSIZE=2048
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=false
export VLLM_USE_V1=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
timestamp=$(date +"%Y-%m-%d-%H%M%S")
logfile="/root/vllm-ascend_$timestamp.log"
cd /workspace/vllm
# Derive the data parallel size so that tensor parallel x data parallel = 16
DATA_PARALLEL=$((16 / TENSOR_PARALLEL))
if [ $MAX_SEQ_LEN -gt 131072 ]; then
MAX_SEQ_LEN=131072
fi
if [ $MAX_SEQ_LEN -gt 32768 ] && [ $TENSOR_PARALLEL -le 8 ]; then
echo "****"
echo "WARNING: current max-model-len $MAX_SEQ_LEN > 32768, tensor parallel 16 is suggested"
fi
vllm serve "${MODEL_PATH}" \
--seed 1024 \
--max-model-len $MAX_SEQ_LEN \
--max-num-batched-tokens 4096 \
--quantization ascend \
--enable-expert-parallel \
--port ${SERVICE_PORT} \
--served-model-name ${MODEL_NAME} \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--async-scheduling \
--max-num-seqs 64 \
--compilation-config '{"cudagraph_capture_sizes": [1,4,8,16,32,48]}' \
-tp ${TENSOR_PARALLEL} -dp ${DATA_PARALLEL} ${EXTENSION_ARGS} 2>&1 | tee "$logfile"
2) Run the following commands to start the inference service:
chmod +x /root/infer.sh
/root/infer.sh -m <quantized model path> -n GLM4.6 -p <service listen port> -t 8 -d 2 -l 32768
3) After the server has started, run the following command to send an inference request:
curl http://127.0.0.1:1205/v1/chat/completions \
-H "Content-Type: application/json" \
-d \
'{
"model": "GLM4.6",
"max_tokens": 2048, "do_sample": true,
"messages": [
{
"role": "user",
"content": "介绍上海市的气候人文经济"
}
]
}'
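Loading the model can take a while, so a request sent right after launching the script may fail simply because the server is not listening yet. The sketch below shows one way to wait for readiness before sending the request above. `wait_until` and `server_ready` are hypothetical helper names introduced here for illustration. The probe assumes that the OpenAI-compatible server answers `GET /v1/models` once it is serving, and the retry count and interval are arbitrary.

```shell
#!/bin/bash
# wait_until MAX_TRIES SLEEP_SECONDS CMD [ARGS...]   (hypothetical helper)
# Runs CMD repeatedly until it succeeds or MAX_TRIES attempts have failed.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  local max_tries="$1" sleep_seconds="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_tries; attempt++ )); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$sleep_seconds"
  done
  return 1
}

# server_ready PORT   (hypothetical probe)
# Succeeds once the local server on PORT answers the model listing endpoint.
server_ready() {
  curl -sf "http://127.0.0.1:${1}/v1/models"
}

# Example: poll every 5 seconds, up to 120 times, on the default port of infer.sh:
#   wait_until 120 5 server_ready 1205 && echo "server is ready"
```

Keeping the retry loop separate from the probe makes it easy to swap in a different readiness check without touching the loop.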