1. Model Overview and Use Cases

GLM-4.7, your new coding partner, is now available with the following features:

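Core Coding: Compared with its predecessor GLM-4.6, GLM-4.7 brings clear gains in multilingual agentic coding and terminal-based tasks, scoring 73.8% (+5.8%) on SWE-bench, 66.7% (+12.9%) on SWE-bench Multilingual, and 41% (+16.5%) on Terminal Bench 2.0. GLM-4.7 also supports "thinking before acting", with significant improvements on complex tasks in mainstream agent frameworks such as Claude Code, Kilo Code, Cline, and Roo Code.
Vibe Coding: GLM-4.7 takes a major step forward in UI quality. It produces cleaner, more modern web pages and generates better-looking slides with more accurate layout and sizing.
Tool Use: GLM-4.7 is significantly better at tool use, with clearly stronger results on benchmarks such as τ²-Bench and on web browsing tasks measured by BrowseComp.
Complex Reasoning: GLM-4.7 delivers a substantial boost in mathematical and reasoning capabilities, scoring 42.8% (+12.4% over GLM-4.6) on the HLE (Humanity's Last Exam) benchmark. Significant improvements also show up in many other scenarios such as chat, creative writing, and role-play.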

This model has been quantized with msmodelslim. Model repository: https://www.modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8

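2. Preparing the Runtime Environment

2.1 Version Compatibility Table

Hardware versions

Component	Version
Hardware environment	910B (8 cards)
CANN driver	25.0.rc1.1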

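Software versions

This environment is installed from a container image located at swr.cn-north-4.myhuaweicloud.com/ascend-sact/ascend-910b-ubuntu:v2.0. The software versions are determined by the image; the component versions in the current image are as follows:

Component	Version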
OS	Ubuntu 24.04 x86_64
Python	3.11.14
ascend-toolkit	8.3.RC2
torch_npu	2.8.0
sglang	0.5.5.post3
triton-ascend	3.2.0rc4

3. Running Guide

3.1 Pull the image

Command: docker pull swr.cn-north-4.myhuaweicloud.com/ascend-sact/ascend-910b-ubuntu:v2.0

For details, see: https://gitcode.com/Ascend-SACT/ascend-docker

3.2 Start the container

For details, see: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/softwareinst/instg/instg_0020.html?Mode=DockerIns&OS=openEuler&Software=cannToolKit#ZH-CN_TOPIC_0000002118090620

3.3 Switch the conda environment

Command:
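As a rough sketch of what the linked container-start instructions usually amount to on an 8-card host, the helper below generates the per-card device flags. The function name ascend_device_flags, the container name glm47, and the /path/to/models mount are hypothetical; the device nodes and driver mounts follow the common Ascend container layout and should be checked against the linked documentation before use.

```shell
# Hypothetical helper: print one --device flag per NPU card (davinci0 .. davinci<N-1>).
ascend_device_flags() {
    local n=$1 i=0 flags=""
    while [ "$i" -lt "$n" ]; do
        flags="$flags --device=/dev/davinci$i"
        i=$((i + 1))
    done
    # Strip the leading space added by the first loop iteration.
    echo "${flags# }"
}

# Example invocation (assumed layout; adjust names and paths to your host):
# docker run -itd --name glm47 --net=host \
#   $(ascend_device_flags 8) \
#   --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
#   -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
#   -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
#   -v /path/to/models:/models \
#   swr.cn-north-4.myhuaweicloud.com/ascend-sact/ascend-910b-ubuntu:v2.0 bash
```

Generating the flags from a card count keeps the command readable and makes it easy to expose only a subset of cards.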

conda activate ascend-infer

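3.4 Check the triton packages

This model runs on the sglang framework, which requires exactly one triton package to be installed; otherwise startup fails. For details, see: https://gitcode.com/Ascend-SACT/ascend-docker/issues/3

Command: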

pip list|grep triton

If the output shows two triton packages (as below), uninstall both and reinstall triton-ascend:

triton 3.5.0
triton-ascend 3.2.0rc4

Command:

pip uninstall triton triton-ascend -y
pip install triton-ascend==3.2.0rc4

Note: starting the service with triton-ascend 3.2.0 fails; triton-ascend 3.2.0rc4 is required. If pip install triton-ascend installs version 3.2.0, install the required version explicitly with pip install triton-ascend==3.2.0rc4.
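The check-and-reinstall steps above can be wrapped in a small script. This is a minimal sketch; the function names count_triton_packages and fix_triton_if_duplicated are hypothetical, and the counting logic simply matches package names that start with "triton" in pip list output.

```shell
# Count packages whose name starts with "triton" in `pip list` output read from stdin.
count_triton_packages() {
    # grep -c prints the count but exits non-zero on zero matches, so mask the status.
    grep -ci '^triton' || true
}

# Hypothetical helper: reinstall only when more than one triton package is present.
fix_triton_if_duplicated() {
    local n
    n=$(pip list 2>/dev/null | count_triton_packages)
    if [ "$n" -gt 1 ]; then
        pip uninstall triton triton-ascend -y
        pip install triton-ascend==3.2.0rc4
    else
        echo "triton packages found: $n (no reinstall needed)"
    fi
}
```

Running fix_triton_if_duplicated inside the ascend-infer environment performs the reinstall only when it is needed, which makes the step safe to repeat.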

3.5 Download the model

Command:

git lfs install
git clone https://www.modelscope.cn/Eco-Tech/GLM-4.7-W8A8.git
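If git lfs is not set up correctly, the clone can succeed while the weight files remain small Git LFS pointer stubs. The sketch below lists such files; it assumes the weights use the .safetensors extension, and the function names are hypothetical. It relies on the fact that an LFS pointer file begins with the line "version https://git-lfs.github.com/spec/v1".

```shell
# Succeeds if the given file is a Git LFS pointer stub rather than real content.
is_lfs_pointer() {
    # Only the first bytes are inspected; -a makes grep treat binary data as text.
    head -c 64 "$1" 2>/dev/null | grep -aq '^version https://git-lfs\.github\.com/spec/v1'
}

# Print every *.safetensors file under DIR that is still a pointer stub.
# Assumption: the weight shards use the .safetensors extension.
find_lfs_pointers() {
    local dir=$1 f
    find "$dir" -type f -name '*.safetensors' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            echo "$f"
        fi
    done
}
```

For example, find_lfs_pointers GLM-4.7-W8A8 should print nothing after a complete download; any listed file can be fetched by running git lfs pull inside the repository.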

3.6 Start the model service

Set the environment variable

Command:

export HCCL_INTRA_ROCE_ENABLE=1

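Modify the startup script

Edit glm47_boot.sh in the model weights directory: (1) set the correct model directory; (2) change --tool-call-parser glm47 to --tool-call-parser glm45. Then add execute permission: chmod +x glm47_boot.sh

Start the model service with vllm

Command: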
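The parser change and the permission fix can be scripted as sketched below. The function name patch_tool_call_parser is hypothetical, and the sketch assumes GNU sed (as shipped in the Ubuntu image). The model directory still has to be edited by hand, because the way glm47_boot.sh stores that path is not shown here.

```shell
# Hypothetical helper: switch the tool-call parser from glm47 to glm45 in a boot script.
patch_tool_call_parser() {
    local script=$1
    # Keep a backup so the original script can be restored.
    cp "$script" "$script.bak"
    # Accept both "--tool-call-parser glm47" and "--tool-call-parser=glm47".
    sed -i 's/--tool-call-parser[ =]glm47/--tool-call-parser glm45/' "$script"
    chmod +x "$script"
}
```

Usage: patch_tool_call_parser glm47_boot.sh, followed by grep tool-call-parser glm47_boot.sh to confirm the change.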

./glm47_boot.sh

Output like the following indicates that the service started successfully:

(APIServer pid=128059) INFO 02-04 16:48:44 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=128059) INFO 02-04 16:48:44 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=128059) INFO: Started server process [128059]
(APIServer pid=128059) INFO: Waiting for application startup.
(APIServer pid=128059) INFO: Application startup complete.

Start the model service with sglang

Command:

export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_DETERMINISTIC="true"
export HCCL_OP_EXPANSION_MODE=AIV
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
export SGLANG_BACKEND=rpc
export SGLANG_DEVICE=npu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export VLLM_CUDAGRAPH_MODE="FULL_DECODE_ONLY"
export VLLM_CUDAGRAPH_CAPTURE_SIZES="[1,4,8,16,32,48,64]"

python -m sglang.launch_server \
    --model ${MODEL_PATH}/GLM-4.7-W8A8 \
    --context-length 131072 \
    --quantization w8a8_int8 \
    --port 8321 \
    --served-model-name GLM-4.7-w8a8 \
    --trust-remote-code \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --tensor-parallel-size 8

The script above can be saved to a shell file and executed.

Verify the model service

curl http://localhost:8321/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.7-w8a8", 
    "prompt": "请简要介绍一下人工智能的发展前景",
    "max_tokens": 200, 
    "temperature": 0.7,
    "echo": false 
  }'
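Loading the weights across eight cards can take a while, so a request sent right after launch may fail simply because the server is not up yet. The sketch below retries an arbitrary probe command until it succeeds. The function name wait_until_ready is hypothetical, and the /v1/models path in the usage comment is an assumption based on the OpenAI-compatible API; any endpoint that returns success once the server is up would work.

```shell
# Retry a probe command until it succeeds or the attempt budget is exhausted.
# Usage: wait_until_ready <max_attempts> <delay_seconds> <command> [args...]
wait_until_ready() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "not ready after $attempts attempt(s)" >&2
    return 1
}

# Example (assumed endpoint): poll every 10 seconds for up to 10 minutes.
# wait_until_ready 60 10 curl -sf http://localhost:8321/v1/models
```

Passing the probe as a command keeps the helper independent of the port and endpoint, so the same function works for both the vllm and sglang launch paths.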

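4. Notes

1. The default service port is 8016 (the sglang example above uses --port 8321). If the port is already in use, change it to a free port.

2. Use npu-smi info to check whether any NPUs are occupied and to determine the NPU IDs, then update the ASCEND_RT_VISIBLE_DEVICES environment variable in glm47_boot.sh accordingly.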
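A port conflict can be detected before launching the service. The sketch below parses ss -ltn style output from stdin and reports whether a given port has a listener. The function name port_in_use is hypothetical, and the parsing assumes the usual ss column layout, with the local address in the fourth column.

```shell
# Succeeds if `ss -ltn` style output on stdin contains a listener on the given port.
port_in_use() {
    awk -v port="$1" '
        NR > 1 {
            # The fourth column is Local Address:Port; the port follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Example: warn before starting the service on the default port.
# ss -ltn | port_in_use 8016 && echo "port 8016 is already in use"
```

Splitting on the last colon lets the same check handle both IPv4 listeners such as 0.0.0.0:8016 and IPv6 listeners such as [::]:8016.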