GLM-4.7 Model Migration and Deployment Guide for Ascend
Authors: 张伟、杨蕾蕾、黄启航、杨覃娟、刘芮金、谢艺言
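GLM-4.7 is an open-source Mixture-of-Experts (MoE) large language model jointly developed by Zhipu AI and Tsinghua University. It is designed for agents and aims to unify three core capabilities (reasoning, coding, and agentic interaction) while improving both performance and efficiency. It is particularly suited to high-throughput, low-latency distributed scenarios such as intelligent operations, fault diagnosis, and automated agent applications.
Compared with `GLM-4.6`, `GLM-4.7` brings several key improvements:
1. Core coding: compared with its predecessor `GLM-4.6`, `GLM-4.7` improves markedly on multilingual agentic coding and terminal tasks, reaching 73.8% on SWE-bench (+5.8%), 66.7% on SWE-bench Multilingual (+12.9%), and 41% on Terminal Bench 2.0 (+16.5%). `GLM-4.7` also supports thinking before acting, and performs considerably better on complex tasks in mainstream agent frameworks such as Claude Code, Kilo Code, Cline, and Roo Code.
2. Vibe coding: `GLM-4.7` takes a major step forward in UI quality. It generates cleaner, more modern web pages and better-looking slides with more accurate layout and sizing.
3. Tool use: `GLM-4.7` improves significantly at tool use, with clearly better results on benchmarks such as τ²-Bench and on web-browsing tasks evaluated with BrowseComp.
4. Complex reasoning: `GLM-4.7` improves substantially in mathematics and reasoning, scoring 42.8% on the HLE (Humanity's Last Exam) benchmark, +12.4% over `GLM-4.6`.
Performance also improves noticeably in chat, creative writing, role-play, and other scenarios.
Model name: GLM-4.7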
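Download links:
GLM-4.7 Hugging Face download link
GLM-4.7 ModelScope download link
Core code repositories:
vllm v0.13.0
vllm_ascend v0.13.0rc1
NPU driver and firmware: 25.2.1
Hardware: Atlas 800T A2 (8 × 64 GB)
NPU type: 910B2
Deployment mode: two nodes, 16 NPUs
Table 1 Version compatibility
| Component | Version | Environment setup guide |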
|---|---|---|
| Python | >= 3.10, < 3.12 | - |
| CANN | 8.3.rc2 | - |
| torch | 2.8.0 | - |
| torch_npu | 2.8.0 | - |
| vllm | 0.13.0 | - |
| vllm_ascend | 0.13.0rc1 | - |
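The pins in Table 1 can be checked from inside the container before anything is launched. The following is a minimal sketch, not part of the original procedure: the distribution names passed to `importlib.metadata` are assumed to match how the packages are registered with pip in your environment, and CANN is not covered because it is not a Python package.

```python
import sys
from importlib import metadata

# Pins copied from Table 1. The distribution names are assumptions about
# how each package is registered with pip in your environment.
EXPECTED = {
    "torch": "2.8.0",
    "torch_npu": "2.8.0",
    "vllm": "0.13.0",
    "vllm_ascend": "0.13.0rc1",
}


def python_in_range(version=None):
    """Return True when the interpreter satisfies >= 3.10, < 3.12."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 12)


def check_packages(expected=EXPECTED, lookup=metadata.version):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore a local build suffix such as "+cpu" when comparing.
        matches = installed is not None and installed.split("+")[0] == pin
        report[name] = (installed, matches)
    return report


if __name__ == "__main__":
    print("python ok:", python_in_range())
    for name, (installed, ok) in check_packages().items():
        print(f"{name}: installed={installed} pinned={EXPECTED[name]} ok={ok}")
```

A package that is missing shows up as `installed=None`, so the same script doubles as a quick check that the right container is in use.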
Image: quay.io/ascend/vllm-ascend:v0.13.0rc1
Image site: vllm_ascend image site
This is the official vllm_ascend image. Pull it with:
```
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
```
## 3. Running Guide
### 3.1 Model Weights Download
Weights download link: https://www.modelscope.cn/models/ZhipuAI/GLM-4.7
1. Model download script: `modelscope_glm_47.py`
```
# Log in with a ModelScope access token (replace the placeholder with your own token)
from modelscope.hub.api import HubApi
api = HubApi()
api.login('<your-modelscope-token>')
# Download the model
from modelscope import snapshot_download
# Adjust cache_dir and local_dir to match your environment
model_dir = snapshot_download('ZhipuAI/GLM-4.7', cache_dir='/opt/data/verification/models/.cache', local_dir='/opt/data/verification/models/GLM-4.7')
```
2. Run the script to download the weights
```
python modelscope_glm_47.py
```
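A download of this size can be interrupted, so it may help to confirm that every weight shard arrived before starting the service. The sketch below assumes the repository uses the common sharded-safetensors layout with a `model.safetensors.index.json` index file; that layout is an assumption and is not stated in this guide, and `missing_shards` is a hypothetical helper name.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard files named in the index but absent from model_dir, sorted."""
    model_dir = Path(model_dir)
    # Assumed layout: a sharded-safetensors index next to the shard files.
    index_path = model_dir / "model.safetensors.index.json"
    # "weight_map" maps each tensor name to the shard file that stores it.
    weight_map = json.loads(index_path.read_text(encoding="utf-8"))["weight_map"]
    shards = set(weight_map.values())
    return sorted(name for name in shards if not (model_dir / name).is_file())
```

Calling `missing_shards('/opt/data/verification/models/GLM-4.7')` with the `local_dir` from the download script should return an empty list once the download is complete. Any names it returns are shards worth re-downloading.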
### 3.2 Image Download
1. Pull the `quay.io/ascend/vllm-ascend:v0.13.0rc1` image
```
# Pull the image
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
# List local images
docker images
```
2. Start the container (save the following script as `docker_run.sh`)
```
#!/bin/sh
NAME=$1 # container name (user-defined)
IMAGE=$2 # image ID (from the docker images output above)
docker run \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /opt/data:/root/.cache \
-it $IMAGE bash
```
3. Run the script to start and enter the container
```
bash docker_run.sh vllm-ascend-v0.13.0rc1 524b38df7811
```
From outside the container (for example, in another terminal), list the running containers:
```
docker ps
```
From outside the container (for example, in another terminal), enter the container:
```
docker exec -it vllm-ascend-v0.13.0rc1 /bin/bash
```
### 3.3 Start the Inference Service
The deployment spans two nodes. The launch scripts are as follows:
1. Primary node
Launch script: `glm_47_infer_node0.sh`
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
nic_name="xxx" # NIC name bound to node0's service-plane IP (check with ifconfig)
local_ip="xx.xx.xx.xx" # node0's service-plane IP (check with ifconfig)
model_path="xxx" # path to the model weights
export HCCL_IF_IP=${local_ip}
export GLOO_SOCKET_IFNAME=${nic_name}
export TP_SOCKET_IFNAME=${nic_name}
export HCCL_SOCKET_IFNAME=${nic_name}
export OMP_PROC_BIND=false
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export OMP_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve ${model_path} \
--host 0.0.0.0 \
--port 8080 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address ${local_ip} \
--data-parallel-rpc-port 8082 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm47 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 8 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--additional-config '{"enable_multistream_moe":false, "chunked_prefill_for_mla":true, "ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}'
```
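The `--additional-config` value is a JSON string embedded in a shell command, where a stray quote or a Python-style `True` is easy to introduce by hand. One way to avoid that, sketched here under the assumption that you generate the flag with a small helper (`additional_config_flag` is a hypothetical name), is to build it from a dictionary:

```python
import json
import shlex

# Values copied from the launch script above.
additional_config = {
    "enable_multistream_moe": False,
    "chunked_prefill_for_mla": True,
    "ascend_scheduler_config": {"enabled": True},
    "enable_weight_nz_layout": True,
}


def additional_config_flag(config):
    """Render the flag with compact JSON, quoted safely for a POSIX shell."""
    payload = json.dumps(config, separators=(",", ":"))
    return "--additional-config " + shlex.quote(payload)


print(additional_config_flag(additional_config))
```

`json.dumps` emits lowercase `true`/`false`, and `shlex.quote` wraps the payload in single quotes, so the printed text can be pasted into either node's script.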
2. Secondary node
Launch script: `glm_47_infer_node1.sh`
```
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
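nic_name="xxx" # NIC name bound to node1's service-plane IP (check with ifconfig)
local_ip="xx.xx.xx.xx" # node1's service-plane IP (check with ifconfig)
model_path="xxx" # path to the model weights
node0_ip="xx.xx.xx.xx" # node0's service-plane IP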
export HCCL_IF_IP=${local_ip}
export GLOO_SOCKET_IFNAME=${nic_name}
export TP_SOCKET_IFNAME=${nic_name}
export HCCL_SOCKET_IFNAME=${nic_name}
export OMP_PROC_BIND=false
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export OMP_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve ${model_path} \
--host 0.0.0.0 \
--port 8080 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--headless \
--data-parallel-address ${node0_ip} \
--data-parallel-rpc-port 8082 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm47 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 8 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--additional-config '{"enable_multistream_moe":false,"chunked_prefill_for_mla":true,"ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}'
```
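The two scripts must describe one consistent layout: data parallelism of 2 across the nodes, tensor parallelism of 8 within each node, 16 NPUs in total. The sketch below checks that arithmetic. `NodeLaunch` and `check_layout` are hypothetical names, and treating node0's omitted `--data-parallel-start-rank` as 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLaunch:
    dp_size: int          # --data-parallel-size
    dp_size_local: int    # --data-parallel-size-local
    dp_start_rank: int    # --data-parallel-start-rank (assumed 0 when omitted)
    tp_size: int          # --tensor-parallel-size
    visible_devices: int  # number of IDs in ASCEND_RT_VISIBLE_DEVICES


def check_layout(nodes):
    """Validate a multi-node DP x TP layout and return the total NPU count."""
    dp_size, tp_size = nodes[0].dp_size, nodes[0].tp_size
    covered = []
    for node in nodes:
        if (node.dp_size, node.tp_size) != (dp_size, tp_size):
            raise ValueError("all nodes must agree on DP and TP sizes")
        if node.dp_size_local * node.tp_size > node.visible_devices:
            raise ValueError("a node needs more NPUs than it exposes")
        first = node.dp_start_rank
        covered.extend(range(first, first + node.dp_size_local))
    # Every DP rank must be hosted exactly once across the nodes.
    if sorted(covered) != list(range(dp_size)):
        raise ValueError(f"DP ranks {sorted(covered)} do not cover 0..{dp_size - 1}")
    return dp_size * tp_size


node0 = NodeLaunch(dp_size=2, dp_size_local=1, dp_start_rank=0, tp_size=8, visible_devices=8)
node1 = NodeLaunch(dp_size=2, dp_size_local=1, dp_start_rank=1, tp_size=8, visible_devices=8)
print(check_layout([node0, node1]))  # 16
```

If node1 were launched without `--data-parallel-start-rank 1`, both nodes would claim rank 0 and the check would raise, which mirrors the kind of misconfiguration that otherwise only surfaces at startup.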
3. Tool-call patch
Adding the following arguments to the launch command enables tool-call parsing:
```
--enable-auto-tool-choice \
--tool-call-parser glm47 \
```
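Enabling tool-call parsing requires the following changes.
Use `pip show vllm` to locate the `vllm` installation, go to the `/vllm-workspace/vllm/vllm/tool_parsers` directory, and do the following:
* Create a file named `glm47_moe_tool_parser.py` with the following code: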
```
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import regex as re

from vllm.logger import init_logger
from vllm.tokenizers import TokenizerLike
from vllm.tool_parsers.glm4_moe_tool_parser import Glm4MoeModelToolParser

logger = init_logger(__name__)


class Glm47MoeModelToolParser(Glm4MoeModelToolParser):
    def __init__(self, tokenizer: TokenizerLike):
        super().__init__(tokenizer)
        self.func_detail_regex = re.compile(
            r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
        )
        self.func_arg_regex = re.compile(
            r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
            re.DOTALL,
        )
```
* In `__init__.py`, add the following entry:
```
    "glm47": (
        "glm47_moe_tool_parser",
        "Glm47MoeModelToolParser",
    ),
```
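To see what the two regular expressions in `Glm47MoeModelToolParser` capture, they can be run on their own. This is only an illustration: it uses the standard-library `re` module instead of the `regex` package, the sample text is made up, and `parse_tool_calls` is a hypothetical helper that leaves out the rest of the vLLM parser's logic.

```python
import re  # the parser itself imports the third-party `regex` package under this name

func_detail_regex = re.compile(
    r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
)
func_arg_regex = re.compile(
    r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
    re.DOTALL,
)


def parse_tool_calls(text):
    """Return [(function_name, {arg_key: arg_value}), ...] found in text."""
    calls = []
    for name, arg_block in func_detail_regex.findall(text):
        # arg_block is "" when the call carries no arguments.
        pairs = func_arg_regex.findall(arg_block)
        calls.append((name.strip(), {k.strip(): v.strip() for k, v in pairs}))
    return calls


# Hypothetical model output, made up for illustration.
sample = (
    "<tool_call>get_weather<arg_key>city</arg_key>\n"
    "<arg_value>Beijing</arg_value></tool_call>"
    "<tool_call>ping</tool_call>"
)
print(parse_tool_calls(sample))
# [('get_weather', {'city': 'Beijing'}), ('ping', {})]
```

The first pattern accepts a function name that is followed directly by `<arg_key>` with no separator, and also a call with no arguments at all. The second tolerates whitespace or a literal `\n` between a key and its value.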

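4. Reasoning-parser patch
Adding the following argument to the launch command enables reasoning parsing; the reasoning text is returned in the `reasoning_content` field: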
```
--reasoning-parser glm45 \
```
Use `pip show vllm` to locate the `vllm` installation, go to the `/vllm-workspace/vllm/vllm/reasoning` directory, and do the following:
* Patch `glm4_moe_reasoning_parser.py`
The patch command is:
```
patch glm4_moe_reasoning_parser.py < glm4_moe_reasoning_parser.patch
```
The patch file `glm4_moe_reasoning_parser.patch` is as follows:
```
--- glm4_moe_reasoning_parser.py 2026-01-12 10:53:02.113178000 +0800
+++ new_glm4_moe_reasoning_parser.py 2026-01-12 10:53:09.998648500 +0800
@@ -1,171 +1,13 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
-from collections.abc import Sequence
+from vllm.reasoning.holo2_reasoning_parser import Holo2ReasoningParser
 
-from transformers import PreTrainedTokenizerBase
 
-from vllm.entrypoints.openai.protocol import ChatCompletionRequest, DeltaMessage
-from vllm.logger import init_logger
-from vllm.reasoning import ReasoningParser
-
-logger = init_logger(__name__)
-
-
-class Glm4MoeModelReasoningParser(ReasoningParser):
+class Glm4MoeModelReasoningParser(Holo2ReasoningParser):
     """
-    Reasoning parser for the Glm4MoeModel model.
-
-    The Glm4MoeModel model uses <think>...</think> tokens to denote reasoning
-    text within its output. The model provides a strict switch to disable
-    reasoning output via the 'enable_thinking=False' parameter. This parser
-    extracts the reasoning content enclosed by <think> and </think> tokens
-    from the model's output.
+    Reasoning parser for the Glm4MoeModel model,which inherits from
+    `Holo2ReasoningParser`.
     """
 
-    def __init__(self, tokenizer: PreTrainedTokenizerBase, *args, **kwargs):
-        super().__init__(tokenizer, *args, **kwargs)
-        self.think_start_token = "<think>"
-        self.think_end_token = "</think>"
-        self.assistant_token = "<|assistant|>"
-
-        if not self.model_tokenizer:
-            raise ValueError(
-                "The model tokenizer must be passed to the ReasoningParser "
-                "constructor during construction."
-            )
-
-        self.think_start_token_id = self.vocab.get(self.think_start_token)
-        self.think_end_token_id = self.vocab.get(self.think_end_token)
-        self.assistant_token_id = self.vocab.get(self.assistant_token)
-        if (
-            self.think_start_token_id is None
-            or self.think_end_token_id is None
-            or self.assistant_token_id is None
-        ):
-            raise RuntimeError(
-                "Glm4MoeModel reasoning parser could not locate "
-                "think start/end or assistant tokens in the tokenizer!"
-            )
-
-    def is_reasoning_end(self, input_ids: list[int]) -> bool:
-        """
-        GLM's chat template has <think></think> tokens after every
-        <|assistant|> token. Thus, we need to check if </think> is
-        after the most recent <|assistant|> token (if present).
-        """
-        for token_id in input_ids[::-1]:
-            if token_id == self.think_end_token_id:
-                return True
-            elif token_id == self.assistant_token_id:
-                return False
-        return False
-
-    def extract_content_ids(self, input_ids: list[int]) -> list[int]:
-        """
-        Extract the content after the end tokens
-        """
-        if self.think_end_token_id not in input_ids[:-1]:
-            return []
-        else:
-            return input_ids[input_ids.index(self.think_end_token_id) + 1 :]
-
-    def extract_reasoning_streaming(
-        self,
-        previous_text: str,
-        current_text: str,
-        delta_text: str,
-        previous_token_ids: Sequence[int],
-        current_token_ids: Sequence[int],
-        delta_token_ids: Sequence[int],
-    ) -> DeltaMessage | None:
-        """
-        Extract reasoning content from a delta message.
-        Handles streaming output where previous + delta = current.
-        Uses token IDs for faster processing.
-        For text <think>abc</think>xyz:
-        - 'abc' goes to reasoning
-        - 'xyz' goes to content
-        """
-        # Skip single special tokens
-        if len(delta_token_ids) == 1 and (
-            delta_token_ids[0] in [self.think_start_token_id, self.think_end_token_id]
-        ):
-            return None
-
-        if self.think_start_token_id in previous_token_ids:
-            if self.think_end_token_id in delta_token_ids:
-                # <think> in previous, </think> in delta,
-                # extract reasoning content
-                end_index = delta_text.find(self.think_end_token)
-                reasoning = delta_text[:end_index]
-                content = delta_text[end_index + len(self.think_end_token) :]
-                return DeltaMessage(
-                    reasoning=reasoning,
-                    content=content if content else None,
-                )
-            elif self.think_end_token_id in previous_token_ids:
-                # <think> in previous, </think> in previous,
-                # reasoning content continues
-                return DeltaMessage(content=delta_text)
-            else:
-                # <think> in previous, no </think> in previous or delta,
-                # reasoning content continues
-                return DeltaMessage(reasoning=delta_text)
-        elif self.think_start_token_id in delta_token_ids:
-            if self.think_end_token_id in delta_token_ids:
-                # <think> in delta, </think> in delta, extract reasoning content
-                start_index = delta_text.find(self.think_start_token)
-                end_index = delta_text.find(self.think_end_token)
-                reasoning = delta_text[
-                    start_index + len(self.think_start_token) : end_index
-                ]
-                content = delta_text[end_index + len(self.think_end_token) :]
-                return DeltaMessage(
-                    reasoning=reasoning,
-                    content=content if content else None,
-                )
-            else:
-                # <think> in delta, no </think> in delta,
-                # reasoning content continues
-                return DeltaMessage(reasoning=delta_text)
-        else:
-            # thinking is disabled, just content
-            return DeltaMessage(content=delta_text)
-
-    def extract_reasoning(
-        self, model_output: str, request: ChatCompletionRequest
-    ) -> tuple[str | None, str | None]:
-        """
-        Extract reasoning content from the model output.
-
-        For text <think>abc</think>xyz:
-        - 'abc' goes to reasoning
-        - 'xyz' goes to content
-
-        Returns:
-            tuple[Optional[str], Optional[str]]: reasoning content and content
-        """
-
-        # Check if the model output contains the <think> and </think> tokens.
-        if (
-            self.think_start_token not in model_output
-            or self.think_end_token not in model_output
-        ):
-            return None, model_output
-        # Check if the <think> is present in the model output, remove it
-        # if it is present.
-        model_output_parts = model_output.partition(self.think_start_token)
-        model_output = (
-            model_output_parts[2] if model_output_parts[1] else model_output_parts[0]
-        )
-        # Check if the model output contains the </think> tokens.
-        # If the end token is not found, return the model output as is.
-        if self.think_end_token not in model_output:
-            return None, model_output
-
-        # Extract reasoning content from the model output.
-        reasoning, _, content = model_output.partition(self.think_end_token)
-
-        final_content = content or None
-        return reasoning, final_content
+    pass
```
* Patch `holo2_reasoning_parser.py`
The patch command is:
```
patch holo2_reasoning_parser.py < holo2_reasoning_parser.patch
```
The patch file `holo2_reasoning_parser.patch` is as follows:
```
--- holo2_reasoning_parser.py 2026-01-12 10:25:29.213407100 +0800
+++ new_holo2_reasoning_parser.py 2026-01-12 10:24:45.961472400 +0800
@@ -46,9 +46,10 @@
         # all requests in the structured output manager. So it is important that without
         # user specified chat template args, the default thinking is True.
 
-        enable_thinking = bool(chat_kwargs.get("thinking", True))
-
-        if enable_thinking:
+        thinking = bool(chat_kwargs.get("thinking", True))
+        enable_thinking = bool(chat_kwargs.get("enable_thinking", True))
+        thinking = thinking and enable_thinking
+        if thinking:
             self._parser = DeepSeekR1ReasoningParser(tokenizer, *args, **kwargs)
         else:
             self._parser = IdentityReasoningParser(tokenizer, *args, **kwargs)
```
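The effect of the second patch is that reasoning parsing stays on only when neither `thinking` nor `enable_thinking` is set to false in the chat template arguments. The standalone sketch below reproduces just that condition; `resolve_thinking` is a hypothetical name, and the parser class names are printed as strings, not imported.

```python
def resolve_thinking(chat_kwargs):
    """Mirror the patched condition: both flags default to True and are AND-ed."""
    chat_kwargs = chat_kwargs or {}
    thinking = bool(chat_kwargs.get("thinking", True))
    enable_thinking = bool(chat_kwargs.get("enable_thinking", True))
    return thinking and enable_thinking


cases = (
    {},
    {"thinking": False},
    {"enable_thinking": False},
    {"thinking": True, "enable_thinking": True},
)
for kwargs in cases:
    chosen = "DeepSeekR1ReasoningParser" if resolve_thinking(kwargs) else "IdentityReasoningParser"
    print(kwargs, "->", chosen)
```

Before the patch only the `thinking` key was consulted, so a request that set `enable_thinking` to false would still have been routed to the reasoning parser.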
5. Start the service
```
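# Run inside the node0 container
bash glm_47_infer_node0.sh
# Run inside the node1 container
bash glm_47_infer_node1.sh
# [Test phase] To keep an abnormal interruption from disrupting long test runs, run the service in the background
# Run inside the node0 container (see the execution screenshots below)
nohup bash glm_47_infer_node0.sh > glm_47_infer_node0_log.txt 2>&1 &
tail -f glm_47_infer_node0_log.txt
# Run inside the node1 container (see the execution screenshots below)
nohup bash glm_47_infer_node1.sh > glm_47_infer_node1_log.txt 2>&1 &
tail -f glm_47_infer_node1_log.txt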
```
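Loading the weights on two nodes takes a while, so watching the log by hand can be replaced with a small readiness poll. The sketch below assumes the server's `/health` route on port 8080 of node0. The deadline and interval values are arbitrary placeholders, and `probe_health` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.request


def probe_health(url, timeout=5.0):
    """Return True when GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(url, deadline_s=1800.0, interval_s=10.0,
                     probe=probe_health, sleep=time.sleep):
    """Poll until the probe succeeds; return False if the deadline passes first."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

Calling `wait_until_ready("http://localhost:8080/health")` on node0 returns `True` once the service answers. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.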
6. Inference verification
The request is as follows:
```
# Inference request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d \
  '{
    "model": "glm47",
    "max_tokens": 10,
    "messages": [
      {
        "role": "user",
        "content": "Please introduce yourself."
      }
    ]
  }'
```
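The same request can be sent from Python, which makes it easier to toggle thinking and to separate the reasoning text from the answer. This is a sketch using only the standard library. The helper names are hypothetical, the `chat_template_kwargs` field is the vLLM request extension that the reasoning patch reads, and the reasoning field is looked up under two names because which one is populated may depend on the vLLM version.

```python
import json
import urllib.request


def build_chat_request(prompt, model="glm47", max_tokens=10, enable_thinking=None):
    """Build the JSON body for /v1/chat/completions."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    if enable_thinking is not None:
        # Passed through to the chat template and to the reasoning parser.
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return body


def split_answer(response):
    """Return (reasoning, content) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    # Read the reasoning field defensively under both names.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return reasoning, message.get("content")


def chat(prompt, base_url="http://localhost:8080", **kwargs):
    """POST the request and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

For example, `split_answer(chat("Please introduce yourself.", max_tokens=256, enable_thinking=False))` should return no reasoning text and the answer as content, assuming the service from the previous step is running.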
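## 4. FAQ
1. On coding and other complex tasks, GLM-4.7 occasionally produces repeated output: it gets stuck in a thinking loop or repeats large blocks of text. The problem has been reproduced on the official API, the bf16 model, and the w8a8 model, with the w8a8 model reproducing it more often. The current assessment is that this is a model capability anomaly introduced during post-training.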