# `GLM-4.7` Model Migration and Deployment Guide for Ascend

Authors

张伟、杨蕾蕾、黄启航、杨覃娟、刘芮金、谢艺言

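## 1. Model Overview and Scenarios

### 1.1 Basic Model Information

GLM-4.7 is an open-source Mixture-of-Experts (MoE) large language model jointly developed by Zhipu AI and Tsinghua University. It is designed for agents and aims to unify three core capabilities: reasoning, coding, and agentic interaction. It targets gains in both performance and efficiency and is particularly suited to high-throughput, low-latency distributed scenarios such as intelligent operations, fault diagnosis, and automated agent applications.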

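Compared with `GLM-4.6`, `GLM-4.7` brings several key improvements:

1. Core coding: `GLM-4.7` improves markedly over its predecessor `GLM-4.6` on multilingual agentic coding and terminal tasks, scoring 73.8% on SWE-bench (+5.8%), 66.7% on SWE-bench Multilingual (+12.9%), and 41% on Terminal Bench 2.0 (+16.5%). `GLM-4.7` also supports "think before acting", which noticeably strengthens its performance on complex tasks in mainstream agent frameworks such as Claude Code, Kilo Code, Cline, and Roo Code.
2. Vibe coding: `GLM-4.7` takes a large step forward in UI quality. It generates cleaner, more modern web pages and produces better-looking slides with more accurate layout and sizing.
3. Tool use: `GLM-4.7` makes significant progress in tool use, performing clearly better on benchmarks such as τ²-Bench and on web-browsing tasks evaluated with BrowseComp.
4. Complex reasoning: `GLM-4.7` improves substantially in mathematics and reasoning, scoring 42.8% on the HLE (Humanity's Last Exam) benchmark, +12.4% over `GLM-4.6`.

Notable improvements are also visible in chat, creative writing, role-play, and other scenarios.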

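### 1.2 Model Weights

Model name: GLM-4.7

Download locations:

* GLM-4.7 Hugging Face download page
* GLM-4.7 ModelScope download page

Core code repositories:

* vllm v0.13.0
* vllm_ascend v0.13.0rc1

### 1.3 Usage Scenario: Inference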

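## 2. Preparing the Runtime Environment

### 2.1 Environment Preparation

* NPU driver and firmware: 25.2.1
* Hardware: Atlas 800T A2 (8 × 64 GB)
* NPU type: 910B2
* Deployment mode: two nodes, 16 NPUs

Table 1. Version compatibility

| Component | Version | Environment setup guide |
| --- | --- | --- |
| Python | >= 3.10, < 3.12 | - |
| CANN | 8.3.rc2 | - |
| torch | 2.8.0 | - |
| torch_npu | 2.8.0 | - |
| vllm | 0.13.0 | - |
| vllm_ascend | 0.13.0rc1 | - |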
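
The version constraints in Table 1 can be checked before a deployment is started. The following is a minimal sketch, not part of the original procedure: the helper names `python_version_ok` and `installed_versions` are hypothetical, and the distribution names passed to `importlib.metadata` are assumed to match how the packages are registered in your environment.

```python
import sys
from importlib import metadata

# Python range from Table 1: >= 3.10 and < 3.12.
MIN_PY = (3, 10)
MAX_PY_EXCLUSIVE = (3, 12)

# Distribution names are assumptions; adjust them if your environment differs.
PACKAGES = ("torch", "torch_npu", "vllm", "vllm_ascend")


def python_version_ok(version=None):
    """Return True if (major, minor) lies inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PY <= major_minor < MAX_PY_EXCLUSIVE


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Calling `installed_versions()` inside the container and comparing the result with Table 1 by eye is usually enough; the sketch deliberately does not parse version strings.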

### 2.2 Image

* Image: quay.io/ascend/vllm-ascend:v0.13.0rc1
* Image site: vllm_ascend image site

This is the official vllm_ascend image. Pull it with the following command:

```
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
```

## 3. Running Guide
### 3.1 Downloading the Model Weights
Weights download link: https://www.modelscope.cn/models/ZhipuAI/GLM-4.7

1. Model download script: modelscope_glm_47.py

    ```
    # Log in with your ModelScope access token
    from modelscope.hub.api import HubApi
    api = HubApi()
    api.login('<your-modelscope-access-token>')  # replace with your own token; do not commit real tokens

    # Download the model
    from modelscope import snapshot_download
    # Adjust cache_dir and local_dir to match your environment
    model_dir = snapshot_download('ZhipuAI/GLM-4.7', cache_dir='/opt/data/verification/models/.cache', local_dir='/opt/data/verification/models/GLM-4.7')
    ```
2. Run the script to download the weights

    ```
    python modelscope_glm_47.py
    ```
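
After the download finishes, it can help to confirm that every weight shard arrived. The sketch below assumes the weights use the common Hugging Face sharded layout, where `model.safetensors.index.json` holds a `weight_map` from tensor names to shard files; that layout is an assumption about this checkpoint, and `missing_shards` is a hypothetical helper name.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard file names listed in the index but absent on disk."""
    model_path = Path(model_dir)
    index_file = model_path / "model.safetensors.index.json"
    if not index_file.is_file():
        raise FileNotFoundError(f"index file not found: {index_file}")
    with index_file.open(encoding="utf-8") as fh:
        weight_map = json.load(fh)["weight_map"]
    # Many tensors share a shard, so de-duplicate before checking the disk.
    shard_names = sorted(set(weight_map.values()))
    return [name for name in shard_names if not (model_path / name).is_file()]
```

For example, `missing_shards('/opt/data/verification/models/GLM-4.7')` returns an empty list when all shards named in the index are present.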

### 3.2 Pulling the Image
1. Pull the quay.io/ascend/vllm-ascend:v0.13.0rc1 image

    ```
    # Pull the image
    docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

    # List local images
    docker images
    ```
2. Start the container (save the following script as docker_run.sh)

    ```
    #!/bin/sh
    NAME=$1                             # Container name (user-defined)
    IMAGE=$2                            # Image ID (from the `docker images` output above)

    docker run \
    --name $NAME \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /opt/data:/root/.cache \
    -it $IMAGE bash
    ```
3. Run the script to start and enter the container (arguments: container name, then image ID)

    ```
    bash docker_run.sh vllm-ascend-v0.13.0rc1 524b38df7811
    ```
    From outside the container (for example, in another terminal), list the running Docker containers:

    ```
    docker ps
    ```
    From outside the container (for example, in another terminal), enter the container:

    ```
    docker exec -it vllm-ascend-v0.13.0rc1 /bin/bash
    ```
### 3.3 Starting the Inference Service
The deployment uses two nodes. The launch scripts are as follows:

1. Master node
Launch script: glm_47_infer_node0.sh

    ```
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

    nic_name="xxx"              # Use ifconfig to find the NIC name bound to node0's service-plane IP
    local_ip="xx.xx.xx.xx"     # Use ifconfig to find node0's service-plane IP
    model_path="xxx"           # Path to the model weights

    export HCCL_IF_IP=${local_ip}
    export GLOO_SOCKET_IFNAME=${nic_name}
    export TP_SOCKET_IFNAME=${nic_name}
    export HCCL_SOCKET_IFNAME=${nic_name}
    export OMP_PROC_BIND=false
    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export OMP_NUM_THREADS=1
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export VLLM_USE_V1=1
    export HCCL_BUFFSIZE=1024
    export HCCL_OP_EXPANSION_MODE=AIV

    vllm serve ${model_path} \
    --host 0.0.0.0 \
    --port 8080 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-address ${local_ip} \
    --data-parallel-rpc-port 8082 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm47 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --additional-config '{"enable_multistream_moe":false, "chunked_prefill_for_mla":true, "ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}' 
    ```
2. Worker node
Launch script: glm_47_infer_node1.sh

    ```
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

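    nic_name="xxx"              # Use ifconfig to find the NIC name bound to node1's service-plane IP
    local_ip="xx.xx.xx.xx"     # Use ifconfig to find node1's service-plane IP
    model_path="xxx"           # Path to the model weights
    node0_ip="xx.xx.xx.xx"     # Service-plane IP of node0 (the master node)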

    export HCCL_IF_IP=${local_ip}
    export GLOO_SOCKET_IFNAME=${nic_name}
    export TP_SOCKET_IFNAME=${nic_name}
    export HCCL_SOCKET_IFNAME=${nic_name}
    export OMP_PROC_BIND=false
    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export OMP_NUM_THREADS=1
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export VLLM_USE_V1=1
    export HCCL_BUFFSIZE=1024
    export HCCL_OP_EXPANSION_MODE=AIV

    vllm serve ${model_path} \
    --host 0.0.0.0 \
    --port 8080 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 1 \
    --headless \
    --data-parallel-address ${node0_ip} \
    --data-parallel-rpc-port 8082 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm47 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --additional-config '{"enable_multistream_moe":false,"chunked_prefill_for_mla":true,"ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}'
    ```
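
The flags in the two launch scripts imply the topology stated earlier: 2 data-parallel ranks, each spanning a tensor-parallel group of 8, gives 16 NPUs across two nodes. The sketch below checks that arithmetic and confirms that the `--additional-config` value is well-formed JSON before it is pasted into a script. The helper names are hypothetical, and the sketch only parses the string; it does not validate the keys against vllm_ascend.

```python
import json

# The --additional-config value used in the launch scripts above.
ADDITIONAL_CONFIG = (
    '{"enable_multistream_moe":false, "chunked_prefill_for_mla":true, '
    '"ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}'
)


def total_npus(data_parallel_size, tensor_parallel_size):
    """Total NPUs used: each data-parallel rank holds one tensor-parallel group."""
    return data_parallel_size * tensor_parallel_size


def nodes_required(data_parallel_size, data_parallel_size_local):
    """Number of nodes when every node hosts the same number of DP ranks."""
    if data_parallel_size % data_parallel_size_local != 0:
        raise ValueError("data-parallel size must be a multiple of the local size")
    return data_parallel_size // data_parallel_size_local


def parse_additional_config(raw):
    """Parse the config string; raises ValueError if the JSON is malformed."""
    config = json.loads(raw)
    if not isinstance(config, dict):
        raise ValueError("additional config must be a JSON object")
    return config
```

With the values from the scripts, `total_npus(2, 8)` is 16 and `nodes_required(2, 1)` is 2, which matches the two-node, 16-NPU deployment.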
3. Tool-call patch
Adding the following parameters to the launch command enables tool-call parsing:
    ```
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    ```
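    Enabling tool-call parsing requires the following changes.
    Run `pip show vllm` to find the `vllm` installation path, go to the `/vllm-workspace/vllm/vllm/tool_parsers` directory, and do the following:
    * Create a new file `glm47_moe_tool_parser.py` with the following code: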

    ```
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project


    import regex as re

    from vllm.logger import init_logger
    from vllm.tokenizers import TokenizerLike
    from vllm.tool_parsers.glm4_moe_tool_parser import Glm4MoeModelToolParser

    logger = init_logger(__name__)


    class Glm47MoeModelToolParser(Glm4MoeModelToolParser):
        def __init__(self, tokenizer: TokenizerLike):
            super().__init__(tokenizer)
            self.func_detail_regex = re.compile(
                r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
            )
            self.func_arg_regex = re.compile(
                r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
                re.DOTALL,
            )

    ```
    * Add the following code to `__init__.py`:

    ```
    "glm47": (
        "glm47_moe_tool_parser",
        "Glm47MoeModelToolParser",
    ),

    ```
    ![](./images/1769474302750_image.png)
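
To see what the two regular expressions in `Glm47MoeModelToolParser` capture, the sketch below applies them to a made-up model output. It uses the standard-library `re` module instead of `regex` so that it runs without extra dependencies, the sample text is hypothetical, and `parse_tool_calls` is an illustrative helper rather than the vLLM parser itself (argument values are left as raw strings).

```python
import re

# Same patterns as in the parser above, compiled with the stdlib `re` module.
FUNC_DETAIL_REGEX = re.compile(
    r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
)
FUNC_ARG_REGEX = re.compile(
    r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
    re.DOTALL,
)


def parse_tool_calls(text):
    """Return a list of (function_name, {arg: value}) tuples found in text."""
    calls = []
    for name, arg_block in FUNC_DETAIL_REGEX.findall(text):
        # arg_block is an empty string when the call carries no arguments.
        args = {
            key.strip(): value.strip()
            for key, value in FUNC_ARG_REGEX.findall(arg_block)
        }
        calls.append((name.strip(), args))
    return calls


# Hypothetical model output: one call with an argument, one without.
SAMPLE = (
    "<tool_call>get_weather<arg_key>city</arg_key>\n"
    "<arg_value>Beijing</arg_value></tool_call>"
    "<tool_call>list_cities\n</tool_call>"
)
```

The optional second group in the first pattern is what lets a call without any `<arg_key>` block still match, and the function name is captured whether or not a newline follows it.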

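4. Reasoning-parser patch
Adding the following parameter to the launch command enables reasoning parsing; the reasoning content is returned in the `reasoning_content` field: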
    ```
    --reasoning-parser glm45 \
    ```
    Run `pip show vllm` to find the `vllm` installation path, go to the `/vllm-workspace/vllm/vllm/reasoning` directory, and do the following:
    * Patch the `glm4_moe_reasoning_parser.py` file.
    Patch command:

    ```
    patch glm4_moe_reasoning_parser.py < glm4_moe_reasoning_parser.patch

    ```
    The patch file `glm4_moe_reasoning_parser.patch` is as follows:

    ```
    --- glm4_moe_reasoning_parser.py        2026-01-12 10:53:02.113178000 +0800
    +++ new_glm4_moe_reasoning_parser.py    2026-01-12 10:53:09.998648500 +0800
    @@ -1,171 +1,13 @@
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
    
    -from collections.abc import Sequence
    +from vllm.reasoning.holo2_reasoning_parser import Holo2ReasoningParser
    
    -from transformers import PreTrainedTokenizerBase
    
    -from vllm.entrypoints.openai.protocol import ChatCompletionRequest, DeltaMessage
    -from vllm.logger import init_logger
    -from vllm.reasoning import ReasoningParser
    -
    -logger = init_logger(__name__)
    -
    -
    -class Glm4MoeModelReasoningParser(ReasoningParser):
    +class Glm4MoeModelReasoningParser(Holo2ReasoningParser):
        """
    -    Reasoning parser for the Glm4MoeModel model.
    -
    -    The Glm4MoeModel model uses <think>...</think> tokens to denote reasoning
    -    text within its output. The model provides a strict switch to disable
    -    reasoning output via the 'enable_thinking=False' parameter. This parser
    -    extracts the reasoning content enclosed by <think> and </think> tokens
    -    from the model's output.
    +    Reasoning parser for the Glm4MoeModel model,which inherits from
    +    `Holo2ReasoningParser`.
        """
    
    -    def __init__(self, tokenizer: PreTrainedTokenizerBase, *args, **kwargs):
    -        super().__init__(tokenizer, *args, **kwargs)
    -        self.think_start_token = "<think>"
    -        self.think_end_token = "</think>"
    -        self.assistant_token = "<|assistant|>"
    -
    -        if not self.model_tokenizer:
    -            raise ValueError(
    -                "The model tokenizer must be passed to the ReasoningParser "
    -                "constructor during construction."
    -            )
    -
    -        self.think_start_token_id = self.vocab.get(self.think_start_token)
    -        self.think_end_token_id = self.vocab.get(self.think_end_token)
    -        self.assistant_token_id = self.vocab.get(self.assistant_token)
    -        if (
    -            self.think_start_token_id is None
    -            or self.think_end_token_id is None
    -            or self.assistant_token_id is None
    -        ):
    -            raise RuntimeError(
    -                "Glm4MoeModel reasoning parser could not locate "
    -                "think start/end or assistant tokens in the tokenizer!"
    -            )
    -
    -    def is_reasoning_end(self, input_ids: list[int]) -> bool:
    -        """
    -        GLM's chat template has <think></think> tokens after every
    -        <|assistant|> token. Thus, we need to check if </think> is
    -        after the most recent <|assistant|> token (if present).
    -        """
    -        for token_id in input_ids[::-1]:
    -            if token_id == self.think_end_token_id:
    -                return True
    -            elif token_id == self.assistant_token_id:
    -                return False
    -        return False
    -
    -    def extract_content_ids(self, input_ids: list[int]) -> list[int]:
    -        """
    -        Extract the content after the end tokens
    -        """
    -        if self.think_end_token_id not in input_ids[:-1]:
    -            return []
    -        else:
    -            return input_ids[input_ids.index(self.think_end_token_id) + 1 :]
    -
    -    def extract_reasoning_streaming(
    -        self,
    -        previous_text: str,
    -        current_text: str,
    -        delta_text: str,
    -        previous_token_ids: Sequence[int],
    -        current_token_ids: Sequence[int],
    -        delta_token_ids: Sequence[int],
    -    ) -> DeltaMessage | None:
    -        """
    -        Extract reasoning content from a delta message.
    -        Handles streaming output where previous + delta = current.
    -        Uses token IDs for faster processing.
    -        For text <think>abc</think>xyz:
    -        - 'abc' goes to reasoning
    -        - 'xyz' goes to content
    -        """
    -        # Skip single special tokens
    -        if len(delta_token_ids) == 1 and (
    -            delta_token_ids[0] in [self.think_start_token_id, self.think_end_token_id]
    -        ):
    -            return None
    -
    -        if self.think_start_token_id in previous_token_ids:
    -            if self.think_end_token_id in delta_token_ids:
    -                # <think> in previous, </think> in delta,
    -                # extract reasoning content
    -                end_index = delta_text.find(self.think_end_token)
    -                reasoning = delta_text[:end_index]
    -                content = delta_text[end_index + len(self.think_end_token) :]
    -                return DeltaMessage(
    -                    reasoning=reasoning,
    -                    content=content if content else None,
    -                )
    -            elif self.think_end_token_id in previous_token_ids:
    -                # <think> in previous, </think> in previous,
    -                # reasoning content continues
    -                return DeltaMessage(content=delta_text)
    -            else:
    -                # <think> in previous, no </think> in previous or delta,
    -                # reasoning content continues
    -                return DeltaMessage(reasoning=delta_text)
    -        elif self.think_start_token_id in delta_token_ids:
    -            if self.think_end_token_id in delta_token_ids:
    -                # <think> in delta, </think> in delta, extract reasoning content
    -                start_index = delta_text.find(self.think_start_token)
    -                end_index = delta_text.find(self.think_end_token)
    -                reasoning = delta_text[
    -                    start_index + len(self.think_start_token) : end_index
    -                ]
    -                content = delta_text[end_index + len(self.think_end_token) :]
    -                return DeltaMessage(
    -                    reasoning=reasoning,
    -                    content=content if content else None,
    -                )
    -            else:
    -                # <think> in delta, no </think> in delta,
    -                # reasoning content continues
    -                return DeltaMessage(reasoning=delta_text)
    -        else:
    -            # thinking is disabled, just content
    -            return DeltaMessage(content=delta_text)
    -
    -    def extract_reasoning(
    -        self, model_output: str, request: ChatCompletionRequest
    -    ) -> tuple[str | None, str | None]:
    -        """
    -        Extract reasoning content from the model output.
    -
    -        For text <think>abc</think>xyz:
    -        - 'abc' goes to reasoning
    -        - 'xyz' goes to content
    -
    -        Returns:
    -            tuple[Optional[str], Optional[str]]: reasoning content and content
    -        """
    -
    -        # Check if the model output contains the <think> and </think> tokens.
    -        if (
    -            self.think_start_token not in model_output
    -            or self.think_end_token not in model_output
    -        ):
    -            return None, model_output
    -        # Check if the <think> is present in the model output, remove it
    -        # if it is present.
    -        model_output_parts = model_output.partition(self.think_start_token)
    -        model_output = (
    -            model_output_parts[2] if model_output_parts[1] else model_output_parts[0]
    -        )
    -        # Check if the model output contains the </think> tokens.
    -        # If the end token is not found, return the model output as is.
    -        if self.think_end_token not in model_output:
    -            return None, model_output
    -
    -        # Extract reasoning content from the model output.
    -        reasoning, _, content = model_output.partition(self.think_end_token)
    -
    -        final_content = content or None
    -        return reasoning, final_content
    +    pass
    ```
    * Patch the `holo2_reasoning_parser.py` file.
    Patch command:

    ```
    patch holo2_reasoning_parser.py < holo2_reasoning_parser.patch

    ```
    The patch file `holo2_reasoning_parser.patch` is as follows:

    ```
    --- holo2_reasoning_parser.py   2026-01-12 10:25:29.213407100 +0800
    +++ new_holo2_reasoning_parser.py       2026-01-12 10:24:45.961472400 +0800
    @@ -46,9 +46,10 @@
            # all requests in the structured output manager. So it is important that without
            # user specified chat template args, the default thinking is True.
    
    -        enable_thinking = bool(chat_kwargs.get("thinking", True))
    -
    -        if enable_thinking:
    +        thinking = bool(chat_kwargs.get("thinking", True))
    +        enable_thinking = bool(chat_kwargs.get("enable_thinking", True))
    +        thinking = thinking and enable_thinking
    +        if thinking:
                self._parser = DeepSeekR1ReasoningParser(tokenizer, *args, **kwargs)
            else:
                self._parser = IdentityReasoningParser(tokenizer, *args, **kwargs)
    ```
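
The effect of the `holo2_reasoning_parser.py` patch is that reasoning parsing stays on only when both the `thinking` and `enable_thinking` chat-template arguments are truthy, and both default to `True`. The sketch below mirrors that condition in isolation; `thinking_enabled` and `select_parser_name` are hypothetical helpers, and the returned strings only name the parser classes that the patched code instantiates.

```python
def thinking_enabled(chat_kwargs=None):
    """Mirror the patched condition: both switches must be truthy (default True)."""
    chat_kwargs = chat_kwargs or {}
    thinking = bool(chat_kwargs.get("thinking", True))
    enable_thinking = bool(chat_kwargs.get("enable_thinking", True))
    return thinking and enable_thinking


def select_parser_name(chat_kwargs=None):
    """Name of the delegate parser the patched code would choose."""
    if thinking_enabled(chat_kwargs):
        return "DeepSeekR1ReasoningParser"
    return "IdentityReasoningParser"
```

Before the patch only the `thinking` key was consulted, so a request that set `enable_thinking` to `False` would still have been routed to the reasoning parser.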

5. Start the service

    ```
    # Run inside the node0 container
    bash glm_47_infer_node0.sh
    # Run inside the node1 container
    bash glm_47_infer_node1.sh

    # [Testing] To keep a long test run from being interrupted unexpectedly, run the service in the background instead
    # Run inside the node0 container (see the run screenshots below)
    nohup bash glm_47_infer_node0.sh > glm_47_infer_node0_log.txt 2>&1 &
    tail -f glm_47_infer_node0_log.txt
    # Run inside the node1 container (see the run screenshots below)
    nohup bash glm_47_infer_node1.sh > glm_47_infer_node1_log.txt 2>&1 &
    tail -f glm_47_infer_node1_log.txt
    ```
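
Loading the weights on both nodes can take a while, so instead of watching the log it may be convenient to poll the service until it answers. This is a minimal sketch with hypothetical helper names; it assumes the vLLM OpenAI-compatible server on node0 exposes a `/health` endpoint on port 8080, which should be confirmed for your version.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, attempts=60, interval=10.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds; return the attempt number, or 0 on timeout."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(interval)
    return 0


# Example (run on node0 after starting both scripts):
#   wait_until_ready("http://localhost:8080/health")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.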
6. Inference verification
Send the following request:

    ```
    # Inference request
    curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d \
    '{
        "model": "glm47",
        "max_tokens":10,
        "messages": [
            {
                "role": "user",
                "content": "请做一下自我介绍"
            }
        ]
    }'
    ```
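
The same verification request can be issued from Python using only the standard library. This is a sketch under the assumptions of the curl example (service on `localhost:8080`, served model name `glm47`); `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", model="glm47", max_tokens=10):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the service to be running):
#   reply = send(build_chat_request("Please introduce yourself"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.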

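## 4. FAQ
1. GLM-4.7 occasionally produces repeated output on complex tasks such as coding, for example getting stuck in a reasoning loop or repeating large blocks of text. The problem has been reproduced with the official API, the bf16 model, and the w8a8 model, and it occurs more often with the w8a8 model. The current assessment is that it is a model-capability anomaly introduced by the post-training process.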

一. 模型概述及场景

1.1 模型基本信息：

与 `GLM-4.6` 相比，`GLM-4.7` 带来了几个关键改进：

1. 核心编码能力：与前代模型 `GLM-4.6` 相比，`GLM-4.7` 在多语言智能体编程和终端任务方面取得了显著提升，包括在 SWE-bench 上达到（73.8%，+5.8%）、SWE-bench 多语言版上达到（66.7%，+12.9%），以及 Terminal Bench 2.0 上达到（41%，+16.5%）。此外，`GLM-4.7` 支持“先思考后行动”，在 Claude Code、Kilo Code、Cline 和 Roo Code 等主流智能体框架中的复杂任务上表现显著增强。
2. 氛围化编程（Vibe Coding）：`GLM-4.7` 在提升 UI 质量方面迈出了一大步，能够生成更简洁、更现代的网页，并制作出布局更精准、尺寸更合理的美观幻灯片。
3. 工具使用能力：`GLM-4.7` 在工具使用方面实现了显著进步，在 τ^2-Bench 等基准测试以及通过 BrowseComp 进行的网页浏览任务中均展现出明显更优的性能。
4. 复杂推理能力：`GLM-4.7` 在数学与推理能力方面大幅提升，在 HLE（Humanity’s Last Exam）基准测试中相比 `GLM-4.6` 取得了（42.8%，+12.4%）的成绩。
此外，在聊天、创意写作和角色扮演等多种场景中，您也能看到显著的性能提升。

1.2 模型权重：

1.3 使用场景：推理

二. 准备运行环境

2.1 环境准备

NPU驱动固件：25.2.1
硬件配置：Atlas 800T A2（8*64G）
部署卡类型：910B2
部署方式：双机16卡
表 1 版本配套表

配套	版本	环境准备指导
Python	>= 3.10, < 3.12	-
CANN	8.3.rc2	-
torch	2.8.0	-
torch_npu	2.8.0	-
vllm	0.13.0	-
vllm_ascend	0.13.0rc1	-

2.2 镜像

镜像：quay.io/ascend/vllm-ascend:v0.13.0rc1
镜像网站：vllm_ascend镜像网站
该镜像为vllm_ascend官方镜像，可通过如下命令拉取：

    docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
    ```

## 三.运行指导
### 3.1 模型权重下载
权重下载链接：https://www.modelscope.cn/models/ZhipuAI/GLM-4.7

1. 模型下载脚本：modelscope_glm_47.py

    ```
    # 验证 ModelScope token
    from modelscope.hub.api import HubApi
    api = HubApi()
    api.login('ms-61a2382e-0e41-4fff-b504-167967a77de9')

    # 模型下载
    from modelscope import snapshot_download
    # 基于实际环境调整cache_dir、local_dir
    model_dir = snapshot_download('ZhipuAI/GLM-4.7', cache_dir='/opt/data/verification/models/.cache', local_dir='/opt/data/verification/models/GLM-4.7')   
    ```
2. 执行脚本下载权重

    ```
    python modelscope_glm_47.py
    ```

### 3.2 镜像下载
1. 拉取quay.io/ascend/vllm-ascend:v0.13.0rc1镜像

    ```
    # 拉取镜像
    docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1

    # 查看镜像
    docker images
    ```
2. 启动容器

    ```
    #!/bin/sh
    NAME=$1                             # 容器名称（用户自定义）
    IMAGE=$2                            # 镜像 IMAGE ID （通过上述docker images查询）

    docker run \
    --name $NAME \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /opt/data:/root/.cache \
    -it $IMAGE bash
    ```
3. 执行脚本进入容器

    ```
    bash docker_run.sh vllm-ascend-v0.13.0rc1 524b38df7811
    ```
    在容器外（如其他终端）查看当前docker容器：

    ```
    docker ps
    ```
    在容器外（如其他终端）进入容器：

    ```
    docker exec -it vllm-ascend-v0.13.0rc1 /bin/bash
    ```
### 3.3 启动推理服务
双机部署，启动脚本如下：

1. 主节点
启动脚本：glm_47_infer_node0.sh

    ```
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

    nic_name="xxx"              # 使用ifconfig 查看node0业务面IP对应的网卡名称
    local_ip="xx.xx.xx.xx"     # 使用ifconfig 查看node0业务面IP
    model_path="xxx"           # 模型权重路径

    export HCCL_IF_IP=${local_ip}
    export GLOO_SOCKET_IFNAME=${nic_name}
    export TP_SOCKET_IFNAME=${nic_name}
    export HCCL_SOCKET_IFNAME=${nic_name}
    export OMP_PROC_BIND=false
    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export OMP_NUM_THREADS=1
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export VLLM_USE_V1=1
    export HCCL_BUFFSIZE=1024
    export HCCL_OP_EXPANSION_MODE=AIV

    vllm serve ${model_path} \
    --host 0.0.0.0 \
    --port 8080 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-address ${local_ip} \
    --data-parallel-rpc-port 8082 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm47 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --additional-config '{"enable_multistream_moe":false, "chunked_prefill_for_mla":true, "ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}' 
    ```
2. 从节点
启动脚本：glm_47_infer_node1.sh

    ```
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh

    nic_name="xxx"              # 使用ifconfig 查看node0业务面IP对应的网卡名称
    local_ip="xx.xx.xx.xx"     # 使用ifconfig 查看node0业务面IP
    model_path="xxx"           # 模型权重路径
    node0_ip="xx.xx.xx.xx"

    export HCCL_IF_IP=${local_ip}
    export GLOO_SOCKET_IFNAME=${nic_name}
    export TP_SOCKET_IFNAME=${nic_name}
    export HCCL_SOCKET_IFNAME=${nic_name}
    export OMP_PROC_BIND=false
    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export OMP_NUM_THREADS=1
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export VLLM_USE_V1=1
    export HCCL_BUFFSIZE=1024
    export HCCL_OP_EXPANSION_MODE=AIV

    vllm serve ${model_path} \
    --host 0.0.0.0 \
    --port 8080 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 1 \
    --headless \
    --data-parallel-address ${node0_ip} \
    --data-parallel-rpc-port 8082 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm47 \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --additional-config '{"enable_multistream_moe":false,"chunked_prefill_for_mla":true,"ascend_scheduler_config":{"enabled":true}, "enable_weight_nz_layout":true}'
    ```
3. 工具调用补丁 
启动命令加入以下参数为开启工具调用解析
    ```
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    ```
    开启工具调用解析需进行以下处理
    通过`pip show vllm`查看`vllm`安装路径，进入`/vllm-workspace/vllm/vllm/tool_parsers`目录做如下操作：
    * 新建`glm47_moe_tool_parser.py`文件增加以下代码：

    ```
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project


    import regex as re

    from vllm.logger import init_logger
    from vllm.tokenizers import TokenizerLike
    from vllm.tool_parsers.glm4_moe_tool_parser import Glm4MoeModelToolParser

    logger = init_logger(__name__)


    class Glm47MoeModelToolParser(Glm4MoeModelToolParser):
        def __init__(self, tokenizer: TokenizerLike):
            super().__init__(tokenizer)
            self.func_detail_regex = re.compile(
                r"<tool_call>(.*?)(<arg_key>.*?)?</tool_call>", re.DOTALL
            )
            self.func_arg_regex = re.compile(
                r"<arg_key>(.*?)</arg_key>(?:\\n|\s)*<arg_value>(.*?)</arg_value>",
                re.DOTALL,
            )

    ```
    * 在`__init__.py`中新增以下代码：

    ```
    "glm47": (
        "glm47_moe_tool_parser",
        "Glm47MoeModelToolParser",
    ),

    ```
    ![](./images/1769474302750_image.png)

4. 思考解析补丁
启动命令加入以下参数为开启思考解析，思考内容会通过reasoning_content字段返回
    ```
    --reasoning-parser glm45 \
    ```
    通过`pip show vllm`查看`vllm`安装路径，进入`/vllm-workspace/vllm/vllm/reasoning`目录做如下操作：
    * 对glm4_moe_reasoning_parser.py文件打补丁
    补丁命令如下：

    ```
    patch glm4_moe_reasoning_parser.py < glm4_moe_reasoning_parser.patch

    ```
    补丁文件glm4_moe_reasoning_parser.patch如下：

    ```
    --- glm4_moe_reasoning_parser.py        2026-01-12 10:53:02.113178000 +0800
    +++ new_glm4_moe_reasoning_parser.py    2026-01-12 10:53:09.998648500 +0800
    @@ -1,171 +1,13 @@
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
    
    -from collections.abc import Sequence
    +from vllm.reasoning.holo2_reasoning_parser import Holo2ReasoningParser
    
    -from transformers import PreTrainedTokenizerBase
    
    -from vllm.entrypoints.openai.protocol import ChatCompletionRequest, DeltaMessage
    -from vllm.logger import init_logger
    -from vllm.reasoning import ReasoningParser
    -
    -logger = init_logger(__name__)
    -
    -
    -class Glm4MoeModelReasoningParser(ReasoningParser):
    +class Glm4MoeModelReasoningParser(Holo2ReasoningParser):
        """
    -    Reasoning parser for the Glm4MoeModel model.
    -
    -    The Glm4MoeModel model uses </think>...</think> tokens to denote reasoning
    -    text within its output. The model provides a strict switch to disable
    -    reasoning output via the 'enable_thinking=False' parameter. This parser
    -    extracts the reasoning content enclosed by <RichMediaReference> and </think> tokens
    -    from the model's output.
    +    Reasoning parser for the Glm4MoeModel model,which inherits from
    +    `Holo2ReasoningParser`.
        """
    
    -    def __init__(self, tokenizer: PreTrainedTokenizerBase, *args, **kwargs):
    -        super().__init__(tokenizer, *args, **kwargs)
    -        self.think_start_token = "</think>"
    -        self.think_end_token = "</think>"
    -        self.assistant_token = "<|assistant|>"
    -
    -        if not self.model_tokenizer:
    -            raise ValueError(
    -                "The model tokenizer must be passed to the ReasoningParser "
    -                "constructor during construction."
    -            )
    -
    -        self.think_start_token_id = self.vocab.get(self.think_start_token)
    -        self.think_end_token_id = self.vocab.get(self.think_end_token)
    -        self.assistant_token_id = self.vocab.get(self.assistant_token)
    -        if (
    -            self.think_start_token_id is None
    -            or self.think_end_token_id is None
    -            or self.assistant_token_id is None
    -        ):
    -            raise RuntimeError(
    -                "Glm4MoeModel reasoning parser could not locate "
    -                "think start/end or assistant tokens in the tokenizer!"
    -            )
    -
    -    def is_reasoning_end(self, input_ids: list[int]) -> bool:
    -        """
    -        GLM's chat template has <RichMediaReference>superscript: tokens after every
    -        <|assistant|> token. Thus, we need to check if <RichMediaReference> is
    -        after the most recent <|assistant|> token (if present).
    -        """
    -        for token_id in input_ids[::-1]:
    -            if token_id == self.think_end_token_id:
    -                return True
    -            elif token_id == self.assistant_token_id:
    -                return False
    -        return False
    -
    -    def extract_content_ids(self, input_ids: list[int]) -> list[int]:
    -        """
    -        Extract the content after the end tokens
    -        """
    -        if self.think_end_token_id not in input_ids[:-1]:
    -            return []
    -        else:
    -            return input_ids[input_ids.index(self.think_end_token_id) + 1 :]
    -
    -    def extract_reasoning_streaming(
    -        self,
    -        previous_text: str,
    -        current_text: str,
    -        delta_text: str,
    -        previous_token_ids: Sequence[int],
    -        current_token_ids: Sequence[int],
    -        delta_token_ids: Sequence[int],
    -    ) -> DeltaMessage | None:
    -        """
    -        Extract reasoning content from a delta message.
    -        Handles streaming output where previous + delta = current.
    -        Uses token IDs for faster processing.
    -        For text <RichMediaReference>abcsuperscript:xyz:
    -        - 'abc' goes to reasoning
    -        - 'xyz' goes to content
    -        """
    -        # Skip single special tokens
    -        if len(delta_token_ids) == 1 and (
    -            delta_token_ids[0] in [self.think_start_token_id, self.think_end_token_id]
    -        ):
    -            return None
    -
    -        if self.think_start_token_id in previous_token_ids:
    -            if self.think_end_token_id in delta_token_ids:
    -                # </think> in previous, superscript: in delta,
    -                # extract reasoning content
    -                end_index = delta_text.find(self.think_end_token)
    -                reasoning = delta_text[:end_index]
    -                content = delta_text[end_index + len(self.think_end_token) :]
    -                return DeltaMessage(
    -                    reasoning=reasoning,
    -                    content=content if content else None,
    -                )
    -            elif self.think_end_token_id in previous_token_ids:
    -                # <RichMediaReference> in previous, superscript: in previous,
    -                # reasoning content continues
    -                return DeltaMessage(content=delta_text)
    -            else:
    -                # <RichMediaReference> in previous, no <RichMediaReference> in previous or delta,
    -                # reasoning content continues
    -                return DeltaMessage(reasoning=delta_text)
    -        elif self.think_start_token_id in delta_token_ids:
    -            if self.think_end_token_id in delta_token_ids:
    -                # <RichMediaReference> in delta, superscript: in delta, extract reasoning content
    -                start_index = delta_text.find(self.think_start_token)
    -                end_index = delta_text.find(self.think_end_token)
    -                reasoning = delta_text[
    -                    start_index + len(self.think_start_token) : end_index
    -                ]
    -                content = delta_text[end_index + len(self.think_end_token) :]
    -                return DeltaMessage(
    -                    reasoning=reasoning,
    -                    content=content if content else None,
    -                )
    -            else:
    -                # <think> in delta, no </think> in delta,
    -                # reasoning content continues
    -                return DeltaMessage(reasoning=delta_text)
    -        else:
    -            # thinking is disabled, just content
    -            return DeltaMessage(content=delta_text)
    -
    -    def extract_reasoning(
    -        self, model_output: str, request: ChatCompletionRequest
    -    ) -> tuple[str | None, str | None]:
    -        """
    -        Extract reasoning content from the model output.
    -
    -        For text <think>abc</think>xyz:
    -        - 'abc' goes to reasoning
    -        - 'xyz' goes to content
    -
    -        Returns:
    -            tuple[Optional[str], Optional[str]]: reasoning content and content
    -        """
    -
    -        # Check if the model output contains the <think> and </think> tokens.
    -        if (
    -            self.think_start_token not in model_output
    -            or self.think_end_token not in model_output
    -        ):
    -            return None, model_output
    -        # Check if the <think> is present in the model output, remove it
    -        # if it is present.
    -        model_output_parts = model_output.partition(self.think_start_token)
    -        model_output = (
    -            model_output_parts[2] if model_output_parts[1] else model_output_parts[0]
    -        )
    -        # Check if the model output contains the </think> tokens.
    -        # If the end token is not found, return the model output as is.
    -        if self.think_end_token not in model_output:
    -            return None, model_output
    -
    -        # Extract reasoning content from the model output.
    -        reasoning, _, content = model_output.partition(self.think_end_token)
    -
    -        final_content = content or None
    -        return reasoning, final_content
    +    pass
    ```
    * Patch the holo2_moe_reasoning_parser.py file
    Run the following patch command:

    ```
    patch holo2_moe_reasoning_parser.py < holo2_moe_reasoning_parser.patch

    ```
    The patch file holo2_moe_reasoning_parser.patch is as follows:

    ```
    --- holo2_reasoning_parser.py   2026-01-12 10:25:29.213407100 +0800
    +++ new_holo2_reasoning_parser.py       2026-01-12 10:24:45.961472400 +0800
    @@ -46,9 +46,10 @@
            # all requests in the structured output manager. So it is important that without
            # user specified chat template args, the default thinking is True.
    
    -        enable_thinking = bool(chat_kwargs.get("thinking", True))
    -
    -        if enable_thinking:
    +        thinking = bool(chat_kwargs.get("thinking", True))
    +        enable_thinking = bool(chat_kwargs.get("enable_thinking", True))
    +        thinking = thinking and enable_thinking
    +        if thinking:
                self._parser = DeepSeekR1ReasoningParser(tokenizer, *args, **kwargs)
            else:
                self._parser = IdentityReasoningParser(tokenizer, *args, **kwargs)
    ```
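
The effect of this patch can be illustrated with a minimal sketch that reproduces only the flag logic, outside vLLM. `resolve_thinking` is a hypothetical helper name and not part of vLLM; it assumes, as in the patched lines, that both keys arrive through the request's chat template arguments and default to `True`.

```python
def resolve_thinking(chat_template_kwargs=None):
    """Return True when the reasoning parser should stay active.

    Hypothetical helper mirroring the patched lines: `thinking` and
    `enable_thinking` both default to True, and setting either one to a
    falsy value switches reasoning parsing off.
    """
    chat_kwargs = chat_template_kwargs or {}
    thinking = bool(chat_kwargs.get("thinking", True))
    enable_thinking = bool(chat_kwargs.get("enable_thinking", True))
    return thinking and enable_thinking


# Show how each combination of flags resolves.
for case in (None, {"thinking": False}, {"enable_thinking": False},
             {"thinking": True, "enable_thinking": True}):
    print(case, "->", resolve_thinking(case))
```

Before the patch only `thinking` was read, so a request that set `enable_thinking` to `false` was still routed to `DeepSeekR1ReasoningParser`. After the patch, either key can select `IdentityReasoningParser`.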

5. Start the service

    ```
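    # Run inside the node0 container
    bash glm_47_infer_node0.sh
    # Run inside the node1 container
    bash glm_47_infer_node1.sh

    # [Testing phase] To keep an unexpected interruption from disrupting long test runs, run the service in the background
    # Run inside the node0 container (screenshots of the run are shown below)
    nohup bash glm_47_infer_node0.sh > glm_47_infer_node0_log.txt 2>&1 &
    tail -f glm_47_infer_node0_log.txt
    # Run inside the node1 container (screenshots of the run are shown below)
    nohup bash glm_47_infer_node1.sh > glm_47_infer_node1_log.txt 2>&1 &
    tail -f glm_47_infer_node1_log.txt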
    ```
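
Loading a model of this size across two nodes can take a while, so it may help to wait until the API server answers before sending requests. The sketch below polls a health URL until it responds. It assumes the service listens on port 8080 of node0 (the port used by the verification request in this guide) and relies on the `/health` endpoint of vLLM's OpenAI-compatible server. `http_probe` and `wait_until_ready` are hypothetical helper names, and the timeout values are placeholders to adjust.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, total_timeout=1800.0, interval=10.0,
                     probe=http_probe, sleep=time.sleep):
    """Poll `url` until the probe succeeds or `total_timeout` seconds pass."""
    deadline = time.monotonic() + total_timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example (assumed address: node0, port 8080):
# wait_until_ready("http://localhost:8080/health")
```

The `probe` and `sleep` parameters are injectable so that the loop can be exercised without a running server.
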
6. Verify inference
The request is as follows:

    ```
    # Inference request
    curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d \
    '{
        "model": "glm47",
        "max_tokens":10,
        "messages": [
            {
                "role": "user",
                "content": "Please introduce yourself"
            }
        ]
    }'
    ```
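
The same request can also be sent from Python, which makes it easier to exercise the thinking switch handled by the reasoning-parser patch above. This is a minimal sketch, not part of the original procedure. It assumes the endpoint and model name from the curl command and uses the `chat_template_kwargs` request field of vLLM's chat completions API. `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, model="glm47", max_tokens=10, enable_thinking=None):
    """Build the JSON body for /v1/chat/completions."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    # Only send the switch when the caller sets it explicitly.
    if enable_thinking is not None:
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return body


def send_chat_request(body, base_url="http://localhost:8080", timeout=120.0):
    """POST the body to the server and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the running service):
# reply = send_chat_request(build_chat_request("Please introduce yourself", enable_thinking=False))
# print(reply["choices"][0]["message"])
```

With `max_tokens` left at 10, as in the curl example, the reply is cut off after a few tokens. That is enough to confirm the service responds, and the limit can be raised to inspect a full answer.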

## 4. Common Issues
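1. GLM-4.7 occasionally produces repetitive output on complex tasks such as coding, getting stuck in a reasoning loop or repeating large blocks of output. The issue reproduces on the official API, the bf16 model, and the w8a8 model, and it reproduces more often on the w8a8 model. The current assessment is that it stems from a capability anomaly introduced during post-training.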
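
When reproducing or counting occurrences of this issue, a simple client-side heuristic can flag responses whose tail keeps repeating. The sketch below is an assumption-laden illustration and not a fix for the model behavior. `has_repeating_tail` is a hypothetical helper, and its thresholds are arbitrary starting points.

```python
def has_repeating_tail(text, min_unit=20, min_repeats=3, max_unit=2000):
    """Return True if `text` ends with a unit of at least `min_unit`
    characters repeated back-to-back at least `min_repeats` times."""
    n = len(text)
    # A unit cannot be longer than the text divided by the repeat count.
    longest = min(max_unit, n // min_repeats)
    for unit_len in range(min_unit, longest + 1):
        unit = text[n - unit_len:]
        window = text[n - unit_len * min_repeats:]
        if window == unit * min_repeats:
            return True
    return False
```

For example, a response whose last three 20-character lines are identical is flagged, while a short answer with no repeated tail is not.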
