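DeepSeek-V4 Flash Agentic Enhancement Patch

Latest images: New DeepSeek V4 images have been released and already include this patch, so there is no need to apply it again. Download and use the latest images:

A2 image: quay.io/ascend/vllm-ascend:deepseekv4

A3 image: quay.io/ascend/vllm-ascend:deepseekv4-a3

0. Deploying with the new images

New vllm-ascend images have been released for DeepSeek V4, and deploying with them is the recommended option. For deployment instructions, see the official documentation:

DeepSeek-V4-Flash: https://docs.vllm.ai/projects/ascend/en/v0.18.0/tutorials/models/DeepSeek-V4-Flash.html
DeepSeek-V4-Pro: https://docs.vllm.ai/projects/ascend/en/v0.18.0/tutorials/models/DeepSeek-V4-Pro.html

The new images already include DeepSeek V4 support; the patch procedure below is mainly intended for the legacy Day 0 images.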

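1. Introduction

This directory provides a minimal feature patch for DeepSeek V4 and the corresponding deployment steps.

The goal of this patch is to let DeepSeek-V4-Flash-w8a8-mtp support the following options on vllm-ascend: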

--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4

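Disclaimer:

The patch in this repository mainly targets the official Ascend Day 0 release. Following the upstream implementation, it adds agentic capabilities (reasoning / tool calling) for DeepSeek V4 on vllm-ascend.
Because these capabilities have not yet been merged into an official upstream release, the current patch may carry stability or compatibility risks and is not recommended for direct production use.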

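2. Changelog

2026-04-30 update:

Corrected and completed the reasoning_effort mapping.
Routed tool_choice="required" requests through the deepseek_v4 parser path in the serving layer, which fixes function call requests returning 400 immediately when async scheduling is combined with MTP speculative decoding.
Made named tool and required requests share the deepseek_v4 parser in both the streaming and non-streaming paths, which fixes the inconsistent tool_calls extraction, unstable streaming output, and inconsistent finish_reason semantics seen in the upstream version.

2026-04-26 update:

Fixed function calls producing a doubly nested "arguments" field, which caused tool call parsing to fail.
Fixed an issue that occurred with reasoning_effort="high".

2026-04-24 update:

The base functionality comes from vllm PR #40760; follow-up fixes for streaming tool calls are based on vllm PR #40805.
The fix that prevents DSML fragments from leaking under auto + stream=true comes from PR #40805, which addresses issue #40801.
Added regression checks for typed tool args and multi-tool-call scenarios.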

3. Applying the patch

Base images:

A2:
- quay.io/ascend/vllm-ascend:v0.13.0rc3
- quay.io/ascend/vllm-ascend:v0.13.0rc3-openeuler
A3:
- quay.io/ascend/vllm-ascend:v0.13.0rc3-a3
- quay.io/ascend/vllm-ascend:v0.13.0rc3-a3-openeuler

Set up variables:

PATCH_DIR=/path/to/patch
VLLM_REPO=/vllm-workspace/vllm
VLLM_ASCEND_REPO=/vllm-workspace/vllm-ascend
MODEL_PATH=/models/DeepSeek-V4-Flash-w8a8-mtp

Apply the vllm patch:

cd "$VLLM_REPO"
git apply --check "$PATCH_DIR/deepseek-v4-agentic-support.patch"
git apply "$PATCH_DIR/deepseek-v4-agentic-support.patch"

To revert the vllm patch:

cd "$VLLM_REPO"
git apply -R --check "$PATCH_DIR/deepseek-v4-agentic-support.patch"
git apply -R "$PATCH_DIR/deepseek-v4-agentic-support.patch"
git diff --stat

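Notes:

git apply -R reverts file contents by applying the patch in reverse; it suits the case where the patch was just applied and needs to be undone as a whole.
If git apply -R --check fails, it usually means the files were modified again after the patch was applied, so the current code no longer fully matches the patch context.
If the patch changes have already been committed on their own, use git revert <commit> to revert that commit instead of running git apply -R again.
Make sure you are using the latest patch file. If it was copied from a Windows machine, you may need to run dos2unix or sed -i 's/\r$//' /path/to/patch to convert Windows line endings to Unix format.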
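
The line-ending conversion from the last note can also be done without dos2unix or sed. The following is a minimal Python sketch, not part of the patch itself; the helper names normalize_line_endings and fix_patch_file are made up for illustration. It counts CRLF pairs and rewrites the file only when some are found.

```python
from pathlib import Path


def normalize_line_endings(data: bytes) -> tuple[bytes, int]:
    """Convert CRLF to LF and report how many CRLF pairs were replaced."""
    crlf_count = data.count(b"\r\n")
    return data.replace(b"\r\n", b"\n"), crlf_count


def fix_patch_file(path: str) -> int:
    """Rewrite the patch in place only if it contains CRLF line endings.

    Hypothetical helper: returns the number of CRLF pairs that were converted.
    """
    patch = Path(path)
    fixed, crlf_count = normalize_line_endings(patch.read_bytes())
    if crlf_count:
        patch.write_bytes(fixed)
    return crlf_count
```

Calling fix_patch_file on the patch path before git apply --check returns 0 when the file already uses Unix line endings, so it is safe to run unconditionally.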

4. Starting the service

Single-node TP8 launch:

cd /workspace

export HCCL_OP_EXPANSION_MODE=AIV
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1

vllm serve "$MODEL_PATH" \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --served-model-name dsv4 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 16 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --quantization ascend \
  --async-scheduling \
  --additional-config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": true}' \
  --speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --port 8000
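
Loading the model can take a while, so the verification scripts may fail if they run before the server is listening. The sketch below is one way to wait for readiness, assuming the server exposes the OpenAI-compatible /v1/models route on port 8000 as configured above. The function names, attempt count, and delay are illustrative choices, not values from the patch.

```python
import json
import time
import urllib.error
import urllib.request


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by the OpenAI-compatible /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [item["id"] for item in payload.get("data", [])]


def wait_until_served(name, fetch=list_served_models,
                      base_url="http://127.0.0.1:8000",
                      attempts=60, delay=10.0):
    """Poll until `name` is listed; return True on success, False after giving up.

    `fetch` is injectable so the polling logic can be exercised without a server.
    """
    for _ in range(attempts):
        try:
            if name in fetch(base_url):
                return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(delay)
    return False
```

For the launch command above, wait_until_served("dsv4") would return True once the name passed to --served-model-name is listed.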

5. Basic verification

5.1 Reasoning verification

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
model = "dsv4"
messages = [
    {
        "role": "user",
        "content": "What is 17*19? Return only the final integer.",
    }
]

# Non-think
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)
print("non_think:", resp.choices[0].message.content)

# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)
print("think_high content:", resp.choices[0].message.content)
print("think_high reasoning:", getattr(resp.choices[0].message, "reasoning", None))

# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)
print("think_max content:", resp.choices[0].message.content)
print("think_max reasoning:", getattr(resp.choices[0].message, "reasoning", None))

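Expected results:

Non-think returns the final answer, for example 323.
Think High returns the final answer and carries a reasoning field.
Think Max returns the final answer and carries a reasoning field.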
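
The expectations above can be turned into an automated check instead of reading the printed output by eye. This is a minimal sketch; check_reasoning is a hypothetical helper, and the default expected answer of 323 simply matches the 17*19 prompt used in the script.

```python
def check_reasoning(content, reasoning, expect_reasoning, expected_answer="323"):
    """Return a list of problems; an empty list means the response passed.

    content / reasoning are the two message fields printed by the script above.
    """
    problems = []
    if expected_answer not in (content or ""):
        problems.append(f"final answer {expected_answer!r} missing from content")
    if expect_reasoning and not reasoning:
        problems.append("thinking mode was requested but reasoning is empty")
    return problems
```

For the Think High and Think Max responses, pass expect_reasoning=True together with resp.choices[0].message.content and the value read via getattr(..., "reasoning", None). For the Non-think response, pass expect_reasoning=False.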

5.2 Streaming function calling verification

from collections import defaultdict
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
tool_calls = defaultdict(lambda: {"name": "", "arguments": ""})
visible_content = []

stream = client.chat.completions.create(
    model="dsv4",
    messages=[
        {
            "role": "user",
            "content": "Call the weather tool for Beijing today.",
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Query weather by city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    temperature=0,
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        visible_content.append(delta.content)
    for tc in delta.tool_calls or []:
        entry = tool_calls[tc.index]
        if tc.function:
            if tc.function.name:
                entry["name"] = tc.function.name
            if tc.function.arguments:
                entry["arguments"] += tc.function.arguments

print("visible_content:", repr("".join(visible_content)))
print("tool_calls:", dict(tool_calls))

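Expected results:

The visible text should not contain fragments of internal markers such as DSML / tool_calls.
get_weather can be assembled correctly from the stream.
The arguments should be valid JSON, for example {"location": "Beijing"}.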
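
These three expectations can also be checked programmatically against the visible_content and tool_calls values collected by the script above. The sketch below is illustrative only: validate_stream_result is a hypothetical helper, and the marker substrings it searches for are an assumption that may need adjusting to the tags your build actually emits.

```python
import json

# Assumed marker substrings; adjust them to the internal tags your build emits.
FORBIDDEN_FRAGMENTS = ("DSML", "tool_calls")


def validate_stream_result(visible_text, tool_calls, expected_name="get_weather"):
    """Check the three expectations; return a list of problems (empty means pass).

    tool_calls maps a stream index to {"name": str, "arguments": str}.
    """
    problems = []
    for fragment in FORBIDDEN_FRAGMENTS:
        if fragment in visible_text:
            problems.append(f"internal marker {fragment!r} leaked into visible content")
    names = [call["name"] for call in tool_calls.values()]
    if expected_name not in names:
        problems.append(f"tool {expected_name!r} was not assembled from the stream")
    for index, call in tool_calls.items():
        try:
            parsed = json.loads(call["arguments"])
        except json.JSONDecodeError:
            problems.append(f"tool call {index} arguments are not valid JSON")
            continue
        if not isinstance(parsed, dict):
            problems.append(f"tool call {index} arguments are not a JSON object")
    return problems
```

With the script above, validate_stream_result("".join(visible_content), dict(tool_calls)) should return an empty list when the stream behaves as expected.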

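Features / regression scenarios covered so far:

typed tool args: integer / boolean / array / string
Multiple tool calls: two tool calls returned in a single response
Multiple tool calls: two tool calls are incrementally assembled correctly under streaming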
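
To reproduce the typed tool args scenario locally, one option is to declare a tool whose parameters use all four types and then compare the parsed arguments against the schema. The sketch below is an assumption-laden illustration: the book_table tool, its parameter names, and the type_mismatches helper are all invented here and are not taken from the patch's regression suite.

```python
import json

# Hypothetical tool: the name and parameters are made up for illustration.
BOOK_TABLE_TOOL = {
    "type": "function",
    "function": {
        "name": "book_table",
        "description": "Book a restaurant table.",
        "parameters": {
            "type": "object",
            "properties": {
                "party_size": {"type": "integer"},
                "outdoor": {"type": "boolean"},
                "dietary_tags": {"type": "array", "items": {"type": "string"}},
                "name": {"type": "string"},
            },
            "required": ["party_size", "name"],
        },
    },
}

JSON_TYPE_TO_PYTHON = {"integer": int, "boolean": bool, "array": list, "string": str}


def type_mismatches(arguments_json, tool):
    """Return the argument names whose parsed type differs from the schema."""
    properties = tool["function"]["parameters"]["properties"]
    mismatches = []
    for key, value in json.loads(arguments_json).items():
        expected = JSON_TYPE_TO_PYTHON.get(properties.get(key, {}).get("type"))
        if expected is None:
            continue  # undeclared or unsupported type: nothing to compare
        # bool is a subclass of int in Python, so reject True/False for integers.
        if expected is int and isinstance(value, bool):
            mismatches.append(key)
        elif not isinstance(value, expected):
            mismatches.append(key)
    return mismatches
```

Passing BOOK_TABLE_TOOL in the tools list of the streaming script and feeding each assembled arguments string to type_mismatches would flag cases where, for example, an integer comes back as the string "4".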

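6. FAQ

Question: In --speculative-config, which is better, "method": "deepseek_mtp" or "method": "mtp"? The original deployment guide (https://docs.vllm.ai/projects/ascend/en/v0.13.0/tutorials/DeepSeek-V4.html) uses "deepseek_mtp", while this document uses "mtp".
Answer: Both behave the same. deepseek_mtp has been deprecated, so this document uses mtp.
Question: Deployment fails with an error saying that transformers needs to be updated.
Answer: This transformers error appears when the environment variable TRITON_ALL_BLOCKS_PARALLEL=1 is not set. Export it before starting the service.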
