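Latest images: new DeepSeek V4 images have been released. They already include this patch, so applying the patch is no longer necessary. Please download and use the latest images:
A2 image name:
quay.io/ascend/vllm-ascend:deepseekv4
A3 image name: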
quay.io/ascend/vllm-ascend:deepseekv4-a3
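New vllm-ascend images have been released for DeepSeek V4, and deploying with them is recommended. For deployment guidance, refer to the official documentation.
The new images already include DeepSeek V4 support; the patch procedure below is mainly intended for the legacy Day 0 images.
This directory provides the minimal-functionality patch for DeepSeek V4 and the corresponding deployment procedure.
The patch aims to enable the following options for DeepSeek-V4-Flash-w8a8-mtp on vllm-ascend:
- --tokenizer-mode deepseek_v4
- --tool-call-parser deepseek_v4
- --enable-auto-tool-choice
- --reasoning-parser deepseek_v4

Disclaimer:
- Adds agentic capabilities (reasoning / tool calling) on vllm-ascend.

Update 2026-04-30:

Update 2026-04-26:
- An issue involving the "arguments" field that caused tool call parsing to fail
- An issue that occurred with reasoning_effort="high"

Update 2026-04-24:
- vllm PR #40760; follow-up fixes related to streaming tool calls are based on vllm PR #40805
- The fix that prevents DSML fragments from leaking under auto + stream=true comes from PR #40805, which corresponds to issue #40801

Base images:
- quay.io/ascend/vllm-ascend:v0.13.0rc3
- quay.io/ascend/vllm-ascend:v0.13.0rc3-openeuler
- quay.io/ascend/vllm-ascend:v0.13.0rc3-a3
- quay.io/ascend/vllm-ascend:v0.13.0rc3-a3-openeuler

Prepare variables: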
PATCH_DIR=/path/to/patch
VLLM_REPO=/vllm-workspace/vllm
VLLM_ASCEND_REPO=/vllm-workspace/vllm-ascend
MODEL_PATH=/models/DeepSeek-V4-Flash-w8a8-mtp

Apply the vllm patch:
cd "$VLLM_REPO"
git apply --check "$PATCH_DIR/deepseek-v4-agentic-support.patch"
git apply "$PATCH_DIR/deepseek-v4-agentic-support.patch"

To roll back the vllm patch:
cd "$VLLM_REPO"
git apply -R --check "$PATCH_DIR/deepseek-v4-agentic-support.patch"
git apply -R "$PATCH_DIR/deepseek-v4-agentic-support.patch"
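git diff --stat

Notes:
- git apply -R reverses the file changes made by the patch; it suits the case where the patch was just applied and needs to be undone as a whole
- If git apply -R --check fails, it usually means the files were modified again after the patch was applied, so the current code no longer fully matches the patch context
- If the patch has already been committed, use git revert <commit> to revert that commit instead of running git apply -R again

Single-node TP8 launch: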
cd /workspace
export HCCL_OP_EXPANSION_MODE=AIV
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
vllm serve "$MODEL_PATH" \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--served-model-name dsv4 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 16 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--quantization ascend \
--async-scheduling \
--additional-config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": true}' \
--speculative-config '{"num_speculative_tokens": 1, "method": "mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--port 8000

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
model = "dsv4"
messages = [
    {
        "role": "user",
        "content": "What is 17*19? Return only the final integer.",
    }
]
# Non-think
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)
print("non_think:", resp.choices[0].message.content)
# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)
print("think_high content:", resp.choices[0].message.content)
print("think_high reasoning:", getattr(resp.choices[0].message, "reasoning", None))
# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)
print("think_max content:", resp.choices[0].message.content)
print("think_max reasoning:", getattr(resp.choices[0].message, "reasoning", None))

Expected results:
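- Non-think returns the final answer, for example 323
- Think High returns the final answer together with a reasoning field
- Think Max returns the final answer together with a reasoning field

from collections import defaultdict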
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
tool_calls = defaultdict(lambda: {"name": "", "arguments": ""})
visible_content = []
stream = client.chat.completions.create(
    model="dsv4",
    messages=[
        {
            "role": "user",
            "content": "Call the weather tool for Beijing today.",
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Query weather by city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    temperature=0,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Plain text that the client would show to the user
    if delta.content:
        visible_content.append(delta.content)
    # Tool call fragments arrive incrementally and are merged per index
    for tc in delta.tool_calls or []:
        entry = tool_calls[tc.index]
        if tc.function:
            if tc.function.name:
                entry["name"] = tc.function.name
            if tc.function.arguments:
                entry["arguments"] += tc.function.arguments
print("visible_content:", repr("".join(visible_content)))
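print("tool_calls:", dict(tool_calls))

Expected results:
- The visible content contains no internal marker fragments such as DSML / tool_calls
- The tool name is get_weather
- The arguments are {"location": "Beijing"}

Features / regression scenarios covered in this release: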
integer / boolean / array / string
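
The expected results of the streaming tool call check can also be asserted programmatically instead of being read off the printed output. The following is a minimal sketch, not part of the patch: the helper name check_tool_call, the LEAK_MARKERS substrings (taken from the wording "DSML / tool_calls" above, not from the actual token strings) and the type mapping are all assumptions for illustration. It takes the accumulated visible content and one accumulated tool call entry, and reports leaked markers, a wrong tool name, invalid JSON arguments, and arguments whose type does not match the declared integer / boolean / array / string type.

```python
import json

# Assumed marker substrings, taken from the wording of this README;
# replace them with the real internal tokens if they differ.
LEAK_MARKERS = ("DSML", "tool_calls")

# JSON Schema type name -> Python type produced by json.loads.
JSON_TYPES = {"integer": int, "boolean": bool, "array": list, "string": str}


def check_tool_call(visible_content, call, expected_name, schema_properties):
    """Return a list of problems found in one accumulated streamed tool call."""
    problems = []
    for marker in LEAK_MARKERS:
        if marker in visible_content:
            problems.append(f"leaked marker in content: {marker!r}")
    if call["name"] != expected_name:
        problems.append(f"unexpected tool name: {call['name']!r}")
    try:
        args = json.loads(call["arguments"])
    except json.JSONDecodeError as exc:
        problems.append(f"arguments are not valid JSON: {exc}")
        return problems
    if not isinstance(args, dict):
        problems.append("arguments are not a JSON object")
        return problems
    for key, value in args.items():
        declared = schema_properties.get(key, {}).get("type")
        expected_type = JSON_TYPES.get(declared)
        if expected_type is None:
            continue  # undeclared or unsupported type: not checked here
        # bool is a subclass of int in Python, so reject True/False for "integer".
        ok = isinstance(value, expected_type) and not (
            expected_type is int and isinstance(value, bool)
        )
        if not ok:
            problems.append(
                f"{key!r} should be {declared}, got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    call = {"name": "get_weather", "arguments": '{"location": "Beijing"}'}
    props = {"location": {"type": "string"}}
    print(check_tool_call("", call, "get_weather", props))
```

With the streaming script above, the helper could be called as check_tool_call("".join(visible_content), tool_calls[0], "get_weather", {"location": {"type": "string"}}); an empty list means all three expectations hold.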