This document records only one minimal reproducible path:
Create a new container from quay.io/ascend/vllm-ascend:v0.13.0rc3-a3
Apply the patch in /vllm-workspace/vllm
Bring up the dsv4 service

Patch file:
/data1/lcb/xxx/dsv4_013/dsv4_function_call_live.patch

Startup script used for validation:
/data1/lcb/xxx/dsv4_013/run_ds_v4_013_fc_validation.sh

Model path:
/mnt/share_space/xxx/models/DeepSeek-V4-Flash-w8a8

Create the container:
docker run -itd --privileged --name=ds_v4_013_fc --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/home \
-v /data1:/data1 \
-v /data2:/data2 \
-v /data3:/data3 \
-v /opt:/opt \
-v /mnt:/mnt \
--entrypoint /bin/bash \
quay.io/ascend/vllm-ascend:v0.13.0rc3-a3

Before going further, confirm which image the container was created from:
docker inspect ds_v4_013_fc --format '{{.Config.Image}} {{.Image}}'

Expected:
quay.io/ascend/vllm-ascend:v0.13.0rc3-a3 sha256:7f078ea3f8c35aee9e41b2b4a243d1bb65da72390fa5d50b5827510c577a5c31

Enter the vllm working tree and check its state:
docker exec ds_v4_013_fc bash -lc 'cd /vllm-workspace/vllm && git status --short'

First check that the patch applies cleanly:
docker exec ds_v4_013_fc bash -lc 'cd /vllm-workspace/vllm && git apply --check /data1/lcb/xxx/dsv4_013/dsv4_function_call_live.patch'

Then apply the patch:
docker exec ds_v4_013_fc bash -lc 'cd /vllm-workspace/vllm && git apply /data1/lcb/xxx/dsv4_013/dsv4_function_call_live.patch'

A static check of the patched files is recommended:
docker exec ds_v4_013_fc bash -lc 'python3 -m py_compile \
/vllm-workspace/vllm/vllm/config/model.py \
/vllm-workspace/vllm/vllm/tokenizers/registry.py \
/vllm-workspace/vllm/vllm/tool_parsers/__init__.py \
/vllm-workspace/vllm/vllm/tool_parsers/deepseekv32_tool_parser.py \
/vllm-workspace/vllm/vllm/tool_parsers/deepseekv4_tool_parser.py \
/vllm-workspace/vllm/vllm/tokenizers/deepseek_v4.py \
/vllm-workspace/vllm/vllm/tokenizers/deepseek_v4_encoding.py \
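/vllm-workspace/vllm/vllm/entrypoints/openai/serving_engine.py'

The validation run starts the service by executing the script directly inside the container:
docker exec -d ds_v4_013_fc bash -lc '/data1/lcb/xxx/dsv4_013/run_ds_v4_013_fc_validation.sh >/data1/lcb/xxx/dsv4_013/logs/dsv4_fc_md_validation_ds_v4_013_fc_20260424_225930.log 2>&1'

The key environment variables and launch arguments inside the script, identical to those used during validation: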
export PYTHONPATH=/vllm-workspace/vllm:/vllm-workspace/vllm-ascend:${PYTHONPATH}
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export ACL_OP_INIT_MODE=1
export VLLM_VERSION=0.13.0
export VLLM_TORCH_PROFILER_DIR=./vllm_profile
export VLLM_TORCH_PROFILER_WITH_STACK=0
export TRITON_ALL_BLOCKS_PARALLEL=1
vllm serve /mnt/share_space/xxx/models/DeepSeek-V4-Flash-w8a8 \
--host 0.0.0.0 \
--max_model_len 65536 \
--max-num-batched-tokens 8192 \
--served-model-name dsv4 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 16 \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--quantization ascend \
--port 8006 \
--block-size 128 \
--chat-template /mnt/share_space/xxx/models/DeepSeek-V4-Flash-w8a8/chat_template.jinja \
--async-scheduling \
--trust-remote-code \
--tokenizer-mode deepseek_v4 \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v4 \
--additional-config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": true}' \
--speculative-config '{"num_speculative_tokens": 1,"method": "deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Startup log:
/data1/lcb/xxx/dsv4_013/logs/dsv4_fc_md_validation_ds_v4_013_fc_20260424_225930.log

The service is up once the log contains the following lines (the process ID in brackets varies from run to run):
INFO: Started server process [3033]
INFO: Waiting for application startup.
INFO: Application startup complete.

/v1/models returns 200

docker exec ds_v4_013_fc bash -lc 'curl --noproxy "*" -sS http://127.0.0.1:8006/v1/models'

On success, the response should contain:
{
"id": "dsv4",
"root": "/mnt/share_space/xxx/models/DeepSeek-V4-Flash-w8a8"
}

tool_choice:function succeeds

On success, the response should satisfy:
tool_calls is present
tool_calls[0].function.name is correct

For example:
{
"finish_reason": "stop",
"tool_calls": [
{
"function": {
"name": "add_numbers",
"arguments": "{\"a\": 23, \"b\": 19}"
}
}
]
}

tool_choice:auto succeeds

On success, the response should satisfy:
finish_reason = "tool_calls"

For example:
{
"finish_reason": "tool_calls",
"tool_calls": [
{
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"北京\"}"
}
}
]
}

For calls to a moderation tool such as yidun, a successful validation should show at least:
{
"finish_reason": "tool_calls",
"tool_calls": [
{
"function": {
"name": "yidun",
"arguments": "{\"content\": \"hello\"}"
}
}
]
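}

With a template that forces a named function call, you can additionally verify that:
arguments.content is exactly identical to the target text

auto case succeeds

The following request has also been tested successfully in ds_v4_013_fc: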
{
"model": "dsv4",
"stream": false,
"messages": [
{
"role": "system",
"content": "使用工具返回"
},
{
"role": "user",
"content": "检测你好是敏感词吗"
}
],
"tool_choice": "auto",
"tools": [
{
"type": "function",
"function": {
"name": "检测工具",
"description": "An agent designed to detect user input in conversations and identify whether it contains sensitive words.",
"parameters": {
"properties": {
"content": {
"description": "待检测内容",
"type": "string"
}
},
"required": ["content"],
"type": "object"
}
}
}
]
}

Result file:
/data1/lcb/xxx/dsv4_013/results/dsv4_case_chinese_tool_auto_20260425.json

On success, the response contains:
{
"finish_reason": "tool_calls",
"tool_calls": [
{
"function": {
"name": "检测工具",
"arguments": "{\"content\": \"你好\"}"
}
}
]
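}

Notes:
检测工具 is matched in tool_calls
tool_choice = "auto" works as expected
arguments.content currently extracts the core term 你好 rather than the full sentence 检测你好是敏感词吗

Known points that still hold:
finish_reason may still be stop even when tool_calls itself is generated
tool_choice = "required" may still trigger a 400 under the current combination of async scheduling and speculative decoding
With tool_choice = "auto" and a natural-language question from the user, the argument of a moderation tool may contain only the extracted keyword; verbatim pass-through of the full sentence is not guaranteed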
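
The acceptance criteria in this document can be collected into a small checker. The following is a minimal sketch in Python and is not part of the validation script: the function name check_tool_call and its parameters are hypothetical, and it assumes the server returns the usual OpenAI-style chat-completions layout, where tool_calls sits under message inside each choice (the excerpts in this document are flattened and show only fragments). It accepts finish_reason = "stop" by default, in line with the known point on finish_reason, and compares arguments.content only when an exact target is passed in.

```python
import json


def check_tool_call(choice, expected_name, expected_content=None, strict_finish=False):
    """Return a list of problems found in one chat-completion choice (empty list = pass)."""
    problems = []
    # Assumption: OpenAI-style responses nest tool_calls under "message"; the
    # flattened excerpts in this document keep it at the top level, so accept both.
    message = choice.get("message") or {}
    tool_calls = message.get("tool_calls") or choice.get("tool_calls") or []
    if not tool_calls:
        return ["tool_calls is missing or empty"]

    function = tool_calls[0].get("function") or {}
    if function.get("name") != expected_name:
        problems.append(f"unexpected function name: {function.get('name')!r}")

    # arguments is a JSON-encoded string, so decode it before inspecting fields.
    try:
        arguments = json.loads(function.get("arguments") or "")
    except json.JSONDecodeError:
        arguments = None
    if not isinstance(arguments, dict):
        problems.append("arguments is not a JSON object")
        arguments = {}

    # Exact-match check, only meaningful when the full text is expected verbatim.
    if expected_content is not None and arguments.get("content") != expected_content:
        problems.append(f"arguments.content is {arguments.get('content')!r}")

    # finish_reason may still be "stop" while tool_calls is generated, so
    # "tool_calls" is enforced only when the caller asks for it.
    if strict_finish and choice.get("finish_reason") != "tool_calls":
        problems.append(f"finish_reason is {choice.get('finish_reason')!r}")
    return problems


if __name__ == "__main__":
    sample = {
        "finish_reason": "stop",
        "tool_calls": [
            {"function": {"name": "add_numbers", "arguments": "{\"a\": 23, \"b\": 19}"}}
        ],
    }
    print(check_tool_call(sample, "add_numbers"))
    print(check_tool_call(sample, "add_numbers", strict_finish=True))
```

For the tool_choice = "auto" cases, passing strict_finish=True matches the stated criterion finish_reason = "tool_calls"; for moderation tools called from a natural-language question, leave expected_content unset, since only the keyword may be extracted.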