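This case study uses two A2 machines to launch an inference service for the DeepSeek-V3.1-w8a8 model with vllm-ascend.

1. Prepare the weight files

Download the weight files from the ModelScope community. Before the full download, fetch config.json first and confirm that the quantization scheme is w8a8, because Ascend NPUs do not support fp8 compute precision.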

https://modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot

Install modelscope: pip install modelscope

Download the weight files into the current directory: modelscope download --model Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot --local_dir ./
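The config.json check described above can be scripted. The sketch below is a minimal illustration, assuming that an fp8 checkpoint declares "quant_method": "fp8" in its config.json (as the original DeepSeek fp8 releases do); the helper name is_fp8_config is hypothetical, and the single-file form of modelscope download shown in the comments should be confirmed against your installed modelscope version.

```shell
# Return 0 (true) if the given config.json declares an fp8 quantization method.
# Assumption: fp8 checkpoints carry "quant_method": "fp8" in config.json.
is_fp8_config() {
    grep -Eq '"quant_method"[[:space:]]*:[[:space:]]*"fp8"' "$1"
}

# Usage (run manually; requires network access):
#   modelscope download --model Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot config.json --local_dir ./
#   if is_fp8_config ./config.json; then
#       echo "fp8 weights detected: not usable on Ascend NPU"
#   fi
```

A negative result only rules out fp8; it is still worth opening config.json and reading the quantization fields directly.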

2. Prepare the image

Pull the image on both machines: docker pull quay.io/ascend/vllm-ascend:v0.11.0rc0

Alternatively, download the image package from https://quay.io/repository/ascend/vllm-ascend?tab=tags and load it with docker load -i <image package name>.

3. Create the containers

Create a container on each node with the same --name, using the quay.io/ascend/vllm-ascend:v0.11.0rc0 image or another version of the vllm-ascend image.

docker run --privileged \
--name deepseek-v31-w8a8 \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--shm-size=1000g \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home:/home \
-v /opt:/opt \
-it quay.io/ascend/vllm-ascend:v0.11.0rc0 bash

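4. Deploy the model

Run ifconfig on the physical host to find nic_name;

Set local_ip on both the primary and secondary nodes, and on the secondary node also set node0_ip to the primary node's IP;

Set --max-num-seqs to the maximum concurrency your scenario requires;

Set the model weight path in the vllm serve command.

The variables used in the scripts are described below: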

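nic_name: the name of this machine's network interface. It identifies the physical network device used for distributed communication, so that the communication libraries bind to the intended NIC and multi-NIC hosts do not pick the wrong one. Naming the interface explicitly helps reduce communication latency and errors.

local_ip: this machine's IP address, used as the node's address for distributed communication. It is used to set environment variables such as HCCL_IF_IP, and on the primary node it is passed to the --data-parallel-address argument of vllm serve (the secondary node passes node0_ip instead), so that the nodes can discover and connect to each other.

HCCL_IF_IP: the IP address used by the Huawei Collective Communication Library (HCCL). It binds HCCL operations to the specified IP so that the wrong network interface is not used.

VLLM_USE_MODELSCOPE: tells vLLM to fetch models from ModelScope rather than from the Hugging Face Hub. With this variable set, the model can be given as a ModelScope ID such as vllm-ascend/DeepSeek-V3.1-W8A8; this guide still uses the locally stored weight files.

GLOO_SOCKET_IFNAME: the network interface name for the Gloo communication backend (PyTorch's default CPU communication library).

TP_SOCKET_IFNAME: the network interface name for TP communication; like the HCCL settings, it binds that traffic to the specified NIC.

HCCL_SOCKET_IFNAME: the network interface name for HCCL sockets. It serves a similar purpose to HCCL_IF_IP, but selects the interface by name instead of by IP.

HCCL_OP_EXPANSION_MODE: configures the HCCL operator expansion mode. The scripts below set it to AIV.

PYTORCH_NPU_ALLOC_CONF=expandable_segments:True: controls PyTorch's memory allocation behavior on Ascend NPUs. expandable_segments:True lets memory segments grow dynamically, which reduces fragmentation and helps avoid OOM errors.

OMP_PROC_BIND: controls whether OpenMP threads are bound to CPU cores. false means no binding, so threads may migrate between cores; this is more flexible but can add latency. It is often combined with OMP_NUM_THREADS to balance the compute load.

OMP_NUM_THREADS: the number of threads OpenMP uses, which affects CPU-side parallel computation.

HCCL_BUFFSIZE: the size of the HCCL communication buffer, used to tune multi-node communication.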
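Since local_ip must match the address bound to nic_name, it can be derived instead of typed by hand. The sketch below is an optional convenience that swaps in the iproute2 ip command for ifconfig; the helper name first_ipv4 is hypothetical, and the interface name in the usage comment is the one from this guide's environment.

```shell
# Extract the first IPv4 address from `ip -4 -o addr show` output read on stdin.
first_ipv4() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inet") {
                split($(i + 1), parts, "/")   # drop the /prefix length
                print parts[1]
                exit
            }
        }
    }'
}

# Usage on a node (the interface name is site-specific):
#   nic_name="enp67s0f0np0"
#   local_ip=$(ip -4 -o addr show dev "$nic_name" | first_ipv4)
#   echo "$local_ip"
```

If the interface carries several IPv4 addresses, only the first is returned, so check the result before exporting HCCL_IF_IP.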

4.1 Set environment variables on the primary node (node0)

nic_name="enp67s0f0np0"
local_ip="71.10.29.114"
export HCCL_IF_IP=$local_ip
export VLLM_USE_MODELSCOPE=True
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024

4.2 Set environment variables on the secondary node (node1)

nic_name="enp67s0f0np0"
local_ip="71.10.29.116"
node0_ip="71.10.29.114"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_MODELSCOPE=True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024

4.3 Launch the service on node0

vllm serve /opt/data/weights/Deepseek/DeepSeek-V3.1-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--served-model-name ds-v31 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--max-num-seqs 8 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[8]}}'

4.4 Launch the service on node1

vllm serve /opt/data/weights/Deepseek/DeepSeek-V3.1-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--seed 1024 \
--tensor-parallel-size 8 \
--served-model-name ds-v31 \
--max-num-seqs 8 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[8]}}'

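4.5 Verify with curl

After the service starts, send a curl request to verify that the inference service is available. The API server listens on node0 (node1 runs with --headless), so run the command below on node0, or replace 0.0.0.0 with node0's IP when sending it from another machine.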

curl http://0.0.0.0:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ds-v31",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
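Loading the weights across two nodes takes a while, so it can help to poll the server before sending the first request. The sketch below is a minimal readiness loop, assuming the vLLM OpenAI-compatible server exposes a /health endpoint on the same port; the function name wait_for_vllm and its retry defaults are hypothetical.

```shell
# Poll <base_url>/health until it answers successfully or retries run out.
# Arguments: base URL, number of retries (default 60), delay in seconds (default 5).
wait_for_vllm() {
    base_url=$1
    retries=${2:-60}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$retries" ]; do
        # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
        if curl -sf --max-time 5 -o /dev/null "$base_url/health"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage on node0:
#   wait_for_vllm http://127.0.0.1:8000 && curl -s http://127.0.0.1:8000/v1/models
```

Once the loop returns successfully, the /v1/models response should list ds-v31, the name set with --served-model-name.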

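5. Feature verification

5.1 Graph mode

Official documentation: https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/graph_mode.html

--enforce-eager: set at model initialization to temporarily fall back to eager (single-operator) mode, that is, with graph mode disabled.

ACLGraph: the default graph mode supported by vLLM Ascend. In v0.9.1rc1, the Qwen and DeepSeek model series have both been well tested.

When eager mode is not configured, ACLGraph mode is enabled by default.

TorchairGraph: the GE graph mode. In v0.9.1rc1, only the DeepSeek model series is supported. It is enabled with the following option: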

--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[8]}}'

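Service log output: [INFO][platform.py:194] Torchair compilation enabled on NPU. Setting CUDAGraphMode to NONE. This log line shows that TorchAir graph compilation has been enabled on the NPU. Because CUDA Graph is a technology specific to NVIDIA GPUs and is not supported on Ascend NPUs, the system sets CUDAGraphMode to NONE. TorchAir is the graph-mode extension library that Ascend provides for PyTorch.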

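5.2 Thinking mode

By changing its chat template, a single DeepSeek-V3.1 model can serve both thinking and non-thinking modes.

Select thinking or non-thinking as the default by editing chat_template in the tokenizer_config.json shipped with the model weights.

1. To default to non-thinking mode when thinking is undefined: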

{% if not thinking is defined %}{% set thinking = false %}

2. To default to thinking mode when thinking is undefined:

{% if not thinking is defined %}{% set thinking = true %}
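Because the template only applies its default when thinking is undefined, a request can also set the variable itself. The sketch below assumes the vLLM chat completions endpoint accepts a chat_template_kwargs field and forwards it to the chat template; verify this against the vLLM version in your image. The helper name build_chat_body is hypothetical, and it does no JSON escaping, so keep the prompt free of quotes and backslashes.

```shell
# Print a chat-completions JSON body.
# Arguments: served model name, prompt text, literal true or false for thinking.
build_chat_body() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "chat_template_kwargs": {"thinking": %s}}\n' "$1" "$2" "$3"
}

# Usage on node0 (thinking mode on for this request only):
#   curl http://0.0.0.0:8000/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d "$(build_chat_body ds-v31 "Which is larger, 9.11 or 9.8?" true)"
```

Passing false as the third argument requests non-thinking mode regardless of the default written in tokenizer_config.json.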

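5.3 Function calling

Function calling is an important capability of large language models. It lets the model understand the implicit intent in a user request and turn it into a structured function call request. In effect this gives the model the ability to act: it is no longer limited to generating text and can trigger external systems and APIs.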
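To make the idea of a structured request concrete, the sketch below builds an OpenAI-style chat completions body that offers one tool to the model. The get_weather tool and its schema are hypothetical, as is the helper name build_tool_body. It assumes the chat template already contains the tool-call section; server-side parsing of the model's reply into tool_calls typically also requires vllm serve to be started with --enable-auto-tool-choice and a --tool-call-parser value, which this guide's launch commands do not set.

```shell
# Print an example chat-completions body that offers one tool to the model.
# Argument: served model name. get_weather is a hypothetical tool.
build_tool_body() {
    cat <<EOF
{
  "model": "$1",
  "messages": [{"role": "user", "content": "What is the weather in Shanghai?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF
}

# Usage on node0:
#   curl http://0.0.0.0:8000/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d "$(build_tool_body ds-v31)"
```

The function description and parameters schema are what the chat template renders into the system prompt, so keep them short and precise.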

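Reference:

Modelers_Park/DeepSeek-V3.1-w8a8 https://modelers.cn/models/Modelers_Park/DeepSeek-V3.1-w8a8#%E5%BF%AB%E9%80%9F%E5%BC%80%E5%A7%8B

Modify tokenizer_config.json to enable function calling

The chat_template in the model's default tokenizer_config.json does not include the tool_calls section, so add it according to the official instructions.

Replace the chat_template in tokenizer_config.json with the one below, or overwrite the original file with tokenizer_config_with_tool_call.json, the already-modified file in the model repository: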

  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% if not thinking is defined %}{% set thinking = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{% if tools is defined and tools is not none %}{% set tool_ns = namespace(text='## Tools\nYou have access to the following tools:\n') %}{% for tool in tools %}{% set tool_ns.text = tool_ns.text + '\n### ' + tool.function.name + '\nDescription: ' + tool.function.description + '\n\nParameters: ' + (tool.function.parameters | tojson) + '\n' %}{% endfor %}{% set tool_ns.text = tool_ns.text + \"\nIMPORTANT: ALWAYS adhere to this exact format for tool use:\n<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>tool_call_name<｜tool▁sep｜>tool_call_arguments<｜tool▁call▁end｜>{{additional_tool_calls}}<｜tool▁calls▁end｜>\n\nWhere:\n\n- `tool_call_name` must be an exact match to one of the available tools\n- `tool_call_arguments` must be valid JSON that strictly follows the tool's Parameters Schema\n- For multiple tool calls, chain them directly without separators or spaces\n\" %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + tool_ns.text %}{% endif %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- if ns.is_last_user %}{{'<｜Assistant｜></think>'}}{%- endif %}{%- set ns.is_last_user = 
false -%}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>'+ tool['function']['name'] + '<｜tool▁sep｜>' + tool['function']['arguments']|tojson + '<｜tool▁call▁end｜>'}}{%- else %}{{message['content'] + '<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['function']['name'] + '<｜tool▁sep｜>' + tool['function']['arguments']|tojson + '<｜tool▁call▁end｜>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'<｜tool▁call▁begin｜>'+ tool['function']['name'] + '<｜tool▁sep｜>' + tool['function']['arguments']|tojson + '<｜tool▁call▁end｜>'}}{%- endif %}{%- endfor %}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none) %}{%- if ns.is_last_user %}{{'<｜Assistant｜>'}}{%- if message['prefix'] is defined and message['prefix'] and thinking %}{{''}}{%- endif %}{%- endif %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{%- set content = message['content'] -%}{%- if '</think>' in content %}{%- set content = content.split('</think>', 1)[1] -%}{%- endif %}{{content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{{'<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endfor -%}{%- if add_generation_prompt and ns.is_last_user and not ns.is_tool %}{{'<｜Assistant｜>'}}{%- if not thinking %}{{'</think>'}}{%- else %}{{'