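This guide uses MindIE 2.2.RC1, the latest release at the time of writing, to deploy the DeepSeek-V3.1-Terminus-w8a8c8 model across two Ascend A2 servers (16 NPUs in total). It includes the dual-node configuration files and provides the best-performing configuration for each context-length scenario.
On September 22, 2025, DeepSeek-V3.1 was upgraded to DeepSeek-V3.1-Terminus. The update keeps the model's existing capabilities while addressing issues reported by users:
① Language consistency: fewer cases of mixed Chinese and English text and of occasional abnormal characters;
② Agent capability: further improved Code Agent and Search Agent performance.
DeepSeek-V3.1-Terminus produces more stable output than the previous version and is especially suited to long-text processing and code generation.
The following is a guide to deploying the model on Ascend NPUs.
Model weights download address: https://modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w8a8c8-mtp-QuaRot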
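| Device | Quantity |
|---|---|
| Atlas 800I/800T A2 (910B 64G*8) | 2 |
| Switch (CE9855/9860/XH9210) | 1 |
| 400G-2*200G optical fiber | 4 |

The servers need network access to download the image and the weight files. If they cannot reach the network, download the files in advance and upload them to the servers.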
| Software | Version |
|---|---|
| Python | 3.11.10 |
| PyTorch (Ascend Extension for PyTorch, torch_npu) | 7.2.0 |
| MindIE | 2.2.RC1 |
| Ascend HDK | 25.2.1 |
| CANN | 8.3.RC2 |
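Upgrade the HDK and CANN on all Ascend servers in advance (the table above lists the versions used here). For the procedure, refer to the driver upgrade guide.
Note: choose the ARM build, and upgrade the firmware first, then the driver.
The upgrade interrupts running services, so plan the change window in advance.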
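Log in to each Ascend server over SSH and perform the downloads there. The MindIE image is hosted in the Ascend community / Ascend image repository, and downloading it requires applying for permission first. Once the request is approved, follow the repository's guide, or pull the MindIE inference framework image directly with docker pull:

docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.2.RC1-800I-A2-py311-openeuler24.03-lts

Install modelscope so that the model weights can be downloaded over the network:

pip install modelscope

Use modelscope to download the weight files to a directory on the server. Before downloading, make sure that both servers have enough free space: the files total 647 GB.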
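A quick free-space check can catch a short disk before a multi-hour download starts. The sketch below is an illustration rather than part of the original procedure: the function name `has_free_space` is hypothetical, and it reads the 647 GB figure as GiB, which errs on the safe side.

```shell
# has_free_space DIR REQUIRED_GB
# Succeeds (exit status 0) when the filesystem that holds DIR has at least
# REQUIRED_GB GiB available to the current user.
has_free_space() {
    dir=$1
    required_gb=$2
    # df -P prints one fixed-format line per filesystem; -k reports 1 KiB blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    required_kb=$((required_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$required_kb" ]
}
```

For example, `has_free_space /ai 647 || echo "not enough space under /ai"` could be run on each server before starting the download.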
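The download takes a long time, depending on network conditions, so running it under nohup is recommended. The --local_dir value below is the directory that modelWeightPath points to later in this guide:

nohup modelscope download --model Eco-Tech/DeepSeek-V3.1-Terminus-w8a8c8-mtp-QuaRot --local_dir /ai/DeepSeek-V3.1-Terminus-w8a8c8 > ds3.1t-w8a8c8-download.log 2>&1 &

After the download finishes, edit config.json in the model directory and change the value of "torch_dtype" on line 54 from "bfloat16" to "float16".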
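The same edit can be scripted. This is a minimal sketch that assumes the key is written as `"torch_dtype": "bfloat16"` (with any spacing around the colon); `set_torch_dtype_fp16` is a hypothetical helper name, and it keeps a `.bak` copy of the original file.

```shell
# set_torch_dtype_fp16 CONFIG_JSON
# Rewrites "torch_dtype": "bfloat16" to "torch_dtype": "float16" and keeps
# the unmodified file as CONFIG_JSON.bak.
set_torch_dtype_fp16() {
    cfg=$1
    cp "$cfg" "$cfg.bak" || return 1
    sed 's/"torch_dtype"[[:space:]]*:[[:space:]]*"bfloat16"/"torch_dtype": "float16"/' \
        "$cfg.bak" > "$cfg" || return 1
    # Report failure when the file does not end up with the float16 setting.
    grep -q '"torch_dtype": "float16"' "$cfg"
}
```

Usage would look like `set_torch_dtype_fp16 /ai/DeepSeek-V3.1-Terminus-w8a8c8/config.json`; it is still worth opening the file afterwards to confirm the result.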
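To save time, download the weights on one server and then copy them with scp to the same directory on the other server:

scp -r /ai/DeepSeek-V3.1-Terminus-w8a8c8 {user}@{IP of the other server}:/ai/

Use docker run to create the container and mount the 8 NPU cards: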
docker run -itd --privileged=true --name=ds3.1t-w8a8c8 --net=host --shm-size 800g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home/:/home/ \
-v /ai:/ai \
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.2.RC1-800I-A2-py311-openeuler24.03-lts \
/bin/bash

Run this command on both Ascend servers used for the deployment. The /ai directory mounted above holds the model weight files; adjust it to your environment.
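Before running docker run, it can help to confirm that every device node passed with --device exists on the host. A small sketch, in which `missing_npu_devices` is a hypothetical helper and the list of names is taken from the command above:

```shell
# missing_npu_devices [DEV_DIR]
# Prints each expected NPU device node that is absent under DEV_DIR
# (default: /dev) and returns non-zero when at least one is missing.
missing_npu_devices() {
    dev_dir=${1:-/dev}
    missing=0
    for name in davinci0 davinci1 davinci2 davinci3 davinci4 davinci5 \
                davinci6 davinci7 davinci_manager hisi_hdc devmm_svm; do
        if [ ! -e "$dev_dir/$name" ]; then
            echo "$dev_dir/$name"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Running `missing_npu_devices` with no arguments on each host prints nothing when all eleven nodes are present.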
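Perform the following steps on both Ascend servers. Enter the container:

docker exec -it ds3.1t-w8a8c8 bash

Create ranktable.json, which specifies, for every node, each card's device ID, communication IP, and rank ID:

vi ranktable.json

Enter the following content: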
{
  "version": "1.0",
  "server_count": "2",
  "server_list": [
    {
      "server_id": "{node 1 IP}",
      "container_ip": "{node 1 IP}",
      "device": [
        {"device_id": "0", "device_ip": "{device communication IP}", "rank_id": "0"},
        {"device_id": "1", "device_ip": "{device communication IP}", "rank_id": "1"},
        {"device_id": "2", "device_ip": "{device communication IP}", "rank_id": "2"},
        {"device_id": "3", "device_ip": "{device communication IP}", "rank_id": "3"},
        {"device_id": "4", "device_ip": "{device communication IP}", "rank_id": "4"},
        {"device_id": "5", "device_ip": "{device communication IP}", "rank_id": "5"},
        {"device_id": "6", "device_ip": "{device communication IP}", "rank_id": "6"},
        {"device_id": "7", "device_ip": "{device communication IP}", "rank_id": "7"}
      ],
      "host_nic_ip": "reserve"
    },
    {
      "server_id": "{node 2 IP}",
      "container_ip": "{node 2 IP}",
      "device": [
        {"device_id": "0", "device_ip": "{device communication IP}", "rank_id": "8"},
        {"device_id": "1", "device_ip": "{device communication IP}", "rank_id": "9"},
        {"device_id": "2", "device_ip": "{device communication IP}", "rank_id": "10"},
        {"device_id": "3", "device_ip": "{device communication IP}", "rank_id": "11"},
        {"device_id": "4", "device_ip": "{device communication IP}", "rank_id": "12"},
        {"device_id": "5", "device_ip": "{device communication IP}", "rank_id": "13"},
        {"device_id": "6", "device_ip": "{device communication IP}", "rank_id": "14"},
        {"device_id": "7", "device_ip": "{device communication IP}", "rank_id": "15"}
      ],
      "host_nic_ip": "reserve"
    }
  ],
  "status": "completed"
}

To find the device_ip values for this file, run the following command on each node:
for i in {0..7}; do hccn_tool -i $i -ip -g; done

The content of ranktable.json is identical on both nodes. Note the file's full path; it is needed in the configuration below.
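Writing sixteen device entries by hand is error-prone, so the file can also be generated. The sketch below assumes exactly two nodes and that each node's device IPs (as reported by hccn_tool) are passed in device_id order; `print_server_entry` and `print_ranktable` are hypothetical helper names, not MindIE tools.

```shell
# print_server_entry SERVER_IP FIRST_RANK DEVICE_IPS
# DEVICE_IPS is a space-separated list, one IP per card in device_id order.
print_server_entry() {
    server_ip=$1
    rank=$2
    device_id=0
    printf '    {\n      "server_id": "%s",\n      "container_ip": "%s",\n      "device": [\n' \
        "$server_ip" "$server_ip"
    for device_ip in $3; do
        # Separate entries with a comma, but emit none before the first one.
        if [ "$device_id" -gt 0 ]; then printf ',\n'; fi
        printf '        {"device_id": "%s", "device_ip": "%s", "rank_id": "%s"}' \
            "$device_id" "$device_ip" "$rank"
        device_id=$((device_id + 1))
        rank=$((rank + 1))
    done
    printf '\n      ],\n      "host_nic_ip": "reserve"\n    }'
}

# print_ranktable NODE1_IP NODE1_DEVICE_IPS NODE2_IP NODE2_DEVICE_IPS
print_ranktable() {
    # Ranks on node 2 continue where node 1 stops.
    node2_first_rank=$(( $(echo $2 | wc -w) ))
    printf '{\n  "version": "1.0",\n  "server_count": "2",\n  "server_list": [\n'
    print_server_entry "$1" 0 "$2"
    printf ',\n'
    print_server_entry "$3" "$node2_first_rank" "$4"
    printf '\n  ],\n  "status": "completed"\n}\n'
}
```

A call such as `print_ranktable {node 1 IP} "{ip0} {ip1} ... {ip7}" {node 2 IP} "{ip0} {ip1} ... {ip7}" > ranktable.json` then produces the same structure as the hand-written file, with rank IDs 0 to 15 assigned in order.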
Next, create the environment variable script:

vi env.sh

Enter the following content:
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/atb-models/set_env.sh
export MIES_CONTAINER_IP={IP of this node}
export RANK_TABLE_FILE={full path to ranktable.json}
export MASTER_IP={master node IP}
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
export ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
export HCCL_OP_EXPANSION_MODE="AIV"
export NPU_MEMORY_FRACTION=0.96
export ATB_LLM_HCCL_ENABLE=1
export INF_NAN_MODE_ENABLE=1
# Feature 8244: switch that prevents OOM
export ATB_LAYER_INTERNAL_TENSOR_REUSE=1
export HCCL_CONNECT_TIMEOUT=3600
export WORLD_SIZE=16
export HCCL_EXEC_TIMEOUT=0
# On A3 the list runs up to 15; on dual-node A3, up to 33
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ATB_OPERATION_EXECUTE_ASYNC=1
export ATB_LLM_ENABLE_AUTO_TRANSPOSE=0
export HCCL_RDMA_PCIE_DIRECT_POST_NOSTRICT=TRUE
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export HCCL_BUFFSIZE=64
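# Asynchronous scheduling (dual dispatch)
export MINDIE_ASYNC_SCHEDULING_ENABLE=1
# jemalloc optimization: the library path is /usr/lib64/libjemalloc.so.2 in the community image and /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 in the R&D image
#export LD_PRELOAD={path to libjemalloc.so in this environment}
# Task queue optimization feature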
export TASK_QUEUE_ENABLE=1
for var in $(compgen -e | grep 'STDOUT$'); do
export "$var=0"
done
for var in $(compgen -e | grep 'LOG_TO_FILE$'); do
export "$var=0"
done
# Work around config.json file permission problems
find /usr/local/lib/python3.11/site-packages/mindie* -name config.json |xargs chmod -R 640
#chmod -R 640 {path to the dual-node ranktable file}
# Logging switches
export MINDIE_LOG_TO_STDOUT=0
export MINDIE_LOG_TO_FILE=1
export MINDIE_LOG_LEVEL=info
export OMP_NUM_THREADS=10
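# HCCL device-memory allocation strategy; once set, no additional device memory is requested while the service is running.
export HCCL_ALGO="level0:NA;level1:pipeline"

The master node IP is normally the IP of the first node. After saving the file, load the environment variables:

source env.sh

Wait for the environment variables to finish loading.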
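Because the three placeholder values in env.sh must be filled in by hand on each node, a short check after sourcing the file can catch an empty one. A minimal sketch; `check_required_env` is a hypothetical helper name:

```shell
# check_required_env NAME...
# Prints "missing: NAME" for every listed variable that is unset or empty
# and returns non-zero when any are missing.
check_required_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example: `check_required_env MIES_CONTAINER_IP RANK_TABLE_FILE MASTER_IP WORLD_SIZE`. Only pass trusted variable names, since the helper relies on eval.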
Next, edit the MindIE service configuration file:

cd /usr/local/Ascend/mindie/latest/mindie-service/conf
vi config.json

For short-context inference (<32K), use the following configuration:
{
"Version" : "1.0.0",
"ServerConfig" :
{
"ipAddress" : "{master node IP}",
"managementIpAddress" : "{master node IP}",
"port" : 1025,
"managementPort" : 1026,
"metricsPort" : 1027,
"allowAllZeroIpListening" : false,
"maxLinkNum" : 1000,
"httpsEnabled" : false,
"fullTextEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/key_pwd.txt",
"tlsCrlPath" : "security/certs/",
"tlsCrlFiles" : ["server_crl.pem"],
"managementTlsCaFile" : ["management_ca.pem"],
"managementTlsCert" : "security/certs/management/server.pem",
"managementTlsPk" : "security/keys/management/server.key.pem",
"managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
"managementTlsCrlPath" : "security/management/certs/",
"managementTlsCrlFiles" : ["server_crl.pem"],
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"inferMode" : "standard",
"interCommTLSEnabled" : false,
"interCommPort" : 1121,
"interCommTlsCaPath" : "security/grpc/ca/",
"interCommTlsCaFiles" : ["ca.pem"],
"interCommTlsCert" : "security/grpc/certs/server.pem",
"interCommPk" : "security/grpc/keys/server.key.pem",
"interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
"interCommTlsCrlPath" : "security/grpc/certs/",
"interCommTlsCrlFiles" : ["server_crl.pem"],
"openAiSupport" : "vllm",
"tokenTimeout" : 3600,
"e2eTimeout" : 65535,
"distDPServerEnabled":false
},
"BackendConfig" : {
"backendName" : "mindieservice_llm_engine",
"modelInstanceNumber" : 1,
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"tokenizerProcessNumber" : 8,
"multiNodesInferEnabled" : true,
"multiNodesInferPort" : 1120,
"interNodeTLSEnabled" : false,
"interNodeTlsCaPath" : "security/grpc/ca/",
"interNodeTlsCaFiles" : ["ca.pem"],
"interNodeTlsCert" : "security/grpc/certs/server.pem",
"interNodeTlsPk" : "security/grpc/keys/server.key.pem",
"interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
"interNodeTlsCrlPath" : "security/grpc/certs/",
"interNodeTlsCrlFiles" : ["server_crl.pem"],
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
"kvPoolConfig" : {"backend":"", "configPath":""},
"ModelDeployConfig" :
{
"maxSeqLen" : 32768,
"maxInputTokenLen" : 32768,
"truncation" : false,
"ModelConfig" : [
{
"modelInstanceType" : "Standard",
"modelName" : "ds3.1t-w8a8c8",
"modelWeightPath" : "/ai/DeepSeek-V3.1-Terminus-w8a8c8/",
"worldSize" : 8,
"cpuMemSize" : 5,
"npuMemSize" : -1,
"backendType" : "atb",
"trustRemoteCode" : false,
"async_scheduler_wait_time": 120,
"kv_trans_timeout": 10,
"kv_link_timeout": 1080,
"dp": 2,
"sp": 1,
"tp": 8,
"cp": 1,
"moe_ep": 4,
"moe_tp": 4,
"plugin_params":"{\"plugin_type\":\"mtp\",\"num_speculative_tokens\": 1}",
"models": {
"deepseekv2": {
"enable_mlapo_prefetch": true,
"kv_cache_options": {"enable_nz": true}
}
},
"llm": {
"parallel_options": {
"dense_mlp_local_tp": 16
}
}
}
]
},
"ScheduleConfig" :
{
"templateType" : "Standard",
"templateName" : "Standard_LLM",
"cacheBlockSize" : 128,
"maxPrefillBatchSize" : 50,
"maxPrefillTokens" : 32768,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 32768,
"maxPreemptCount" : 0,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
},
"LogConfig": {
"dynamicLogLevel" : "",
"dynamicLogLevelValidHours" : 2,
"dynamicLogLevelValidTime" : ""
}
}

To support inference with contexts of up to 128K, use the following configuration:
{
"Version" : "1.0.0",
"ServerConfig" :
{
"ipAddress" : "{master node IP}",
"managementIpAddress" : "{master node IP}",
"port" : 1025,
"managementPort" : 1026,
"metricsPort" : 1027,
"allowAllZeroIpListening" : false,
"maxLinkNum" : 1000,
"httpsEnabled" : false,
"fullTextEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/key_pwd.txt",
"tlsCrlPath" : "security/certs/",
"tlsCrlFiles" : ["server_crl.pem"],
"managementTlsCaFile" : ["management_ca.pem"],
"managementTlsCert" : "security/certs/management/server.pem",
"managementTlsPk" : "security/keys/management/server.key.pem",
"managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
"managementTlsCrlPath" : "security/management/certs/",
"managementTlsCrlFiles" : ["server_crl.pem"],
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"inferMode" : "standard",
"interCommTLSEnabled" : false,
"interCommPort" : 1121,
"interCommTlsCaPath" : "security/grpc/ca/",
"interCommTlsCaFiles" : ["ca.pem"],
"interCommTlsCert" : "security/grpc/certs/server.pem",
"interCommPk" : "security/grpc/keys/server.key.pem",
"interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
"interCommTlsCrlPath" : "security/grpc/certs/",
"interCommTlsCrlFiles" : ["server_crl.pem"],
"openAiSupport" : "vllm",
"tokenTimeout" : 3600,
"e2eTimeout" : 65535,
"distDPServerEnabled":false
},
"BackendConfig" : {
"backendName" : "mindieservice_llm_engine",
"modelInstanceNumber" : 1,
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
"tokenizerProcessNumber" : 8,
"multiNodesInferEnabled" : true,
"multiNodesInferPort" : 1120,
"interNodeTLSEnabled" : false,
"interNodeTlsCaPath" : "security/grpc/ca/",
"interNodeTlsCaFiles" : ["ca.pem"],
"interNodeTlsCert" : "security/grpc/certs/server.pem",
"interNodeTlsPk" : "security/grpc/keys/server.key.pem",
"interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
"interNodeTlsCrlPath" : "security/grpc/certs/",
"interNodeTlsCrlFiles" : ["server_crl.pem"],
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
"kvPoolConfig" : {"backend":"", "configPath":""},
"ModelDeployConfig" :
{
"maxSeqLen" : 131072,
"maxInputTokenLen" : 131072,
"truncation" : false,
"ModelConfig" : [
{
"modelInstanceType" : "Standard",
"modelName" : "ds3.1t-w8a8c8",
"modelWeightPath" : "/ai/DeepSeek-V3.1-Terminus-w8a8c8/",
"worldSize" : 8,
"cpuMemSize" : 5,
"npuMemSize" : -1,
"backendType" : "atb",
"trustRemoteCode" : false,
"async_scheduler_wait_time": 120,
"kv_trans_timeout": 10,
"kv_link_timeout": 1080,
"dp": 1,
"sp": 8,
"tp": 8,
"cp": 2,
"moe_ep": 16,
"moe_tp": 1,
"models": {
"deepseekv2": {
"ep_level": 1,
"enable_init_routing_cutoff": true,
"topk_scaling_factor": 0.25,
"enable_oproj_prefetch": true,
"enable_mlapo_prefetch": true,
"kv_cache_options": {"enable_nz": true},
"tool_call_options": {"tool_call_parser": "deepseekv31"},
"chat_template": "/ai/DeepSeek-V3.1-Terminus-w8a8c8/tool_chat_template_deepseekv31.jinja"
}
},
"llm": {
"parallel_options": {
"dense_mlp_local_tp": 16
}
}
}
]
},
"ScheduleConfig" :
{
"templateType" : "Standard",
"templateName" : "Standard_LLM",
"cacheBlockSize" : 128,
"maxPrefillBatchSize" : 10,
"maxPrefillTokens" : 131072,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 20,
"maxIterTimes" : 131072,
"maxPreemptCount" : 0,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
},
"LogConfig": {
"dynamicLogLevel" : "",
"dynamicLogLevelValidHours" : 2,
"dynamicLogLevelValidTime" : ""
}
}

The master node IP must match the MASTER_IP value configured in env.sh. After saving the configuration file, the service can be started.
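In both configurations above, dp × tp × cp and moe_ep × moe_tp each equal 16, the total number of NPUs. Assuming that relation is intended (it is read off the two examples here, not taken from the MindIE documentation), a quick check after editing can catch a mistyped value. `json_int` and `check_parallel_layout` are hypothetical helpers, and the crude sed extraction only works when each key sits on its own line, as in the files above.

```shell
# json_int KEY FILE
# Prints the first integer value of "KEY" found at the start of a line in FILE.
json_int() {
    sed -n 's/^[[:space:]]*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$2" |
        head -n 1
}

# check_parallel_layout FILE TOTAL_CARDS
# Succeeds when dp*tp*cp and moe_ep*moe_tp both equal TOTAL_CARDS.
# A key that is absent from FILE is treated as 1.
check_parallel_layout() {
    cfg=$1
    total=$2
    dp=$(json_int dp "$cfg");         dp=${dp:-1}
    tp=$(json_int tp "$cfg");         tp=${tp:-1}
    cp=$(json_int cp "$cfg");         cp=${cp:-1}
    moe_ep=$(json_int moe_ep "$cfg"); moe_ep=${moe_ep:-1}
    moe_tp=$(json_int moe_tp "$cfg"); moe_tp=${moe_tp:-1}
    attn=$((dp * tp * cp))
    moe=$((moe_ep * moe_tp))
    echo "dp*tp*cp=$attn moe_ep*moe_tp=$moe expected=$total"
    [ "$attn" -eq "$total" ] && [ "$moe" -eq "$total" ]
}
```

For this deployment the call would be `check_parallel_layout config.json 16`.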
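Start the service with the following commands:

cd /usr/local/Ascend/mindie/latest/mindie-service/
nohup ./bin/mindieservice_daemon > output.log 2>&1 &

Follow the log with:

tail -f output.log

Wait 5 to 10 minutes. When the log shows "Daemon start success!", the inference service has started successfully.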
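Instead of watching the log by hand, a script can poll for the success message. This is a sketch only; `wait_for_daemon` is a hypothetical helper, and the message text is the one quoted above.

```shell
# wait_for_daemon LOG_FILE TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls LOG_FILE until it contains "Daemon start success!".
# Returns 0 on success and 1 when the timeout expires first.
wait_for_daemon() {
    log_file=$1
    timeout=$2
    interval=${3:-5}
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        if grep -q 'Daemon start success!' "$log_file" 2>/dev/null; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}
```

For example, `wait_for_daemon output.log 900 && echo "service is up"` waits up to 15 minutes, which leaves some margin over the 5 to 10 minutes mentioned above.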
Run the following curl command on the host to check that the API responds normally:
time curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "ds3.1t-w8a8c8",
"messages": [
{
"role": "user",
"content": "What are some tourist attractions in China?"
}
],
"max_tokens": 300,
"temperature": 0.5,
"top_p": 0.96,
"stream": false
}' http://{master node IP}:1025/v1/chat/completions

Use the Run_Benchmark tool to measure the model's inference performance. Tool introduction: https://gitcode.com/Ascend-SACT/Run_Benchmark
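The tool is built on the AISBench evaluation tool and bundles text, image, video, and audio datasets, which makes it convenient to evaluate a wide range of models.
AISBench must be installed before running Run_Benchmark. For installation instructions, see: https://gitee.com/aisbench/benchmark#-%E5%B7%A5%E5%85%B7%E5%AE%89%E8%A3%85
After installation, run ais_bench -h; if the command's parameters are printed, the installation succeeded.
Open the Run_Benchmark home page and download the source code as a zip archive. Upload it to the server and configure it as follows:

vi run_benchmark.sh

Modify the following test settings: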
VERSION=""
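Sever_NAME="" # model name used when starting the service
SERVICE_IP="" # IP of the local inference server
SERVICE_PORT="" # port configured when the service was started
MODEL_PATH="" # path to the weights
# For an 8-card deployment with prefill and decode co-located (mixed PD), set both parameters below to 8
D_NUM=8
ALL_NUM=8
# Number of rounds. Total test requests = concurrency * rounds
Concurrent_Multiplier=4
# For NLP models such as DeepSeek, the SYN synthetic dataset is sufficient: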
PARAM_SETS=(
"0 1 32768 1024 SYN 0"
"0 2 32768 1024 SYN 0"
"0 4 32768 1024 SYN 0"
"0 8 32768 1024 SYN 0"
)
# The example above runs 4 test tasks with concurrency 1, 2, 4, and 8. Each task uses 32K input and 1K output with the SYN dataset type, for 4 rounds.
# Fill in the Location path shown by pip show ais-bench-benchmark
AISBENCHMARK_PATH=""
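# Search for Data_Command_Type to change the dataset name
# This example uses the MindIE inference framework; for SYN datasets it uses the synthetic_gen dataset, so the default script needs no change.

Save the modified file.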
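The script comment states that the total number of requests is concurrency × rounds, so the size of a run can be estimated before it is launched. In the sketch below, `plan_requests` is a hypothetical helper; the field positions (second = concurrency, third = input length, fourth = output length) are inferred from the comments above, and the first and last fields are not interpreted.

```shell
# plan_requests MULTIPLIER ENTRY...
# For each PARAM_SETS-style entry, prints the concurrency, the input and
# output lengths, and the resulting request count (concurrency * MULTIPLIER).
plan_requests() {
    multiplier=$1
    shift
    for entry in "$@"; do
        echo "$entry" | awk -v m="$multiplier" \
            '{ printf "concurrency=%d input=%d output=%d requests=%d\n", $2, $3, $4, $2 * m }'
    done
}
```

With the settings above, `plan_requests 4 "0 1 32768 1024 SYN 0" "0 8 32768 1024 SYN 0"` reports 4 requests for the first task and 32 for the last.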
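Because the test takes longer as the number of requests grows, start it with nohup:

nohup bash run_benchmark.sh > test_ais_bench.log 2>&1 &

When the test finishes, the results are in summary.csv in the target directory.
In the current version, every run of the benchmark script appends its results to the end of summary.csv. The results include TTFT, TPOT, throughput, output throughput, per-card throughput, QPS, and QPM, and can be used to tune service parameters for the best performance.
Out-of-memory errors on the NPU are usually caused by unsuitable values for maxSeqLen, maxInputTokenLen, maxPrefillTokens, or maxIterTimes in MindIE's config.json; consult the official parameter descriptions when troubleshooting.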
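When troubleshooting, it is convenient to see the four length-related values side by side. A minimal sketch, with `show_length_limits` as a hypothetical helper that simply greps the keys named above:

```shell
# show_length_limits CONFIG_JSON
# Prints the lines that set the four length-related parameters, with
# leading indentation and trailing commas removed.
show_length_limits() {
    grep -E '"(maxSeqLen|maxInputTokenLen|maxPrefillTokens|maxIterTimes)"' "$1" |
        sed 's/^[[:space:]]*//; s/,[[:space:]]*$//'
}
```

For example: `show_length_limits /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json`.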