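openPangu-Ultra-MoE-718B-V1.1 is a large-scale mixture-of-experts (MoE) language model trained on Ascend NPUs, with 718 billion total parameters and 39 billion activated parameters. A single model provides both fast-thinking and slow-thinking modes. Compared with openPangu-Ultra-MoE-718B-V1.0, V1.1 mainly improves agent tool-calling ability and lowers the hallucination rate, and its other general capabilities are further improved as well.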
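The architecture of openPangu-Ultra-MoE-718B-V1.1 adopts techniques that are mainstream in the industry, namely Multi-head Latent Attention (MLA), Multi-Token Prediction (MTP), and a high sparsity ratio, along with several designs specific to this model.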
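A2 out-of-the-box environment:
Driver version: 25.3.rc1
CANN version: 8.2.RC1
torch version: 2.5.1
torch_npu version: 2.5.1.post1
vllm version: 0.9.2
vllm-ascend version: v0.9.2rc1
Hardware: A2, 4 nodes with 32 NPUs in total
Deployment image: quay.io/ascend/vllm-ascend:v0.9.1-dev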
Paper: Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
GitCode model repository: https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1
vLLM user guide (Chinese site): OpenAI-Compatible Server, vLLM documentation
vllm-ascend user guide: Features and Models, vllm-ascend
Run the following commands in order on every node. Every result must be success and every link status must be UP:
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
Get the NPU IP addresses:
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
Node 0: xxx.xxx.xxx.0
ipaddr:29.48.23.147
ipaddr:29.48.20.164
ipaddr:29.48.123.187
ipaddr:29.48.14.255
ipaddr:29.48.95.22
ipaddr:29.48.133.220
ipaddr:29.48.24.29
ipaddr:29.48.153.249
Node 1: xxx.xxx.xxx.1
ipaddr:29.48.142.157
ipaddr:29.48.38.39
ipaddr:29.48.117.52
ipaddr:29.48.176.227
ipaddr:29.48.126.254
ipaddr:29.48.68.8
ipaddr:29.48.0.167
ipaddr:29.48.87.234
Node 2: xxx.xxx.xxx.2
Node 3: xxx.xxx.xxx.3
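The pairwise tests that follow need the bare addresses of each node's NPUs. A minimal sketch of how the `ipaddr:` lines above could be turned into a lookup table; the function names are hypothetical, and the NPU index assumes the lines arrive in device order 0..7, as they do with the `for` loop above.

```shell
# Extract bare IP addresses from lines shaped like "ipaddr:29.48.23.147".
# Reads text on stdin and prints one address per line.
extract_npu_ips() {
    sed -n 's/^[[:space:]]*ipaddr:[[:space:]]*\([0-9.]*\).*$/\1/p'
}

# Prefix each address with a node label and an NPU index.
# Assumption: input lines are in device order, so line 1 is npu0.
# $1: node label (any string you choose, e.g. node0)
label_npu_ips() {
    label=$1
    extract_npu_ips | awk -v node="$label" '{ printf "%s\tnpu%d\t%s\n", node, NR - 1, $0 }'
}
```

On a node, the loop output can be piped straight in, for example `for i in {0..7}; do hccn_tool -i $i -ip -g; done | label_npu_ips node0`.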
Cross-node NPU interconnect test. Note that every pair of nodes must be tested:
# Execute on the target node (replace with actual IP)
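hccn_tool -i 0 -ping -g address 10.20.0.20
Example: on node 0 (xxx.xxx.xxx.0), test the link from npu0 to npu0 of node 1 (xxx.xxx.xxx.1, ipaddr:29.48.142.157):
hccn_tool -i 0 -ping -g address 29.48.142.157
Output like the following indicates that the cross-node NPU interconnect is working: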
device 0 PING 29.48.142.157
recv seq=0,time=1.896000ms
recv seq=1,time=0.078000ms
recv seq=2,time=0.075000ms
3 packets transmitted, 3 received, 0.00% packet loss
docker pull quay.io/ascend/vllm-ascend:v0.9.1-dev
git lfs install
git lfs clone https://gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1.git
Verify the MD5 checksums of the model files to make sure nothing was corrupted during transfer.
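One way the MD5 check could be scripted is sketched below. It assumes a checksum manifest in the standard `md5sum` format (`<md5>  <relative path>` per line) is available. The manifest name used in the usage note is hypothetical, since the source does not say where the reference checksums are published.

```shell
# Verify every file listed in an md5sum-format manifest.
# $1: model directory
# $2: manifest path, absolute or relative to the model directory
#     (hypothetical file; the source does not name one)
# Returns non-zero if any listed file is missing or its checksum differs.
verify_model_md5() {
    model_dir=$1
    manifest=$2
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$model_dir" && md5sum -c "$manifest" )
}
```

For example, `verify_model_md5 /root/.cache/models/hf_models/openPangu-Ultra-MoE-718B-V1.1 checksums.md5` prints one OK or FAILED line per file.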
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1-dev
export NAME=openpangu-718b-moe
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/nvme0n1/user:/root/.cache \
-it $IMAGE bash
Download vllm-ascend (v0.9.2rc1) and use it to replace the vllm-ascend code built into the image (/vllm-workspace/vllm-ascend/). For example, download the Source code (tar.gz) asset, v0.9.2rc1.tar.gz, into the /root/.cache/script/ directory.
Launch script for serving on 4 nodes with 32 NPUs. Set MASTER_NODE_IP, the NODE_RANK of the current node, and the model path LOCAL_CKPT_DIR before running it:
#!/bin/bash
pip install --no-deps vllm==0.9.2 pybase64==1.4.1
tar -zxvf /root/.cache/script/v0.9.2rc1.tar.gz -C /vllm-workspace/vllm-ascend/ --strip-components=1
export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
yes | cp -r /root/.cache/models/hf_models/openPangu-Ultra-MoE-718B-V1.1/inference/vllm_ascend/* /vllm-workspace/vllm-ascend/vllm_ascend/
if [ -e .torchair_cache ]; then
rm -rf .torchair_cache
fi
# local_ip is the first address reported by hostname -I; it can be confirmed with ifconfig
# nic_name is the network interface name corresponding to local_ip
local_ip=`hostname -I | cut -d' ' -f1`
nic_name=$(ifconfig | grep -B 1 "$local_ip" | head -n 1 | awk '{print $1}' | sed 's/://')
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
MASTER_NODE_IP=xxx.xxx.xxx.xxx # master/head node ip
NODE_RANK=xxx # current node rank (0~3)
NUM_NODES=4 # number of nodes
NUM_NPUS_LOCAL=8 # number of NPUs per node
DATA_PARALLEL_SIZE_LOCAL=1 # DP size per node
LOCAL_CKPT_DIR=/root/.cache/models/hf_models/openPangu-Ultra-MoE-718B-V1.1
HOST=0.0.0.0
if [[ $NODE_RANK -ne 0 ]]; then
headless="--headless"
else
headless=""
fi
vllm serve $LOCAL_CKPT_DIR \
--host $HOST \
--port 8004 \
--data-parallel-size $((NUM_NODES*DATA_PARALLEL_SIZE_LOCAL)) \
--data-parallel-size-local $DATA_PARALLEL_SIZE_LOCAL \
--data-parallel-start-rank $((DATA_PARALLEL_SIZE_LOCAL*NODE_RANK)) \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size $((NUM_NPUS_LOCAL/DATA_PARALLEL_SIZE_LOCAL)) \
--seed 1024 \
--served-model-name pang_ultra_moe \
--enable-expert-parallel \
--max-num-seqs 4 \
--max-model-len 6000 \
--max-num-batched-tokens 6000 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
${headless} \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
The Pangu Ultra MoE model can be quantized with the open-source quantization framework ModelSlim. The model currently supports W8A8 weight-activation quantization.
Enter the msit/msmodelslim directory (cd msit/msmodelslim), then run the installation script there and install the accelerate dependency:
bash install.sh
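pip install accelerate
Enter the msmodelslim/example/Pangu directory and run the following quantization command:
python3 quant_pangu_ultra_moe_w8a8.py --model_path {path to floating-point weights} --save_path {path to W8A8 quantized weights} --dynamic
Compared with the BF16 model, the config.json of the int8 quantized model gains the following fields: "mla_quantize": "w8a8", "quantize": "w8a8_dynamic"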
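Before serving the quantized weights, it may be worth confirming that the two fields actually landed in config.json. The sketch below does a plain text match with `grep`, not real JSON parsing; the function name is hypothetical.

```shell
# Return 0 if the given config.json mentions both quantization fields
# with the values named above, and non-zero otherwise.
# This is a text-level check, not a JSON parser.
# $1: path to config.json
has_w8a8_fields() {
    config=$1
    grep -Eq '"mla_quantize"[[:space:]]*:[[:space:]]*"w8a8"' "$config" &&
    grep -Eq '"quantize"[[:space:]]*:[[:space:]]*"w8a8_dynamic"' "$config"
}
```

Usage could look like `has_w8a8_fields {path to W8A8 quantized weights}/config.json && echo "quantization fields present"`.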
Compared with the BF16 launch, add --quantization ascend:
pip install --no-deps vllm==0.9.2 pybase64==1.4.1
tar -zxvf /root/.cache/script/v0.9.2rc1.tar.gz -C /vllm-workspace/vllm-ascend/ --strip-components=1
export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
yes | cp -r /root/.cache/models/hf_models/openPangu-Ultra-MoE-718B-V1.1/inference/vllm_ascend/* /vllm-workspace/vllm-ascend/vllm_ascend/
if [ -e .torchair_cache ]; then
rm -rf .torchair_cache
fi
# local_ip is the first address reported by hostname -I; it can be confirmed with ifconfig
# nic_name is the network interface name corresponding to local_ip
local_ip=`hostname -I | cut -d' ' -f1`
nic_name=$(ifconfig | grep -B 1 "$local_ip" | head -n 1 | awk '{print $1}' | sed 's/://')
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
MASTER_NODE_IP=xxx.xxx.xxx.xxx # master/head node ip
NODE_RANK=xxx # current node rank (0~1)
NUM_NODES=2 # number of nodes
NUM_NPUS_LOCAL=8 # number of NPUs per node
DATA_PARALLEL_SIZE_LOCAL=1 # DP size per node
LOCAL_CKPT_DIR=/root/.cache/models/hf_models/openPangu-Ultra-MoE-718B-V1.1-w8a8
HOST=0.0.0.0
if [[ $NODE_RANK -ne 0 ]]; then
headless="--headless"
else
headless=""
fi
vllm serve $LOCAL_CKPT_DIR \
--host $HOST \
--port 8004 \
--data-parallel-size $((NUM_NODES*DATA_PARALLEL_SIZE_LOCAL)) \
--data-parallel-size-local $DATA_PARALLEL_SIZE_LOCAL \
--data-parallel-start-rank $((DATA_PARALLEL_SIZE_LOCAL*NODE_RANK)) \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size $((NUM_NPUS_LOCAL/DATA_PARALLEL_SIZE_LOCAL)) \
--seed 1024 \
--served-model-name pang_ultra_moe \
--enable-expert-parallel \
--max-num-seqs 8 \
--max-model-len 32768 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.96 \
--quantization ascend \
${headless} \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
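Both launch scripts derive the parallel sizes from the same three variables. The arithmetic can be sketched as a small helper (the function name is hypothetical), which also helps to check that the resulting layout covers exactly the NPUs you have.

```shell
# Print the parallel layout implied by the launch-script variables.
# $1: NUM_NODES, $2: NUM_NPUS_LOCAL, $3: DATA_PARALLEL_SIZE_LOCAL
parallel_layout() {
    num_nodes=$1
    npus_local=$2
    dp_local=$3
    if [ "$dp_local" -le 0 ] || [ $((npus_local % dp_local)) -ne 0 ]; then
        echo "NUM_NPUS_LOCAL must be divisible by DATA_PARALLEL_SIZE_LOCAL" >&2
        return 1
    fi
    dp=$((num_nodes * dp_local))     # value passed to --data-parallel-size
    tp=$((npus_local / dp_local))    # value passed to --tensor-parallel-size
    echo "dp=$dp tp=$tp npus=$((dp * tp))"
}

parallel_layout 4 8 1   # BF16 script: dp=4 tp=8 npus=32
parallel_layout 2 8 1   # W8A8 script: dp=2 tp=8 npus=16
```

With DATA_PARALLEL_SIZE_LOCAL=1, each node hosts one data-parallel rank whose tensor-parallel group spans all 8 local NPUs, so the start rank passed to --data-parallel-start-rank is simply NODE_RANK.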
Send a test request, replacing xxx.xxx.xxx.xxx with MASTER_NODE_IP:
curl --location --request POST 'http://xxx.xxx.xxx.xxx:8004/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
"model": "pang_ultra_moe",
"messages": [
{"role": "user", "content": "请介绍下自己"}
]
}'
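When the prompt comes from a variable, hand-writing the JSON body gets fragile as soon as the text contains quotes. A minimal sketch of building the same request body in shell follows; the function names are hypothetical, and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
# A sketch only: control characters such as newlines are not handled.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a chat-completions request body for a single user message.
# $1: served model name, $2: user message
build_chat_payload() {
    model=$1
    content=$(json_escape "$2")
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' "$model" "$content"
}
```

The result can be passed to the request above with `--data-raw "$(build_chat_payload pang_ultra_moe "your prompt")"`.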