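Step3 is a multimodal reasoning model built on a Mixture-of-Experts architecture, with 321 billion total parameters and 38 billion activated parameters. This project provides guidance for deploying Step3 inference in an Ascend environment.
This document describes how to deploy the model on bare metal with the vLLM-Ascend inference engine. The basic deployment information is as follows: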
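1) Before installation, make sure the Ascend NPU driver and firmware are correctly installed on the server. Run the following command to verify:
npu-smi info
If the NPU driver and firmware are missing or outdated, install or upgrade them by following:
https://ascend.github.io/docs/sources/ascend/quick_install.html
2) Run the following commands to enable IP forwarding on the host's Linux system so that the inference service inside the container can be reached from outside. The sed command only rewrites an existing net.ipv4.ip_forward=0 entry; if /etc/sysctl.conf has no such line, add net.ipv4.ip_forward=1 to it manually.
sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf
sysctl -p | grep net.ipv4.ip_forward
Download the STEP3 model weights to the data storage partition. On the server used here, /home is mounted on a dedicated data disk, so the examples below store all model files under /home/model: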
pip install modelscope
mkdir -p /home/model/step3
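modelscope download --model stepfun-ai/step3 --local_dir /home/model/step3
Using a prebuilt container image is the simpler approach; it applies when the vLLM and vLLM-Ascend versions that the model depends on have already been released. Check ascend/vllm-ascend on Quay to see whether a container image of the required vLLM-Ascend version exists. If no suitable image is available, follow "Install vLLM/vLLM-Ascend from source" in the next section instead. The STEP3 deployment described here depends on vLLM-Ascend 0.10.1rc1, and the quay.io registry already hosts an image for that version.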
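The availability check described above can also be scripted. The following is a minimal sketch, assuming Docker is installed on the host and the registry is reachable; the helper name image_available is hypothetical and not part of any tool mentioned in this guide. It relies on docker manifest inspect, which queries the registry without pulling the image.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the registry can serve a manifest
# for the given image reference (nothing is pulled).
image_available() {
    local image="$1"
    docker manifest inspect "$image" >/dev/null 2>&1
}

# Example (needs Docker and network access):
# image_available quay.io/ascend/vllm-ascend:v0.10.1rc1 && echo "image found"
```

If the check fails, the source-based installation is the fallback; a failure can also simply mean the registry is unreachable from the host.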
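1) Run the following command to pull the vLLM-Ascend container image:
docker pull quay.io/ascend/vllm-ascend:v0.10.1rc1
2) Create the container instance. Run the following command to create a container named "vllm-step3":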
docker run -itd -u 0 --ipc=host --privileged \
-e VLLM_USE_MODELSCOPE=True -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
--name vllm-step3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/model/step3:/model \
-v /opt:/opt \
-p 1025:1025 \
-it quay.io/ascend/vllm-ascend:v0.10.1rc1 bash
Note:
docker pull quay.io/ascend/cann:8.2.rc2-910b-ubuntu22.04-py3.11
docker run -itd -u 0 --ipc=host --privileged \
-e VLLM_USE_MODELSCOPE=True -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
--name vllm-builder \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/model/step3:/model \
-v /opt:/opt \
-p 1025:1025 \
-it quay.io/ascend/cann:8.2.rc2-910b-ubuntu22.04-py3.11 bash
docker exec -it vllm-builder bash
# Using apt-get with mirror
sed -i 's|ports.ubuntu.com|mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list
apt-get update -y && apt-get install -y gcc g++ cmake libnuma-dev wget git curl jq
# Or using yum
# yum update -y && yum install -y gcc g++ cmake numactl-devel wget git curl jq
# Config pip mirror
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# For torch-npu dev version or x86 machine
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
mkdir -p /workspace
cd /workspace
git clone --branch v0.10.1.1 https://github.com/vllm-project/vllm.git
cd /workspace
git clone --branch v0.10.1rc1 https://github.com/vllm-project/vllm-ascend.git
cd /workspace/vllm
pip install -r requirements/build.txt
VLLM_TARGET_DEVICE=empty pip install -v -e .
cd /workspace/vllm-ascend
pip install -r requirements.txt
pip install -e .
Note: the order is mandatory. Install vLLM first, then vLLM-Ascend.
pip list |grep -E 'torch|vllm|transformers'
apt-get clean
pip cache purge
rm -rf /root/.cache
docker commit vllm-builder vllm-ascend:v0.10.1rc1
docker run -itd -u 0 --ipc=host --privileged --shm-size 256g \
-e VLLM_USE_MODELSCOPE=True -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
--name vllm-step3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/model/step3:/model \
-v /opt:/opt \
-p 1025:1025 \
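-it vllm-ascend:v0.10.1rc1 bash
Note: set the container's shared memory to more than 128 GB, because inference with large-parameter models needs a large amount of shared memory.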
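To confirm that the shared-memory setting took effect, the size of /dev/shm can be inspected from inside the container. The sketch below is one way to do that, assuming the container mounts shared memory at /dev/shm (the Docker default); the helper names fs_size_gib and has_min_size_gib are hypothetical.

```shell
#!/bin/bash
# Print the total size, in whole GiB, of the filesystem holding the given path.
fs_size_gib() {
    local path="${1:-/dev/shm}"
    # -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks
    df -Pk "$path" | awk 'NR==2 { print int($2 / 1024 / 1024) }'
}

# Succeed only when the filesystem is at least the required size in GiB.
has_min_size_gib() {
    local path="$1" min_gib="$2" size
    size=$(fs_size_gib "$path") || return 1
    [ -n "$size" ] && [ "$size" -ge "$min_gib" ]
}

# Example, run inside the container:
# has_min_size_gib /dev/shm 128 || echo "shared memory is below 128 GiB" >&2
```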
1) On the host, run the following command to enter the container:
docker exec -it vllm-step3 bash
2) Create the inference service startup script /root/infer.sh with the following content:
#!/bin/bash
MODEL_PATH=
SERVICE_PORT=1025
MODEL_NAME=
TENSOR_PARALLEL=16
# Print a short help message (invoked by -h).
usage() {
    echo "Usage: $0 -m <model path> [-p <service port>] [-n <model name>] [-t <tensor parallel size>]"
}
while getopts ":p:m:t:n:h" opt; do
    case $opt in
        p)
            SERVICE_PORT="$OPTARG"
            echo "service port set with : $SERVICE_PORT"
            ;;
        m)
            MODEL_PATH="$OPTARG"
            echo "local model path set with : $MODEL_PATH"
            ;;
        n)
            MODEL_NAME="$OPTARG"
            echo "service name set with : $MODEL_NAME"
            ;;
        t)
            TENSOR_PARALLEL="$OPTARG"
            echo "Tensor Parallel set with : $TENSOR_PARALLEL"
            ;;
        h)
            usage
            exit 0
            ;;
        \?)
            echo "Error: invalid arg -$OPTARG" >&2
            exit 1
            ;;
        :)
            echo "Error: arg -$OPTARG needs a value" >&2
            exit 1
            ;;
    esac
done
args_error=
if [ "$MODEL_PATH" = "" ]; then
    echo "Error: missing required arg '-m <model path>'"
    args_error=1
fi
if [ "$SERVICE_PORT" = "" ]; then
    echo "Error: missing required arg '-p <service port>'"
    args_error=1
fi
if [ "$args_error" != "" ]; then
    exit 1
fi
# Default the served model name to the last component of the model path.
if [ "$MODEL_NAME" = "" ]; then
    MODEL_NAME=$(basename "${MODEL_PATH}")
fi
if [ "$TENSOR_PARALLEL" = "" ]; then
    TENSOR_PARALLEL=16
fi
unset HTTP_PROXY
unset HTTPS_PROXY
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver/:$LD_LIBRARY_PATH
export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_OP_EXPANSION_MODE=AIV
#export ASCEND_RT_VISIBLE_DEVICES=0
timestamp=$(date +"%Y-%m-%d %H:%M:%S")
VLLM_CMD="vllm serve ${MODEL_PATH} --max-model-len 32768 --port ${SERVICE_PORT} --served-model-name ${MODEL_NAME} -tp $TENSOR_PARALLEL --reasoning-parser step3 --enable-auto-tool-choice --tool-call-parser step3 --trust-remote-code --block-size 16 --gpu_memory_utilization 0.85 --no-enable-prefix-caching "
echo "----------------------------------------------------------"
echo -e "[$timestamp] Starting vLLM with command:\n $VLLM_CMD \n"
${VLLM_CMD} 2>&1
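The script above stores the whole command in a single string and relies on word splitting when it runs it, which breaks if the model path contains spaces. As an alternative sketch (not the original script's method), the same flags can be collected in a bash array so each value stays one word. The function name build_vllm_cmd and the array name VLLM_ARGS are hypothetical; the flags are copied unchanged from the script.

```shell
#!/bin/bash
# Collect the vllm command line in a global array, one element per word.
# $1: model path, $2: service port, $3: served model name, $4: tensor parallel size
build_vllm_cmd() {
    local model_path="$1" port="$2" name="$3" tp="$4"
    VLLM_ARGS=(
        vllm serve "$model_path"
        --max-model-len 32768
        --port "$port"
        --served-model-name "$name"
        -tp "$tp"
        --reasoning-parser step3
        --enable-auto-tool-choice
        --tool-call-parser step3
        --trust-remote-code
        --block-size 16
        --gpu_memory_utilization 0.85
        --no-enable-prefix-caching
    )
}

build_vllm_cmd "/model" 1025 step3 16
# Show the command with shell quoting applied, without running it.
printf '%q ' "${VLLM_ARGS[@]}"; echo
# To launch the server instead: "${VLLM_ARGS[@]}" 2>&1
```

Arrays require bash rather than plain sh, which matches the `#!/bin/bash` line the script already uses.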
3) Run the script to start STEP3 inference:
bash /root/infer.sh -m <model weights path> -n step3
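Loading a model of this size can take a while, so requests sent right after launch may fail with a connection error. The following sketch polls the server until it responds, assuming the OpenAI-compatible /v1/models route is served on the same port as /v1/chat/completions and that curl is installed; the function name wait_for_ready and its default timings are hypothetical.

```shell
#!/bin/bash
# Poll a URL until it returns a successful HTTP status or the timeout expires.
# $1: URL to probe, $2: timeout in seconds, $3: delay between attempts in seconds
wait_for_ready() {
    local url="$1" timeout="${2:-600}" interval="${3:-5}"
    local deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # -f makes curl exit non-zero on HTTP errors; -s hides progress output
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}

# Example, from a second shell while infer.sh is running:
# wait_for_ready http://127.0.0.1:1025/v1/models 1800 10 && echo "service is ready"
```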
4) Send an inference request to verify the service:
curl http://127.0.0.1:1025/v1/chat/completions \
-H "Content-Type: application/json" \
-d \
'{
"model": "step3",
"max_tokens": 256, "do_sample": true,
"messages": [
{
"role": "user",
"content": "你是谁"
}
]
}'
License: apache-2.0
Base model: