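This document builds on the vLLM-Ascend inference guide for the Kimi-K2.5 model and adds a deployment guide for the model's W4A8 quantized version, served with the vLLM inference engine on Ascend + x86 hardware. It uses the unified ascend-docker image so that users and developers can quickly deploy models and build applications on the Ascend platform.
Original inference guide: https://github.com/LoganJane/vllm-ascend/blob/main/README.md
ascend-docker: https://gitcode.com/Ascend-SACT/ascend-docker
Kimi-K2.5-W4A8 is the W4A8 quantized version of Kimi-K2.5, the open-source large model from Moonshot AI. It keeps the original model's core capabilities while reducing model size and improving inference efficiency for real deployments, which makes it an efficient, lightweight option for production environments.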
Hardware environment
| Hardware | Cards | Model |
|---|---|---|
| NPU: Ascend 910B2 / CPU: x86 | 16 | Kimi-K2.5-W4A8 |
Software environment
| Software | Version |
|---|---|
| CANN | 8.5.0 |
| Python | 3.11.14 |
| torch | 2.9.0+cpu |
| torch-npu | 2.9.0 |
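The versions in the table can be compared against what is actually installed in the container. The sketch below is one possible way to do that; `check_version` is a hypothetical helper, not part of ascend-docker or vLLM, and the commented usage assumes the packages were installed with pip under the distribution names `torch` and `torch-npu`.

```shell
# check_version NAME EXPECTED ACTUAL
# Prints "OK ..." and returns 0 when ACTUAL starts with EXPECTED
# (so 2.9.0 accepts 2.9.0+cpu); otherwise prints "MISMATCH ..." and returns 1.
check_version() {
    case "$3" in
        "$2"*) echo "OK $1 $3"; return 0 ;;
        *)     echo "MISMATCH $1: expected $2, found $3"; return 1 ;;
    esac
}

# Example usage inside the container (assumed package names):
#   check_version Python 3.11.14 "$(python -c 'import platform; print(platform.python_version())')"
#   check_version torch 2.9.0 "$(python -c 'import importlib.metadata as m; print(m.version("torch"))')"
#   check_version torch-npu 2.9.0 "$(python -c 'import importlib.metadata as m; print(m.version("torch-npu"))')"
```

Prefix matching is a deliberate choice here so that local build suffixes such as `+cpu` do not count as a mismatch.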
docker pull --platform=amd64 swr.cn-north-4.myhuaweicloud.com/ascend-sact/ascend-910b-ubuntu:v3.0
modelscope download --model Eco-Tech/Kimi-K2.5-W4A8 --local_dir xx/Kimi-K2.5-W4A8
# Download the vllm source code from the specified repository
# Option 1:
git clone -b main https://github.com/LoganJane/vllm.git
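# Option 2: if the command above times out, clone through the proxy instead:
git clone -b main https://gh-proxy.com/https://github.com/LoganJane/vllm.git
# Enter the vllm directory created by the clone
cd vllm
# Note: this vllm directory contains a nested vllm directory. Do not enter it; run the commands below from the directory that contains setup.py
# Activate the ascend-infer virtual environment
conda activate ascend-infer
# Uninstall any previously installed vllm
pip uninstall -y vllm
# Load the environment variables for the Ascend toolkits
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
# Install vllm, skipping verification
python setup.py install --force
If the installation fails with the output below, follow the workaround in section 8.1 (FAQ: installation failure).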
File "...../vllm-ascend/setup.py", line 88, in build
subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
File "/opt/mamba/envs/ascend-infer/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '..', '-j=92', '--target=vllm_ascend_c']' returned
During installation, oneDNN is downloaded from GitHub automatically. If the following message appears, the connection to GitHub failed; see section 8.2 (FAQ: unable to download oneDNN.git) for a workaround.
fatal: unable to access 'https://github.com/oneapi-src/oneDNN.git/': Failed to connect to github.com port 443 after 131074 ms: Couldn't connect to server
After installation, confirm that the files in the installation directory come from this build: check the directory's timestamp, which should match the current time.
Command: ll -d /opt/mamba/envs/ascend-infer/lib/python3.11/site-packages/vllm*
Output: drwxr-xr-x 33 root root 4096 Mar 5 11:55 /opt/mamba/envs/ascend-infer/lib/python3.11/site-packages/vllm/ .......
The timestamp of the vllm directory matches the current time.
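The timestamp check can also be scripted instead of read by eye. The following is a minimal sketch: `is_fresh` is a hypothetical helper, it relies on the `-maxdepth` and `-mmin` options of GNU find (an assumption about the container's find), and it tests the modification time, which is what `ll` displays.

```shell
# is_fresh PATH MINUTES
# Succeeds when PATH exists and was last modified less than MINUTES minutes ago.
is_fresh() {
    [ -e "$1" ] || return 1
    [ -n "$(find "$1" -maxdepth 0 -mmin "-$2" 2>/dev/null)" ]
}

# Example usage (the 30-minute window is an arbitrary choice):
#   is_fresh /opt/mamba/envs/ascend-infer/lib/python3.11/site-packages/vllm 30 \
#       && echo "vllm was installed by this build" \
#       || echo "vllm directory is stale or missing"
```

The same call works for the vllm-ascend installation directory once that package has been installed.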
# Download the vllm-ascend source code from the specified repository
# Option 1:
git clone -b main https://github.com/LoganJane/vllm-ascend.git
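# Option 2: if the command above times out, clone through the proxy instead:
git clone -b main https://gh-proxy.com/https://github.com/LoganJane/vllm-ascend.git
# Enter the vllm-ascend directory created by the clone
cd vllm-ascend
# Note: this vllm-ascend directory contains a nested vllm-ascend directory. Do not enter it; run the commands below from the directory that contains setup.py
# Activate the ascend-infer virtual environment
conda activate ascend-infer
# Uninstall any previously installed vllm-ascend
pip uninstall -y vllm-ascend
# Load the environment variables for the Ascend toolkits
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
# Install vllm-ascend, skipping verification
python setup.py install --force
If the installation fails with the output below, follow the workaround in section 8.1 (FAQ: installation failure).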
File "...../vllm-ascend/setup.py", line 88, in build
subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
File "/opt/mamba/envs/ascend-infer/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '..', '-j=92', '--target=vllm_ascend_c']' returned
During installation, oneDNN is downloaded from GitHub automatically. If the following message appears, the connection to GitHub failed; see section 8.2 (FAQ: unable to download oneDNN.git) for a workaround.
fatal: unable to access 'https://github.com/oneapi-src/oneDNN.git/': Failed to connect to github.com port 443 after 131074 ms: Couldn't connect to server
After installation, confirm that the files in the installation directory come from this build: check the directory's timestamp, which should match the current time.
Command: ll -d /opt/mamba/envs/ascend-infer/lib/python3.11/site-packages/vllm*
Output: drwxr-xr-x 33 root root 4096 Mar 5 12:05 /opt/mamba/envs/ascend-infer/lib/python3.11/site-packages/vllm-ascend/ .......
The timestamp of the vllm-ascend directory matches the current time.
Use vi to create the inference script infer_kimi-k2.5-w4a8.sh:
export VLLM_USE_V1=1
# Scheduling optimizations
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export TASK_QUEUE_ENABLE=1
# Communication optimizations
export HCCL_BUFFSIZE=1024
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_FLASHCOMM=0
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1
export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=1
export HCCL_INTRA_ROCE_ENABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=8
export ATB_LLM_HCCL_ENABLE=1
export ATB_OPERATION_EXECUTE_ASYNC=2
vllm serve /model-weights-download-directory \
--served-model-name kimi \
--host 0.0.0.0 \
--port 8016 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 1024 \
--trust-remote-code \
--quantization ascend \
--gpu-memory-utilization 0.85 \
--enable-prefix-caching \
--additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":true},"torchair_graph_config":{"enabled":true}}' \
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,16,32,64],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--kv-cache-dtype auto \
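--seed 1024
Save the script, then activate the environment and start the service:
conda activate ascend-infer
bash infer_kimi-k2.5-w4a8.sh
Once the service is up, send a test request:
curl http://0.0.0.0:8016/v1/chat/completions \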
-H "Content-Type: application/json" \
-d '{
"model": "kimi",
"messages": [
{"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
{"role": "user", "content": [{"type": "text", "text": "请介绍一下上海"}]}
],
"temperature": 0.6,
"max_tokens": 256
}'
FAQ 8.1: installation failure workaround
File "...../vllm-ascend/setup.py", line 88, in build
subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
File "/opt/mamba/envs/ascend-infer/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '..', '-j=92', '--target=vllm_ascend_c']' returned
Before re-running python setup.py install --force, clear the previous build artifacts:
cd xxx  # the vllm or vllm-ascend directory cloned earlier
rm -rf build csrc/build dist *.egg-info
find . -name "CMakeCache.txt" -delete
find . -name "CMakeFiles" -type d -exec rm -rf {} +
rm -rf /tmp/vllm*
FAQ 8.2: unable to download oneDNN.git
fatal: unable to access 'https://github.com/oneapi-src/oneDNN.git/': Failed to connect to github.com port 443 after 131074 ms: Couldn't connect to server
Use vi to add the following two GitHub IP entries to /etc/hosts:
140.82.112.4 github.com
199.232.69.194 github.global.ssl.fastly.net
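If the entries are added by script rather than by hand, it helps to avoid appending duplicates on repeated runs. The sketch below shows one way to do that; `has_host` and `add_host_entry` are hypothetical helpers, and writing to /etc/hosts is assumed to require root.

```shell
# has_host FILE HOSTNAME
# Succeeds when a non-comment line in FILE already maps HOSTNAME.
has_host() {
    awk -v h="$2" '
        $1 !~ /^#/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break      # ignore trailing comments
                if ($i == h) found = 1
            }
        }
        END { exit !found }' "$1"
}

# add_host_entry FILE IP HOSTNAME
# Appends "IP HOSTNAME" to FILE unless HOSTNAME is already mapped there.
add_host_entry() {
    if has_host "$1" "$3" 2>/dev/null; then
        echo "skip: $3 already mapped in $1"
    else
        printf '%s %s\n' "$2" "$3" >> "$1"
        echo "added: $2 $3"
    fi
}

# Example usage (run as root):
#   add_host_entry /etc/hosts 140.82.112.4 github.com
#   add_host_entry /etc/hosts 199.232.69.194 github.global.ssl.fastly.net
```

The IP addresses are the ones listed in this guide; GitHub's addresses can change, so an existing mapping is left untouched rather than overwritten.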
If oneDNN still cannot be downloaded after adding the github.com entries to /etc/hosts, edit vllm/cmake/cpu_extension.cmake and change the oneDNN download URL from GitHub to Gitee:
vi vllm/cmake/cpu_extension.cmake
Replace GIT_REPOSITORY https://github.com/oneapi-src/oneDNN.git with GIT_REPOSITORY https://gitee.com/wylucky/oneDNN.git
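The same substitution can be made without opening an editor. This is a minimal sketch using sed; `switch_onednn_repo` is a hypothetical helper, the two URLs are the ones given above, and the file path in the usage comment assumes the command is run from the directory that contains the cloned vllm repository.

```shell
# switch_onednn_repo FILE
# Replaces the GitHub oneDNN URL in FILE with the Gitee mirror and keeps
# the original as FILE.bak. Returns 1 if the GitHub URL is not present.
switch_onednn_repo() {
    old='https://github.com/oneapi-src/oneDNN.git'
    new='https://gitee.com/wylucky/oneDNN.git'
    if ! grep -qF "$old" "$1"; then
        echo "no GitHub oneDNN URL found in $1" >&2
        return 1
    fi
    sed -i.bak "s#$old#$new#g" "$1"
}

# Example usage:
#   switch_onednn_repo vllm/cmake/cpu_extension.cmake
```

Keeping the `.bak` copy makes it easy to restore the original URL once GitHub is reachable again.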