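This document describes how to run inference for the Kimi-K2.5-W4A8 model on the A2 platform using the vLLM inference engine.
Kimi K2.5 is an open-source, native multimodal agentic model built by continued pretraining of Kimi-K2-Base on approximately 15 trillion mixed visual and text tokens. It integrates vision and language understanding, provides advanced agentic capabilities, and supports both instant and thinking modes as well as conversational and agentic interaction paradigms.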
Hardware environment
| NPU model | Number of cards | Model |
|---|---|---|
| 910B3 | 16 | Kimi-K2.5-W4A8 |
Software environment
| Software | Version |
|---|---|
| CANN | 8.5.1 |
| Python | 3.11.14 |
| hdk | 25.5.0 |
| torch | 2.9.0 |
| torch-npu | 2.9.0 |
Note: as described in "1.1", pulling the v0.17.0rc1 image directly is sufficient.
docker pull quay.io/ascend/vllm-ascend:v0.17.0rc1
docker run -itd --privileged --network=host \
--name vllm-ascend_v0.17.0rc1 \
--shm-size="200g" \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/home \
quay.io/ascend/vllm-ascend:v0.17.0rc1 bash
docker exec -it vllm-ascend_v0.17.0rc1 bash
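The eight `--device /dev/davinciN` flags in the `docker run` command above follow a fixed pattern, so they can also be generated rather than typed out. The following is a minimal sketch; `device_flags` and `NPU_FLAGS` are hypothetical names introduced here for illustration, not part of Docker or vllm-ascend.

```shell
#!/bin/sh
# device_flags is a hypothetical helper (not part of Docker or vllm-ascend).
# It prints one "--device /dev/davinciN " flag per NPU, for N = 0 .. count-1.
device_flags() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s /dev/davinci%d ' '--device' "$i"
        i=$((i + 1))
    done
}

# Each node in this guide maps 8 NPUs (davinci0 .. davinci7) into the container.
NPU_FLAGS="$(device_flags 8)"
echo "$NPU_FLAGS"
```

To use it, place `$NPU_FLAGS` unquoted in the `docker run` command in place of the eight explicit `--device /dev/davinciN` lines, so that the shell splits it back into separate arguments.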
modelscope download --model Eco-Tech/Kimi-K2.5-W4A8 --local_dir ./Kimi-K2.5-W4A8
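The download command above writes the weights to `./Kimi-K2.5-W4A8` relative to the current directory, while the launch scripts below load them from `/home/models/Kimi-K2.5-W4A8`. The sketch below checks that the directory passed to `vllm serve` actually holds the model before starting the service. `check_model_dir` is a hypothetical helper, and using `config.json` as the marker file is an assumption about the layout of the downloaded repository.

```shell
#!/bin/sh
# check_model_dir is a hypothetical helper. It succeeds only when the given
# directory exists and contains a config.json (assumed marker of a model dir).
check_model_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -f "$dir/config.json" ]; then
        echo "no config.json in: $dir" >&2
        return 1
    fi
    echo "model directory looks usable: $dir"
}

# Default to the path used by the launch scripts in this guide.
MODEL_DIR="${MODEL_DIR:-/home/models/Kimi-K2.5-W4A8}"
check_model_dir "$MODEL_DIR" \
    || echo "download or move the weights to $MODEL_DIR before running vllm serve" >&2
```

Running the download from `/home/models`, or passing `--local_dir /home/models/Kimi-K2.5-W4A8`, keeps the two paths consistent.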
Master node:
#!/bin/sh
export HCCL_IF_IP=71.10.29.127
export GLOO_SOCKET_IFNAME="enp67s0f0np0"
export TP_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=1024
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /home/models/Kimi-K2.5-W4A8 \
--host 0.0.0.0 \
--port 1025 \
--quantization ascend \
--served-model-name Kimi-K2.5 \
--enable-prefix-caching \
--allowed-local-media-path / \
--trust-remote-code \
--async-scheduling \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address 71.10.29.127 \
--data-parallel-rpc-port 2358 \
--enable-expert-parallel \
--mm-encoder-tp-mode 'data' \
--mm-processor-cache-gb 2 \
--mm_processor_cache_type="shm" \
--max-num-seqs 128 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,20,24,28,32,64,96,128]}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false},"enable_cpu_binding":true}'
Worker node:
#!/bin/sh
export HCCL_IF_IP=71.10.29.142
export GLOO_SOCKET_IFNAME="enp67s0f0np0"
export TP_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=1024
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /home/models/Kimi-K2.5-W4A8 \
--host 0.0.0.0 \
--port 1025 \
--headless \
--quantization ascend \
--served-model-name Kimi-K2.5 \
--enable-prefix-caching \
--allowed-local-media-path / \
--trust-remote-code \
--async-scheduling \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address 71.10.29.127 \
--data-parallel-rpc-port 2358 \
--enable-expert-parallel \
--mm-encoder-tp-mode 'data' \
--mm-processor-cache-gb 2 \
--mm_processor_cache_type="shm" \
--max-num-seqs 128 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,20,24,28,32,64,96,128]}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false},"enable_cpu_binding":true}'
Notes:
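1. Parallel strategy for the two A2 nodes: DP2 + TP8 (--data-parallel-size 2 across the two nodes, --tensor-parallel-size 8 within each node).
2. HCCL_IF_IP is the service IP of the local node and GLOO_SOCKET_IFNAME is the name of its network interface; both can be obtained with ifconfig.
Once both nodes are up, send a test request to the master node. The "model" field must match the value of --served-model-name:
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
"model": "Kimi-K2.5",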
"messages": [
{
"role": "user",
"content": "你好,请介绍一下自己"
}
],
"stream": false
}' http://localhost:1025/v1/chat/completions
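Note 2 above says that HCCL_IF_IP and the interface name are read off ifconfig by hand. As an alternative, the sketch below derives the IP from the interface name. It swaps in `ip -o -4 addr show` from iproute2 instead of ifconfig because its one-line-per-address output is easier to parse. `ipv4_of_iface` and `IFNAME` are hypothetical names, and the default interface name is simply the one used in the scripts above.

```shell
#!/bin/sh
# ipv4_of_iface is a hypothetical helper. It reads `ip -o -4 addr show` output
# on stdin and prints the first IPv4 address (without the prefix length)
# assigned to the named interface. It prints nothing if there is no match.
ipv4_of_iface() {
    awk -v ifname="$1" \
        '$2 == ifname && $3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Interface name taken from the launch scripts; override via the environment.
IFNAME="${IFNAME:-enp67s0f0np0}"
if command -v ip >/dev/null 2>&1; then
    HCCL_IF_IP="$(ip -o -4 addr show | ipv4_of_iface "$IFNAME")"
    echo "HCCL_IF_IP=${HCCL_IF_IP:-<not found>}"
fi
```

On each node the printed value should match the HCCL_IF_IP exported in that node's script; on the master node it should also match the address passed to --data-parallel-address.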