Ascend-SACT/Kimi-K2.5-W4A8

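Kimi-K2.5-W4A8 Deployment Guide: A2, Two Nodes

1. Overview

This document describes how to run inference for the Kimi-K2.5-W4A8 model on the A2 platform using the vLLM inference engine.

1.1. About the Kimi-K2.5-W4A8 model

Kimi K2.5 is an open-source, natively multimodal agentic model, built by continued pretraining of Kimi-K2-Base on roughly 15 trillion mixed vision and text tokens. It integrates vision and language understanding, provides advanced agentic capabilities, and supports both instant and thinking modes as well as conversational and agentic interaction paradigms.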

1.2. Runtime environment

Hardware

| Device | Cards | Model          |
|--------|-------|----------------|
| 910B3  | 16    | Kimi-K2.5-W4A8 |

Software

| Software  | Version |
|-----------|---------|
| CANN      | 8.5.1   |
| Python    | 3.11.14 |
| hdk       | 25.5.0  |
| torch     | 2.9.0   |
| torch-npu | 2.9.0   |

Note: pulling the v0.17.0rc1 image as described in step "1.1" of the image preparation section below is sufficient.
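
To confirm that a container matches the versions listed above, a comparison helper along the following lines may be useful. This is a minimal sketch: the function name `check_version` is hypothetical, and the example calls assume `python3` and `torch` are importable inside the container.

```shell
#!/bin/sh
# Hypothetical helper: report whether a reported version starts with the expected one.
# A prefix match is used so that build suffixes such as "+cpu" or ".post1" still pass;
# it is deliberately loose (for example, "2.9.01" would also match "2.9.0").
check_version() {
    name=$1
    expected=$2
    actual=$3
    case "$actual" in
        "$expected"*) echo "OK $name $actual" ;;
        *) echo "MISMATCH $name: expected $expected, got $actual"; return 1 ;;
    esac
}

# Example calls (commented out so that sourcing this file has no side effects):
# check_version Python 3.11.14 "$(python3 -c 'import platform; print(platform.python_version())')"
# check_version torch 2.9.0 "$(python3 -c 'import torch; print(torch.__version__)')"
```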

1. Prepare the image

1.1. Pull the base image

docker pull quay.io/ascend/vllm-ascend:v0.17.0rc1

1.2. Create the base container and enter it

docker run -itd --privileged --network=host \
    --name vllm-ascend_v0.17.0rc1 \
    --shm-size="200g" \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /home:/home \
    quay.io/ascend/vllm-ascend:v0.17.0rc1 bash

docker exec -it vllm-ascend_v0.17.0rc1 bash
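
The eight `--device /dev/davinciN` lines in the `docker run` command follow a fixed pattern. As a minimal sketch, a small helper could generate them for a node with a different number of cards. The name `davinci_device_flags` and the default of 8 cards are illustrative assumptions, not part of the original command.

```shell
#!/bin/sh
# Hypothetical helper: print "--device /dev/davinciN" for N = 0 .. count-1,
# separated by single spaces, on one line.
davinci_device_flags() {
    count=${1:-8}   # assumed default: 8 cards per node, as in the command above
    flags=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # ${flags:+ } adds a separating space only when flags is already non-empty
        flags="$flags${flags:+ }--device /dev/davinci$i"
        i=$((i + 1))
    done
    printf '%s\n' "$flags"
}
```

The result could be spliced into the command as `$(davinci_device_flags 8)`, left unquoted so that the shell splits it into separate arguments.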

2. Download the weights

modelscope download --model Eco-Tech/Kimi-K2.5-W4A8 --local_dir ./Kimi-K2.5-W4A8
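
The download command writes to a path relative to the current directory, while the serve commands in section 3 load `/home/models/Kimi-K2.5-W4A8`, so it may be worth checking the directory on each node before launching. The sketch below uses a hypothetical helper name, `check_model_dir`, and assumes the checkpoint has a top-level `config.json`, as Hugging Face-style checkpoints usually do.

```shell
#!/bin/sh
# Hypothetical helper: verify that a weights directory exists and contains config.json.
# The config.json check is an assumption about the checkpoint layout.
check_model_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -f "$dir/config.json" ]; then
        echo "no config.json in: $dir" >&2
        return 1
    fi
    echo "model directory looks usable: $dir"
}

# Example call (commented out so that sourcing this file has no side effects):
# check_model_dir /home/models/Kimi-K2.5-W4A8
```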

3. Start the service

Master node:

#!/bin/sh
export HCCL_IF_IP=71.10.29.127
export GLOO_SOCKET_IFNAME="enp67s0f0np0"
export TP_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=1024
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

vllm serve /home/models/Kimi-K2.5-W4A8 \
    --host 0.0.0.0 \
    --port 1025 \
    --quantization ascend \
    --served-model-name Kimi-K2.5 \
    --enable-prefix-caching \
    --allowed-local-media-path / \
    --trust-remote-code \
    --async-scheduling \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 0 \
    --data-parallel-address 71.10.29.127 \
    --data-parallel-rpc-port 2358 \
    --enable-expert-parallel \
    --mm-encoder-tp-mode 'data' \
    --mm-processor-cache-gb 2 \
    --mm-processor-cache-type shm \
    --max-num-seqs 128 \
    --max-model-len 32768 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.92 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,20,24,28,32,64,96,128]}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false},"enable_cpu_binding":true}'

Worker node:

#!/bin/sh
export HCCL_IF_IP=71.10.29.142
export GLOO_SOCKET_IFNAME="enp67s0f0np0"
export TP_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_SOCKET_IFNAME="enp67s0f0np0"
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_BUFFSIZE=1024
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD

vllm serve /home/models/Kimi-K2.5-W4A8 \
    --host 0.0.0.0 \
    --port 1025 \
    --headless \
    --quantization ascend \
    --served-model-name Kimi-K2.5 \
    --enable-prefix-caching \
    --allowed-local-media-path / \
    --trust-remote-code \
    --async-scheduling \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 1 \
    --data-parallel-address 71.10.29.127 \
    --data-parallel-rpc-port 2358 \
    --enable-expert-parallel \
    --mm-encoder-tp-mode 'data' \
    --mm-processor-cache-gb 2 \
    --mm-processor-cache-type shm \
    --max-num-seqs 128 \
    --max-model-len 32768 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.92 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,20,24,28,32,64,96,128]}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false},"enable_cpu_binding":true}'
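
The master and worker commands above are identical except for `HCCL_IF_IP`, the `--headless` flag, and `--data-parallel-start-rank`. One way to keep a single launch script is to isolate the flags that differ, as in this sketch. The function name `node_specific_args` and the role names are hypothetical, and `HCCL_IF_IP` still has to be set per node.

```shell
#!/bin/sh
# Hypothetical helper: print the vllm serve flags that differ between the two nodes.
# The values mirror the master and worker commands above.
node_specific_args() {
    case "$1" in
        master) echo "--data-parallel-start-rank 0" ;;
        worker) echo "--headless --data-parallel-start-rank 1" ;;
        *)
            echo "usage: node_specific_args master|worker" >&2
            return 1
            ;;
    esac
}
```

The shared flags would then be written once, with `$(node_specific_args worker)` (unquoted) appended on the worker node.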

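Notes:

1. Parallel strategy for the two-node A2 setup: DP2 + TP8 (2 x 8 = 16 cards).
2. HCCL_IF_IP is the service IP of the local node and GLOO_SOCKET_IFNAME is the name of its network interface; both can be obtained with ifconfig.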
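
On systems where `ifconfig` is not installed, the interface name for a given IP can usually be read from `ip -o -4 addr show` instead. The following is a minimal sketch under that assumption: the helper name `nic_for_ip` is hypothetical, and it expects the one-line-per-address output format of `ip -o`, in which the second field is the interface name and the fourth is the address with its prefix length.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print the
# name of the first interface whose IPv4 address equals the given IP.
nic_for_ip() {
    awk -v ip="$1" '{
        split($4, addr, "/")              # $4 looks like 71.10.29.127/24
        if (addr[1] == ip) { print $2; exit }
    }'
}

# Example call (commented out so that sourcing this file has no side effects):
# ip -o -4 addr show | nic_for_ip 71.10.29.127
```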

4. Quick verification

curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
    "model": "Kimi-K2.5",
    "messages": [
        {
            "role": "user",
            "content": "Hello, please introduce yourself."
        }
    ],
    "stream": false
}' http://localhost:1025/v1/chat/completions
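
The request above is refused until the server is listening on port 1025, which may take a while after launch. A polling loop such as the following sketch could be run first. The helper name `wait_for_vllm` and the retry defaults are hypothetical; it polls `/v1/models`, the model-listing route of the OpenAI-compatible server, which also shows the served model name.

```shell
#!/bin/sh
# Hypothetical helper: poll <base_url>/v1/models until it answers successfully
# or the retry budget is used up. Defaults (60 tries, 10 s apart) are assumptions.
wait_for_vllm() {
    base_url=$1
    retries=${2:-60}
    delay=${3:-10}
    attempt=1
    while [ "$attempt" -le "$retries" ]; do
        # -s: no progress output, -f: treat HTTP errors as failures
        if curl -sf "$base_url/v1/models" >/dev/null 2>&1; then
            echo "server is ready after $attempt attempt(s)"
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "server did not become ready after $retries attempt(s)" >&2
    return 1
}

# Example call (commented out so that sourcing this file has no side effects):
# wait_for_vllm http://localhost:1025
```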