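GLM-4.7 is Zhipu's latest flagship model. It strengthens coding ability, long-horizon task planning, and tool coordination for Agentic Coding scenarios, and it achieved leading results among open-source models on the then-current leaderboards of several public benchmarks. General capabilities are improved: responses are more concise and natural, and writing is more immersive. When executing complex agent tasks, the model follows instructions more reliably during tool calls, and the front-end aesthetics and long-horizon task completion efficiency of Artifacts and Agentic Coding are further improved.

ModelScope download link: https://www.modelscope.cn/models/Eco-Tech/GLM-4.7-W8A8
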
| Component | Version |
|---|---|
| Hardware | A3, 8 cards |

| Component | Version |
|---|---|
| vllm-ascend | 0.14.0.rc1 |
| HDK | Ascend HDK 25.2.3 |
| CANN | 8.5.0 |
| Model | GLM 4.7 |

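| Optimization |
|---|
| Full Decode Only graph mode |
| Asynchronous scheduling |
| Pipeline optimization |
| High-performance memory allocator |
| Partitioning strategy adjustment |
| Add MTP |
| FIA operator with fd support |
| qkv_rmsnorm_partial_rope fused operator |
| mul_add fused operator |
| Disable the gmmswigluquant fused operator |
| BALANCE_SCHEDULING feature |
| Large fused MoE operator |
| Fine-grained CPU core binding |
| Shared-expert multi-stream, shared-expert DP |
| FlashComm feature |
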
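1. Based on v0.14.0rc1-a3-openeuler.tar.gz
2. Reinstall vllm-ascend:
pip uninstall vllm-ascend
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git checkout d1dcdfc4084825d2d8f6ff39f1e69767e5f88c40
pip install -v -e .
cd ..

Single-operator mode has to dispatch operators frequently, which creates a host-side bottleneck. To alleviate this, ACL Graph mode can be used: the graph is captured once and replayed many times, which reduces CPU and framework scheduling overhead and improves throughput.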
pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.14.1
python -m vllm.entrypoints.openai.api_server \
--model=XXX/GLM-4.7 \
--served-model-name GLM-4.7 \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 16 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
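--gpu-memory-utilization 0.95

The log shows that the default graph capture sizes are [512, 448, 384, 312, 248, 184, 112, 48, 1] and that the default graph mode is PIECEWISE (piecewise graph mode). Both can be tuned.
The cudagraph_capture_sizes parameter inside --compilation-config controls which batch sizes are captured, so it should cover the shapes that matter for the workload. In this tuning case, the target concurrency of 16 has to be added to the cudagraph_capture_sizes array. In measurements, this improved throughput by 8% over the default capture sizes.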
pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.14.1
python -m vllm.entrypoints.openai.api_server \
--model=XXX/GLM-4.7 \
--served-model-name GLM-4.7 \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 8 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,16,24,48]}' \
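--gpu-memory-utilization 0.95

The default graph mode is PIECEWISE. Full Graph mode has supported FULL_DECODE_ONLY since v0.11.0, so setting --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,24,48]}' enables FULL_DECODE_ONLY graph mode and further improves inference performance. In measurements, throughput improved by 25% over PIECEWISE graph mode.
The log confirms that FULL_DECODE_ONLY graph mode was enabled successfully.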
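The --compilation-config value is a JSON string, so it can be generated rather than typed by hand. The following is a minimal sketch, not part of the original tuning scripts: the helper name build_compilation_config is hypothetical, and it only assumes the two keys used in this document (cudagraph_mode and cudagraph_capture_sizes).

```python
import json


def build_compilation_config(capture_sizes, target_concurrency,
                             mode="FULL_DECODE_ONLY"):
    """Return a JSON string suitable for --compilation-config.

    The target concurrency is added to the capture sizes when it is
    missing, so the batch size that matters for the workload is captured.
    """
    sizes = sorted(set(capture_sizes) | {target_concurrency})
    return json.dumps({"cudagraph_mode": mode,
                       "cudagraph_capture_sizes": sizes})


# Hypothetical starting list that lacks the target concurrency of 16.
config = build_compilation_config([1, 2, 4, 8, 24, 48], target_concurrency=16)
print(config)
```

The printed string can be passed in single quotes after --compilation-config.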
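The asynchronous scheduling feature (--async-scheduling) reduces the idle gaps between consecutive tokens during inference, which improves overall inference performance. To enable it, add the --async-scheduling option when starting the inference service.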
pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.14.1
python -m vllm.entrypoints.openai.api_server \
--model=XXX/GLM-4.7 \
--served-model-name GLM-4.7 \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 8 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,24,48]}' \
--gpu-memory-utilization 0.95

Performance gain: with asynchronous scheduling enabled, TPOT decreased by 0.7 ms compared with before.
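For reference, TPOT (time per output token) is commonly computed as the decode time divided by the number of tokens generated after the first one. The sketch below assumes that definition; the function name and the sample numbers are made up for illustration and are not measurements from this tuning run.

```python
def tpot_ms(e2e_latency_s, ttft_s, output_tokens):
    """Time per output token in milliseconds.

    Assumes TPOT = (end-to-end latency - time to first token)
                   / (output tokens - 1).
    """
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_s - ttft_s) / (output_tokens - 1) * 1000.0


# Made-up request: 10.5 s end to end, 0.5 s to first token, 201 tokens.
example_tpot = tpot_ms(e2e_latency_s=10.5, ttft_s=0.5, output_tokens=201)
print(f"TPOT: {example_tpot:.2f} ms")
```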
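The TASK_QUEUE_ENABLE environment variable controls whether the task_queue operator dispatch queue is enabled and at which optimization level.
• Set to "0": disables the task_queue operator dispatch queue optimization.
• Set to "1" or left unset: enables Level 1 optimization. Level 1 enables the task_queue operator dispatch queue optimization and splits operator dispatch into two stages. Part of the work (mainly aclnn operator calls) moves to a newly added second-level pipeline. The first- and second-level pipelines pass tasks through an operator queue and run in parallel, and this partial overlap reduces total dispatch time and improves end-to-end performance.
• Set to "2": enables Level 2 optimization. Level 2 includes the Level 1 optimization and further balances the load between the first- and second-level pipelines, mainly by moving workspace-related tasks to the second-level pipeline. The overlap is better and the performance gain is larger. This setting only takes effect in binary scenarios, and Level 2 is the recommended value.
Details: https://www.hiascend.com/document/detail/zh/Pytorch/600/apiref/Envvariables/Envir_019.html
How to enable:
export TASK_QUEUE_ENABLE=1

jemalloc is a memory allocator. Compared with traditional allocators (for example, glibc), its main advantages are less memory fragmentation and more efficient allocation under highly concurrent multithreaded workloads, which makes better use of multiple cores. During memory allocation, locks make threads wait and hurt performance significantly. jemalloc avoids lock contention by using thread-local variables: each thread has its own memory manager, and allocation completes within that thread without competing with other threads for locks.
Reference: https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/softwareinst/instg/instg_0099.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit
How to enable:
export LD_PRELOAD=/usr/local/Ascend/ascend-toolkit/latest/lib64/libjemalloc.so

Enabling HCCL AIV mode means that the communication algorithm is orchestrated and expanded on the device-side Vector Core, and it is also executed on the Vector Core.
Reference: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850alpha001/maintenref/envvar/envref_07_0096.html
Performance gain: with AIV mode enabled, TPOT decreased by 8 ms and throughput improved by 36%.
How to enable:
export HCCL_OP_EXPANSION_MODE="AIV"

Changing the parallel configuration from TP16 to TP8 DP2 improves performance further.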
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_DETERMINISTIC="true"
export HCCL_OP_EXPANSION_MODE=AIV
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_NZ=2
vllm serve /disk1/xxx/GLM-4.7-W8A8 \
--max-model-len 131072 \
--quantization ascend \
--enable-expert-parallel \
--port 8262 \
--served-model-name GLM-4.7-w8a8 \
--reasoning-parser glm45 \
--trust-remote-code \
--gpu_memory_utilization 0.9 \
--async-scheduling \
--max-num-seqs 64 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,4,8,16,32,48,64]}' \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--additional-config '{"cudagraph_mode":"FULL_DECODE_ONLY","ascend_scheduler_config":{"enabled":false},"enable_multistream_moe":false,"chunked_prefill_for_mla":true,"enable_weight_nz_layout":true}'

git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh

In the msit/msmodelslim/example/DeepSeek/ folder, follow add_safetensors.py to copy the MTP weights into the quantized weight directory. Afterwards, config.json must also be updated to a new quantization_config that includes MTP.
cd /msit/msmodelslim/example/DeepSeek
from add_safetensors import add_safetensors
add_safetensors(org_paths="/disk1/models/GLM-4.7", target_dir="/disk1/models/GLM-4.7-W8A8/", safetensors_prefix="mtp_float",
max_file_size_gb=5, prefix="model.layers.92.")

Merge the quantization_config from quant_model_description.json into config.json.
Add a configuration merge script, configmerge.py:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import json
import sys

# Paths are specified directly
INPUT_DIR = "/disk1/models/GLM-4.7-W8A8"  # change to your model directory
OUTPUT_FILE = "/disk1/models/GLM-4.7-W8A8/config.json"  # change to the desired output file path


def merge_configs():
    """
    Merge the configuration files:
    1. Read quant_model_description.json
    2. Merge its contents into the quantization_config section of config.json
    3. Save config.json
    """
    # Build the file paths
    config_path = os.path.join(INPUT_DIR, "config.json")
    quant_desc_path = os.path.join(INPUT_DIR, "quant_model_description.json")

    # Check that both files exist
    if not os.path.exists(config_path):
        print(f"Error: config file does not exist: {config_path}")
        return False
    if not os.path.exists(quant_desc_path):
        print(f"Error: quantization description file does not exist: {quant_desc_path}")
        return False

    try:
        # Read config.json
        with open(config_path, 'r', encoding='utf-8') as f:
            config_data = json.load(f)
        # Read quant_model_description.json
        with open(quant_desc_path, 'r', encoding='utf-8') as f:
            quant_desc_data = json.load(f)

        # Make sure config.json has a quantization_config field
        if "quantization_config" not in config_data:
            config_data["quantization_config"] = {}
        # Merge the quantization description into it
        config_data["quantization_config"].update(quant_desc_data)
        # Make sure the required top-level field is present
        if "moe_quantize" not in config_data:
            config_data["moe_quantize"] = "w8a8_dynamic"

        # Save the new configuration file
        with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
            json.dump(config_data, f, indent=4)
        print(f"Success: configuration merged and saved to {OUTPUT_FILE}")
        return True
    except Exception as e:
        print(f"Error: exception while processing the configuration files: {e}")
        return False


def main():
    # Run the merge and exit non-zero on failure
    success = merge_configs()
    if not success:
        sys.exit(1)


if __name__ == "__main__":
    main()
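After running the merge, it can be useful to confirm that every entry of quant_model_description.json actually landed in config.json. The following is a minimal sketch of such a check; the helper name unmerged_keys is hypothetical, and the keys written to the temporary files are made-up placeholders, not real GLM-4.7 entries.

```python
import json
import os
import tempfile

_MISSING = object()


def unmerged_keys(config_path, quant_desc_path):
    """Return quant-description keys absent from, or different in,
    the quantization_config section of config.json."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    with open(quant_desc_path, encoding="utf-8") as f:
        quant_desc = json.load(f)
    merged = config.get("quantization_config", {})
    return sorted(key for key, value in quant_desc.items()
                  if merged.get(key, _MISSING) != value)


# Demonstration on temporary files with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    cfg = os.path.join(tmp, "config.json")
    desc = os.path.join(tmp, "quant_model_description.json")
    with open(cfg, "w", encoding="utf-8") as f:
        json.dump({"quantization_config": {"layer_a": "W8A8"}}, f)
    with open(desc, "w", encoding="utf-8") as f:
        json.dump({"layer_a": "W8A8", "layer_b": "FLOAT"}, f)
    leftover = unmerged_keys(cfg, desc)

print(leftover)
```

An empty list means the merge is complete. For the real model, point both paths at the files in the quantized weight directory.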
https://github.com/vllm-project/vllm-ascend/pull/6019
https://github.com/vllm-project/vllm/pull/33423
"ascend_fusion_config": {"fusion_ops_gmmswigluquant": false}

Replace the .so packages and the FIA operator package.
.so packages to replace: libopmaster_ct.so, libopmaster_rt2.0.so, liboptiling.so. The tiling-side .so files are replaced under essentially the same path.
For fused_infer_attention_score, the kernel path in 8.5 is /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/kernel/ascend910_93/ops_transformer/
mv /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/libopmaster_ct.so /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/libopmaster_ct.so_bak
mv /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/liboptiling.so /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/liboptiling.so_bak
mkdir -p /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/kernel/ascend910b/ops_transformer/fused_infer_attention_score_bak
mv /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/kernel/ascend910b/ops_transformer/fused_infer_attention_score /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/kernel/ascend910b/ops_transformer/fused_infer_attention_score_bak
cp /mnt/project/g9/glm4.7/FIA/op_tiling_aarch64/libopmaster_ct.so /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/
cp /mnt/project/g9/glm4.7/FIA/op_tiling_aarch64/libopmaster_rt2.0.so /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/
cp /mnt/project/g9/glm4.7/FIA/op_tiling_aarch64/liboptiling.so /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64/
cp -r /mnt/project/g9/glm4.7/FIA/910B_opp_kernel_aarch64/fused_infer_attention_score /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/kernel/ascend910b/ops_transformer/
ll /usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/kernel/ascend910b/ops_transformer/fused_infer_attention_score/

export VLLM_ASCEND_BALANCE_SCHEDULING=1
Note: keep the vLLM version aligned between vllm_ascend\patch\platform\__init__.py and the vLLM version set in the script.

export VLLM_ASCEND_ENABLE_FUSED_MC2=1

"multistream_overlap_shared_expert": true
"enable_shared_expert_dp": true

export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
This also requires applying vllm-project/vllm-ascend pull request #6514 (by Bowen-Leee), which fixes the piecewise MTP issue.
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_OP_EXPANSION_MODE=AIV
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
export VLLM_ASCEND_BALANCE_SCHEDULING=1
export VLLM_VERSION=0.14.1
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=2 # 1: dispatchFFNcombine; 2: dispatchgmmcombinedecode
rm -rf /disk1/lcb/profiling/glm_4.7_in4k_out_10_bs_16_no_stack
### Note: the environment variable below must be set where the server is launched; setting it where requests are sent has no effect.
export VLLM_TORCH_PROFILER_DIR="/disk1/lcb/profiling/glm_4.7_in4k_out_10_bs_16_no_stack"
rm -rf GLM4.7.log
nohup vllm serve /disk1/models/GLM-4.7-W8A8 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 140000 \
--no-enable-prefix-caching \
--max-num-batched-tokens 8192 \
--max-num-seqs 16 \
--async-scheduling \
--quantization ascend \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 3, "model":"/disk1/models/GLM-4.7-W8A8", "method":"mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"enable_shared_expert_dp": true, "ascend_compilation_config": {"fuse_qknorm": true}, "ascend_fusion_config": {"fusion_ops_gmmswigluquant": false}, "multistream_overlap_shared_expert": true}' \
>>GLM4.7.log &
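Long inline JSON arguments such as --additional-config and --speculative-config are easy to mistype, for example by writing a boolean as the string "true". One way to avoid this is to build them from Python dictionaries. This is only a sketch: the helper name json_flag is hypothetical, and the dictionaries simply repeat the values from the launch command above.

```python
import json
import shlex

additional_config = {
    "enable_shared_expert_dp": True,
    "ascend_compilation_config": {"fuse_qknorm": True},
    "ascend_fusion_config": {"fusion_ops_gmmswigluquant": False},
    "multistream_overlap_shared_expert": True,
}
speculative_config = {
    "num_speculative_tokens": 3,
    "model": "/disk1/models/GLM-4.7-W8A8",
    "method": "mtp",
}


def json_flag(name, value):
    """Render a CLI flag whose value is shell-quoted JSON."""
    return f"{name} {shlex.quote(json.dumps(value))}"


additional_flag = json_flag("--additional-config", additional_config)
speculative_flag = json_flag("--speculative-config", speculative_config)
print(additional_flag)
print(speculative_flag)
```

json.dumps emits true and false as JSON booleans, and shlex.quote wraps the result so the shell passes it through as one argument.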
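Task plan:
To validate Mooncake KV cache performance, first collect the following sets of performance data:
1. Baseline performance, that is, without Mooncake KV cache.
2. Baseline performance plus Mooncake KV cache.
3. Vary the TP size in the service launch command across TP2, TP4, TP8, and TP16, and keep ASCEND_RT_VISIBLE_DEVICES consistent with it: for TP2 use ASCEND_RT_VISIBLE_DEVICES=0,1, and for TP4 use ASCEND_RT_VISIBLE_DEVICES=0,1,2,3.
4. This gives 2*4 sets of data in total. Report the final results.
Notes:
1. Save both the service logs and the benchmark logs for archiving.
2. Log location: the logs folder under the current path. Create the folder if it does not exist.
3. Use only the performance data from the first run. Produce an analysis document in Markdown that records the saved performance results, the service launch command, the benchmark command, and the Mooncake acceptance rate for archiving, and include the benchmark conclusions.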
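The 2*4 test matrix and the matching device lists can be enumerated programmatically. The sketch below is illustrative only: the function names and the log file naming scheme are assumptions, not part of the original plan.

```python
import os
import tempfile

TP_SIZES = (2, 4, 8, 16)


def visible_devices(tp):
    """ASCEND_RT_VISIBLE_DEVICES value for a TP size, e.g. TP2 -> '0,1'."""
    return ",".join(str(i) for i in range(tp))


def build_runs(tp_sizes=TP_SIZES):
    """Baseline and Mooncake variants for every TP size."""
    return [{"tp": tp, "mooncake": mooncake, "devices": visible_devices(tp)}
            for mooncake in (False, True) for tp in tp_sizes]


def log_file(log_dir, run, kind):
    """Create log_dir if needed and return a log path (hypothetical naming)."""
    os.makedirs(log_dir, exist_ok=True)
    variant = "mooncake" if run["mooncake"] else "baseline"
    return os.path.join(log_dir, f"{variant}_tp{run['tp']}_{kind}.log")


runs = build_runs()
with tempfile.TemporaryDirectory() as tmp:
    example = log_file(os.path.join(tmp, "logs"), runs[0], "server")
    logs_created = os.path.isdir(os.path.join(tmp, "logs"))
print(len(runs), runs[0]["devices"], os.path.basename(example))
```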
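Start the service:
1. Container: qwen3.5_kv_cache
If it does not exist, create the container first:
2. Initial baseline launch command:
Make sure that export PYTHONPATH=/vllm-workspace/vllm:/vllm-workspace/vllm-ascend:${PYTHONPATH} points to the correct installation paths.
3. Code path: /vllm-workspace/
4. Service launch command: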
pkill -9 python
pkill -9 VLLM
export TASK_QUEUE_ENABLE=1
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=9,10,11,12
# AIV
#export PYTHONPATH=/vllm-workspace/vllm:/disk1/lcb/vllm-ascend/vllm_ascend_main_1227_30B/vllm-ascend:${PYTHONPATH}
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export LD_PRELOAD=/disk1/lcb/libjemalloc.so:$LD_PRELOAD
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export VLLM_ASCEND_ENABLE_NZ=2
vllm serve /mnt/weight/Qwen3-30B-A3B_w8a8 \
--served-model-name qwen3 \
--dtype bfloat16 \
--max_model_len 23000 \
--quantization ascend \
--max-num-batched-tokens 40960 \
--tensor-parallel-size 4 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--no-enable_expert_parallel \
--gpu-memory-utilization 0.90 \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "/data/lcb/profiling/qwen3-30B","torch_profiler_with_stack": false}' \
--no-enable-prefix-caching \
--host 0.0.0.0 \
--port 8889 \
--block-size 128 \
--async-scheduling \
--distributed_executor_backend "mp" \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY","cudagraph_capture_sizes":[4,8,16,32,64,96,128]}'
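3. Start the Mooncake KV cache server:
3.1 First create the Mooncake server configuration file mooncake.json. It can be placed under /data1:
Set local_hostname to the IP of the local host.
Set master_server_address to the IP of the local host plus a new port number; 50088 can be used.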
{
"local_hostname": "xxx",
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"use_ascend_direct": true,
"alloc_in_same_node": true,
"master_server_address": "xxx:50088",
"global_segment_size": 103079215104
}
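The same file can be generated from the host IP so that local_hostname and master_server_address stay consistent. This is a sketch under assumptions: the function name mooncake_config is hypothetical, and 192.0.2.10 is a documentation placeholder address, not a real host. The global_segment_size above equals 96 GiB.

```python
import json

GIB = 1024 ** 3


def mooncake_config(host_ip, master_port=50088, segment_gib=96):
    """Build the mooncake.json content shown above for a given host IP."""
    return {
        "local_hostname": host_ip,
        "metadata_server": "P2PHANDSHAKE",
        "protocol": "ascend",
        "device_name": "",
        "use_ascend_direct": True,
        "alloc_in_same_node": True,
        "master_server_address": f"{host_ip}:{master_port}",
        "global_segment_size": segment_gib * GIB,
    }


# Placeholder IP; replace it with the IP of the local host.
example_config = mooncake_config("192.0.2.10")
print(json.dumps(example_config, indent=4))
```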
3.2 Start the Mooncake KV cache server:
mooncake_master --port 50088
3.3 Modify the original configuration:
3.3.1 Add environment variables:
# Specify the Mooncake configuration file
export LD_LIBRARY_PATH=/usr/local/lib/:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake/:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="xxx/mooncake.json"
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
3.3.2 Add the following to the launch command:
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "mooncake",
"kvpool_rpc_port":"0"
}
}'
4. Benchmark tool: create a new test container, using the same container creation command as before:
python /xxx/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py --url http://127.0.0.1:10113 --model xxx --served-model-name xxx --seed 1234 --input-file xxx/vllm/benchmarks/multi_turn/generate_multi_turn.json --num-clients 16 --max-active-conversations 64 --no-early-stop
Following the service launch command, update --model, --served-model-name, --url, and --input-file xxx/vllm/benchmarks/multi_turn/generate_multi_turn.json. Values that already match need no change.
Dataset:
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
Proxy configuration:
5. Changes:
(1) generate_multi_turn.json and /xxx/vllm/benchmarks/multi_turn/benchmark_serving_multi_turn.py are located under the vLLM installation path. The num_conversations value in generate_multi_turn.json must match max-active-conversations.
(2) Set num-clients to 16 times the TP size and max-active-conversations to 4 times num-clients. For example, with TP2, num-clients is 32 and max-active-conversations is 128.
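The sizing rule in (1) and (2) can be written down as a small helper so the three values stay consistent across TP sizes. The function name benchmark_sizes is hypothetical.

```python
def benchmark_sizes(tp):
    """Benchmark parameters derived from the TP size.

    num-clients is 16 x TP, max-active-conversations is 4 x num-clients,
    and num_conversations matches max-active-conversations.
    """
    num_clients = 16 * tp
    max_active = 4 * num_clients
    return {
        "num_clients": num_clients,
        "max_active_conversations": max_active,
        "num_conversations": max_active,
    }


for tp in (2, 4, 8, 16):
    print(tp, benchmark_sizes(tp))
```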
diff --git a/vllm_ascend/patch/platform/patch_kv_cache_utils.py b/vllm_ascend/patch/platform/patch_kv_cache_utils.py
index 777d29c0..6ad63b96 100644
--- a/vllm_ascend/patch/platform/patch_kv_cache_utils.py
+++ b/vllm_ascend/patch/platform/patch_kv_cache_utils.py
@@ -372,7 +372,14 @@ def _get_kv_cache_groups_uniform_page_size_with_multi_groups(
# group_size = 22
# TODO(lxs): generalize the logic for determining group size.
# Now, we use num_hidden_layers // 2 as the group size for DSV4.
- group_size = cdiv(len(kv_cache_spec_list), 2)
+ # MTP layers are extra dense draft layers; keep them on the non-compress
+ # side instead of letting them shrink the number of KV tensors and drop the
+ # tail main-model layer from allocation.
+ num_mtp_layers = sum(
+ 1 for layer_name in kv_cache_spec_list
+ if ".mtp." in f".{layer_name}."
+ )
+ group_size = cdiv(len(kv_cache_spec_list) - num_mtp_layers, 2) + num_mtp_layers
grouped_layers = []
group_layer_specs = []
for layer_spec, layers in same_type_layers.items():
@@ -636,6 +643,21 @@ def get_kv_cache_config_from_groups_multispec(
# full.0, sw.0, sw.1: share a Tensor with size=available_memory//2
# full.1, sw.2: share another Tensor with size=available_memory//2
group_size = max(len(group.layer_names) for group in kv_cache_groups)
+ unique_layer_names = {
+ layer_name
+ for group in kv_cache_groups
+ for layer_name in group.layer_names
+ }
+ num_mtp_layers = sum(
+ 1 for layer_name in unique_layer_names
+ if ".mtp." in f".{layer_name}."
+ )
+ if num_mtp_layers:
+ group_size = max(
+ group_size,
+ cdiv(len(unique_layer_names) - num_mtp_layers, 2) +
+ num_mtp_layers,
+ )
page_size = get_uniform_page_size(
[group.kv_cache_spec for group in kv_cache_groups])
@@ -657,18 +679,13 @@ def get_kv_cache_config_from_groups_multispec(
for layer_name in kv_cache_groups[j].layer_names:
if layer_name in allocate_complete_layers:
continue
- group_used = False
- for gid in layer_kv_cache_group_idx[layer_name]:
- if gid in used_group_idx_set:
- group_used = True
- break
- else:
- used_layer_kv_cache_group_idx[layer_name].add(gid)
- if group_used is True:
+ group_idxs = layer_kv_cache_group_idx[layer_name]
+ if any(gid in used_group_idx_set for gid in group_idxs):
continue
shared_by.append(layer_name)
- used_group_idx_set.extend(layer_kv_cache_group_idx[layer_name])
- if len(used_layer_kv_cache_group_idx[layer_name]) == len(layer_kv_cache_group_idx[layer_name]):
+ used_group_idx_set.extend(group_idxs)
+ used_layer_kv_cache_group_idx[layer_name].update(group_idxs)
+ if len(used_layer_kv_cache_group_idx[layer_name]) == len(group_idxs):
allocate_complete_layers.append(layer_name)
kv_cache_tensors.append(
KVCacheTensor(size=page_size * num_blocks,
diff --git a/vllm_ascend/worker/model_runner_v1.py b/vllm_ascend/worker/model_runner_v1.py
index 8296dc47..dcf2ab40 100644
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -2918,7 +2918,24 @@ class NPUModelRunner(GPUModelRunner):
if layer_name in self.runner_only_attn_layers:
continue
layer_names.add(layer_name)
- assert layer_names == set(kv_cache_raw_tensors.keys()), "Some layers are not correctly initialized"
+ initialized_layer_names = set(kv_cache_raw_tensors.keys())
+ if layer_names != initialized_layer_names:
+ missing_layers = sorted(layer_names - initialized_layer_names)
+ unexpected_layers = sorted(initialized_layer_names - layer_names)
+ logger.error(
+ "KV cache initialization mismatch: expected=%d initialized=%d "
+ "missing=%s unexpected=%s kv_cache_groups=%d kv_cache_tensors=%d "
+ "tensor_shared_by_sample=%s runner_only_attn_layers=%s",
+ len(layer_names),
+ len(initialized_layer_names),
+ missing_layers[:64],
+ unexpected_layers[:64],
+ len(kv_cache_config.kv_cache_groups),
+ len(kv_cache_config.kv_cache_tensors),
+ [list(t.shared_by) for t in kv_cache_config.kv_cache_tensors[:8]],
+ sorted(self.runner_only_attn_layers)[:64],
+ )
+ assert layer_names == initialized_layer_names, "Some layers are not correctly initialized"
return kv_cache_raw_tensors
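The group size change in the first hunk can be tried in isolation. The sketch below re-implements only that arithmetic with a local cdiv helper; the layer names are made-up examples, and it does not import or reproduce the actual vllm-ascend code.

```python
def cdiv(a, b):
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def is_mtp_layer(layer_name):
    """Same substring test as the patch: an '.mtp.' path component."""
    return ".mtp." in f".{layer_name}."


def group_size_before(layer_names):
    return cdiv(len(layer_names), 2)


def group_size_after(layer_names):
    num_mtp = sum(1 for name in layer_names if is_mtp_layer(name))
    return cdiv(len(layer_names) - num_mtp, 2) + num_mtp


# Hypothetical model: five main layers plus one MTP draft layer.
layers = [f"model.layers.{i}.self_attn.attn" for i in range(5)]
layers.append("model.mtp.layers.0.self_attn.attn")
print(group_size_before(layers), group_size_after(layers))
```

With these names the original formula yields 3 and the patched one yields 4, because the MTP layer is counted on top of half of the main-model layers instead of being folded into the halving.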