Ascend-SACT/Gemma-4

Gemma 4 Patch Application and Runbook

1. Introduction

This directory contains the patches delivered in this release:

  • 0001-Add-Gemma4-support-in-vllm.patch
  • 0002-Add-Gemma4-runtime-fixes-in-vllm-ascend.patch

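Where:

  • 0001-Add-Gemma4-support-in-vllm.patch
    • Targets vllm
    • Combines the original Gemma4 support patch with the vllm-side performance optimizations retained in this release
  • 0002-Add-Gemma4-runtime-fixes-in-vllm-ascend.patch
    • Targets vllm-ascend
    • Combines the original Gemma4 Ascend runtime patch with the vllm-ascend-side compatibility and performance fixes retained in this release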

Note:

  • The patches apply to the whole Gemma 4 family: gemma-4-E2B-it, gemma-4-E4B-it, gemma-4-26B-A4B-it, and gemma-4-31B-it can all use them, on both A2 and A3.

2. Applying the patches

Base image:

  • quay.io/ascend/vllm-ascend:v0.18.0rc1

Environment:

  • Before deploying Gemma 4, upgrade transformers to 5.5.0
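
A quick pre-flight check can catch a stale transformers install before the server is started. The sketch below is illustrative only: `version_ge` and `check_transformers` are hypothetical helper names, and it assumes that any version at or above 5.5.0 is acceptable, which this README does not state explicitly.

```shell
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_transformers: hypothetical helper, not part of the patches.
# Assumption: versions newer than 5.5.0 are also fine.
check_transformers() {
  required="5.5.0"
  installed="$(python3 -c 'import transformers; print(transformers.__version__)' 2>/dev/null)"
  if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "transformers $installed is new enough"
  else
    echo "transformers ${installed:-not installed}; run: pip install 'transformers==$required'" >&2
    return 1
  fi
}
```

Calling `check_transformers` inside the container before `vllm serve` returns non-zero when the upgrade is still pending.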

Set up the variables:

PATCH_DIR=/path/to/patches
VLLM_REPO=/vllm-workspace/vllm
VLLM_ASCEND_REPO=/vllm-workspace/vllm-ascend
MODEL_ROOT=/models

Apply the vllm patch:

cd "$VLLM_REPO"
git apply --check "$PATCH_DIR/0001-Add-Gemma4-support-in-vllm.patch"
git apply "$PATCH_DIR/0001-Add-Gemma4-support-in-vllm.patch"

Apply the vllm-ascend patch:

cd "$VLLM_ASCEND_REPO"
git apply --check "$PATCH_DIR/0002-Add-Gemma4-runtime-fixes-in-vllm-ascend.patch"
git apply "$PATCH_DIR/0002-Add-Gemma4-runtime-fixes-in-vllm-ascend.patch"
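
Running `git apply` a second time on an already patched tree fails, which is awkward in provisioning scripts that may be re-run. One way to make the step repeatable is sketched below. `apply_patch_once` is a hypothetical helper, not something shipped with the patches. It treats a patch that reverse-applies cleanly as already applied.

```shell
# apply_patch_once REPO PATCH
# Prints "applied" or "already-applied"; returns 1 if the patch fits neither way.
# Pass PATCH as an absolute path, because `git -C` changes the working directory.
apply_patch_once() {
  repo="$1"
  patch="$2"
  if git -C "$repo" apply --check "$patch" 2>/dev/null; then
    # The patch applies cleanly: apply it for real.
    git -C "$repo" apply "$patch" && echo "applied"
  elif git -C "$repo" apply -R --check "$patch" 2>/dev/null; then
    # The reverse patch applies cleanly, so the tree already contains the change.
    echo "already-applied"
  else
    echo "patch does not apply to $repo: $patch" >&2
    return 1
  fi
}
```

For example, `apply_patch_once "$VLLM_REPO" "$PATCH_DIR/0001-Add-Gemma4-support-in-vllm.patch"` can then be called any number of times.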

To roll back the applied patches, run:

cd "$VLLM_ASCEND_REPO"
git apply -R --check "$PATCH_DIR/0002-Add-Gemma4-runtime-fixes-in-vllm-ascend.patch"
git apply -R "$PATCH_DIR/0002-Add-Gemma4-runtime-fixes-in-vllm-ascend.patch"

cd "$VLLM_REPO"
git apply -R --check "$PATCH_DIR/0001-Add-Gemma4-support-in-vllm.patch"
git apply -R "$PATCH_DIR/0001-Add-Gemma4-support-in-vllm.patch"

3. Starting the service

3.1 gemma-4-E2B-it on one logical NPU

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-E2B-it \
  --served-model-name gemma-4-E2B-it \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8]}'

3.2 gemma-4-E4B-it on one logical NPU

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-E4B-it \
  --served-model-name gemma-4-E4B-it \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8]}'

3.3 gemma-4-26B-A4B-it on two logical NPUs

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1 \
HCCL_OP_EXPANSION_MODE=AIV \
HCCL_BUFFSIZE=256 \
vllm serve "$MODEL_ROOT"/gemma-4-26B-A4B-it \
  --served-model-name gemma-4-26B-A4B-it \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8]}'

3.4 gemma-4-31B-it on four logical NPUs

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-31B-it \
  --served-model-name gemma-4-31B-it \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8]}'
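
The four launch commands differ only in the visible devices, the parallelism flags, and the extra HCCL_BUFFSIZE setting for the 26B model. The sketch below captures that mapping in one place. `gemma4_serve_cmd` is a hypothetical helper that only prints the command line for review and does not start anything; it falls back to /models when MODEL_ROOT is unset.

```shell
# gemma4_serve_cmd MODEL: print the launch command for one of the four models.
gemma4_serve_cmd() {
  model="$1"
  model_root="${MODEL_ROOT:-/models}"
  env_vars="HCCL_OP_EXPANSION_MODE=AIV"
  parallel_flags=""
  case "$model" in
    gemma-4-E2B-it|gemma-4-E4B-it)
      devices="0" ;;
    gemma-4-26B-A4B-it)
      devices="0,1"
      env_vars="$env_vars HCCL_BUFFSIZE=256"
      parallel_flags="--tensor-parallel-size 2 --enable-expert-parallel " ;;
    gemma-4-31B-it)
      devices="0,1,2,3"
      parallel_flags="--tensor-parallel-size 4 " ;;
    *)
      echo "unknown model: $model" >&2
      return 1 ;;
  esac
  # Same compilation config as in every command above.
  compilation='{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8]}'
  printf '%s\n' "ASCEND_RT_VISIBLE_DEVICES=$devices $env_vars vllm serve \"$model_root/$model\" --served-model-name $model ${parallel_flags}--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --enable-prefix-caching --compilation-config '$compilation'"
}
```

Printing the command instead of running it keeps the helper safe to call on a machine without NPUs; the output can be pasted into a shell once it has been checked.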

4. Basic verification

Service readiness check:

curl -sS http://127.0.0.1:8000/v1/models
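
Model loading and graph capture can take a while, so a script usually needs to poll this endpoint rather than call it once. A minimal polling sketch follows. `wait_for_server` is a hypothetical helper, and the default timeout and interval are arbitrary assumptions, not measured start-up times.

```shell
# wait_for_server [BASE_URL] [TIMEOUT_S] [INTERVAL_S]
# Polls /v1/models until it answers with a successful HTTP status.
wait_for_server() {
  base_url="${1:-http://127.0.0.1:8000}"
  timeout_s="${2:-600}"
  interval_s="${3:-5}"
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each probe.
    if curl -sf --max-time 5 -o /dev/null "$base_url/v1/models"; then
      echo "ready"
      return 0
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "server not ready after ${timeout_s}s" >&2
  return 1
}
```

A typical use is `wait_for_server && curl -sS http://127.0.0.1:8000/v1/models`.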

Text generation check:

curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gemma-4-31B-it",
    "messages":[
      {"role":"user","content":"Introduce the xx large model"}
    ],
    "temperature":1.0,
    "max_tokens":512
  }'
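
The response is a JSON document in the OpenAI-compatible chat format, where the reply text sits under choices[0].message.content. To print only that text, the output can be piped through a small filter. The sketch below assumes python3 is available in the container; `extract_reply` is a hypothetical helper name.

```shell
# extract_reply: read a chat completion response on stdin, print the reply text.
extract_reply() {
  python3 -c '
import json, sys

resp = json.load(sys.stdin)
message = resp["choices"][0]["message"]
# content can be null in some responses, so fall back to an empty string.
print(message.get("content") or "")
'
}
```

Appending `| extract_reply` to the curl command above prints the generated text without the surrounding metadata.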

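5. Benchmark methodology

All performance figures in this document were collected with vllm bench serve, using the following settings:

  • Endpoint: /v1/chat/completions
  • Backend: openai-chat
  • Dataset: random
  • Concurrency: 1
  • Number of requests: 1
  • Output length: 1024 tokens
  • Input length (tokens):
    • 1024
    • 16384
    • 65536
  • temperature=0
  • seed=20260410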

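To show the effect of prefix cache hits, each workload was run twice in a row:

  • First run: warm-up only
  • Second run: the reported result

All figures in the tables below come from the second run, so they reflect a warm prefix cache.

Example command: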

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --base-url http://127.0.0.1:8000 \
  --model /home/model_gemma4/gemma-4-E4B-it \
  --served-model-name gemma-4-E4B-it \
  --tokenizer /home/model_gemma4/gemma-4-E4B-it \
  --dataset-name random \
  --random-input-len 65536 \
  --random-output-len 1024 \
  --num-prompts 1 \
  --max-concurrency 1 \
  --seed 20260410 \
  --temperature 0 \
  --disable-tqdm \
  --save-result \
  --result-dir /workspace/results/gemma-4-E4B-it/64k_1k \
  --result-filename round2.json \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 0,100
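
The warm-up run and the measured run use identical arguments apart from the result file name, so the pair can be wrapped in a loop. The sketch below reuses only the flags from the example command. `run_two_rounds` and the `BENCH_RUNNER` variable are hypothetical; setting `BENCH_RUNNER=echo` prints the commands instead of executing them, and the model path is assumed to live under MODEL_ROOT.

```shell
# run_two_rounds MODEL INPUT_LEN RESULT_DIR
# Round 1 warms the prefix cache, round 2 is the run to report.
run_two_rounds() {
  model="$1"        # served model name, e.g. gemma-4-E4B-it
  input_len="$2"    # 1024, 16384, or 65536
  result_dir="$3"
  model_path="${MODEL_ROOT:-/models}/$model"
  for round in 1 2; do
    ${BENCH_RUNNER:-} vllm bench serve \
      --backend openai-chat \
      --endpoint /v1/chat/completions \
      --base-url http://127.0.0.1:8000 \
      --model "$model_path" \
      --served-model-name "$model" \
      --tokenizer "$model_path" \
      --dataset-name random \
      --random-input-len "$input_len" \
      --random-output-len 1024 \
      --num-prompts 1 \
      --max-concurrency 1 \
      --seed 20260410 \
      --temperature 0 \
      --disable-tqdm \
      --save-result \
      --result-dir "$result_dir" \
      --result-filename "round${round}.json" \
      --ignore-eos \
      --percentile-metrics ttft,tpot,itl,e2el \
      --metric-percentiles 0,100 || return 1
  done
}
```

For example, `run_two_rounds gemma-4-E4B-it 65536 /workspace/results/gemma-4-E4B-it/64k_1k` leaves round1.json and round2.json in the result directory; only round2.json corresponds to the reported figures.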

6. Performance reference

6.1 gemma-4-E2B-it

A2

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     12.75                   80.31                      160.62                    121.30     12.35
16k/1k                    13.06                   78.43                      1333.28                   222.19     12.55
64k/1k                    19.41                   52.77                      3429.92                   650.70     18.33

A3

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     7.82                    130.97                     261.95                    71.54      7.57
16k/1k                    11.30                   90.60                      1540.22                   140.06     10.91
64k/1k                    29.17                   35.11                      2281.94                   479.08     28.04

6.2 gemma-4-E4B-it

A2

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     14.78                   69.27                      138.54                    140.25     14.31
16k/1k                    16.54                   61.91                      1052.50                   232.70     15.94
64k/1k                    24.11                   42.48                      2761.17                   668.83     22.91

A3

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     12.60                   81.26                      162.53                    81.81      12.24
16k/1k                    16.06                   63.77                      1084.12                   147.67     15.55
64k/1k                    23.49                   43.60                      2833.97                   453.76     22.51

6.3 gemma-4-26B-A4B-it

A2

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     12.28                   83.36                      166.72                    157.14     11.85
16k/1k                    14.71                   69.60                      1183.15                   265.37     14.12
64k/1k                    20.21                   50.67                      3293.70                   694.96     19.07

A3

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     11.65                   87.92                      175.84                    90.01      11.30
16k/1k                    14.35                   71.36                      1213.13                   163.44     13.87
64k/1k                    19.82                   51.67                      3358.36                   491.45     18.89

6.4 gemma-4-31B-it

A2

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     22.55                   45.41                      90.81                     187.93     21.86
16k/1k                    27.54                   37.18                      632.14                    298.76     26.63
64k/1k                    38.30                   26.74                      1737.82                   784.14     36.67

A3

Sequence length (in/out)  Benchmark duration (s)  Output throughput (tok/s)  Total throughput (tok/s)  TTFT (ms)  TPOT (ms)
1k/1k                     20.73                   49.39                      98.79                     101.08     20.17
16k/1k                    25.88                   39.56                      672.59                    171.98     25.13
64k/1k                    36.61                   27.97                      1818.23                   564.29     35.23
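
When reading the tables above, the two throughput columns appear to follow directly from the benchmark duration: output throughput is the output token count divided by the duration, and total throughput also counts the input tokens. This relationship is inferred from the numbers, not stated by the benchmark tool's documentation here. The sketch below reproduces the arithmetic with awk; `throughput` is a hypothetical helper.

```shell
# throughput INPUT_TOKENS OUTPUT_TOKENS DURATION_S
# Prints "<output tok/s> <total tok/s>", rounded to two decimals.
throughput() {
  awk -v in_tok="$1" -v out_tok="$2" -v dur="$3" \
    'BEGIN { printf "%.2f %.2f\n", out_tok / dur, (in_tok + out_tok) / dur }'
}
```

For the gemma-4-E2B-it 1k/1k row on A2, `throughput 1024 1024 12.75` prints `80.31 160.63`. The table lists 160.62 for total throughput; the small difference comes from the duration being rounded to two decimals in the table.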

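7. Notes

  • For the two-card gemma-4-26B-A4B-it setup, explicitly setting HCCL_BUFFSIZE=256 is still recommended
  • The 64k/1k test requires a sufficiently large max_model_len; the default configuration of all four models meets this requirement
  • If this README is later used with other NPU IDs or in another environment, re-run the tests there; do not reuse the A2/A3 figures measured on this machine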