Ascend-SACT/Gemma-4

Gemma 4 Patch Application and Run Guide

1. Introduction

This directory contains the patch deliverables for this release:

  • 0001-vllm.patch
  • 0002-vllm-ascend.patch

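Where:

  • 0001-vllm.patch
    • Targets vllm
    • Adds Gemma 4 model family integration, gemma4_unified architecture support, and related compatibility fixes
  • 0002-vllm-ascend.patch
    • Targets vllm-ascend
    • Contains compatibility and performance fixes for running Gemma 4 on Ascend, covering the runtime, graph mode, MoE/EP, attention, and related areas

Notes:

  • The patches apply to the Gemma 4 family:
    • gemma-4-E2B-it
    • gemma-4-E4B-it
    • gemma-4-12B-it
    • gemma-4-26B-A4B-it
    • gemma-4-31B-it
  • Performance for this document was re-measured in an A2 environment; only A2 data is included.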

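2. Applying the Patches

Base image:

  • quay.io/ascend/vllm-ascend:v0.20.2rc1

Environment notes:

  • Before deploying Gemma 4, upgrade transformers to 5.10.1
  • Versions in the validated environment:
    • vllm==0.20.2+empty
    • vllm_ascend==0.20.2rc1
    • transformers==5.10.1

Upgrade transformers and print the installed version: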

python -m pip install --upgrade "transformers==5.10.1"
python - <<'PY'
import transformers
print(transformers.__version__)
PY
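The version printout above can be turned into a hard pre-flight check, so that a deployment script stops early on a mismatch. This is a minimal sketch; `require_version` is a hypothetical helper name, not part of vllm or transformers, and the usage lines are commented out because they need the target environment.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the installed version equals the expected one.
# Usage: require_version <package-name> <actual-version> <expected-version>
require_version() {
  name=$1; actual=$2; expected=$3
  if [ "$actual" = "$expected" ]; then
    echo "OK: $name==$actual"
  else
    echo "MISMATCH: $name is '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Usage inside the container (requires transformers to be installed):
# actual=$(python -c 'import transformers; print(transformers.__version__)')
# require_version transformers "$actual" 5.10.1 || exit 1
```

An exact string comparison is used on purpose, since this guide pins transformers to a single version rather than a range.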

To confirm the installed versions and code locations in the current environment:

python -m pip show vllm vllm-ascend transformers

Set the variables (adjust the paths to your environment):

PATCH_DIR=/path/to/patches
VLLM_REPO=/vllm-workspace/vllm
VLLM_ASCEND_REPO=/vllm-workspace/vllm-ascend
MODEL_ROOT=/models

Apply the vllm patch:

cd "$VLLM_REPO"
git apply --check "$PATCH_DIR/0001-vllm.patch"
git apply "$PATCH_DIR/0001-vllm.patch"

Apply the vllm-ascend patch:

cd "$VLLM_ASCEND_REPO"
git apply --check "$PATCH_DIR/0002-vllm-ascend.patch"
git apply "$PATCH_DIR/0002-vllm-ascend.patch"
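Running `git apply` a second time on an already-patched tree fails, which is inconvenient in re-runnable setup scripts. The sketch below wraps the check-then-apply sequence so that a repeated run is a no-op. `apply_patch_once` is a hypothetical helper name; it assumes an absolute patch path, because `git -C` changes the working directory before the path is resolved.

```shell
#!/bin/sh
# Hypothetical helper: apply a patch only if it is not already applied.
# Usage: apply_patch_once <repo-dir> <absolute-patch-path>
apply_patch_once() {
  repo=$1; patch=$2
  if git -C "$repo" apply -R --check "$patch" 2>/dev/null; then
    # The reverse patch applies cleanly, so the change is already in the tree.
    echo "already applied: $patch"
  elif git -C "$repo" apply --check "$patch"; then
    # The forward patch applies cleanly: apply it for real.
    git -C "$repo" apply "$patch" && echo "applied: $patch"
  else
    echo "cannot apply cleanly: $patch" >&2
    return 1
  fi
}

# Usage with the variables defined earlier in this guide:
# apply_patch_once "$VLLM_REPO" "$PATCH_DIR/0001-vllm.patch"
# apply_patch_once "$VLLM_ASCEND_REPO" "$PATCH_DIR/0002-vllm-ascend.patch"
```

The reverse check runs first because it is the only way to tell "already applied" apart from "conflicts with local changes"; both cases make the plain forward check fail.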

To roll back applied patches, reverse them in the opposite order of application:

cd "$VLLM_ASCEND_REPO"
git apply -R --check "$PATCH_DIR/0002-vllm-ascend.patch"
git apply -R "$PATCH_DIR/0002-vllm-ascend.patch"

cd "$VLLM_REPO"
git apply -R --check "$PATCH_DIR/0001-vllm.patch"
git apply -R "$PATCH_DIR/0001-vllm.patch"

3. Starting the Service

All commands below enable FULL_DECODE_ONLY and do not pass the cudagraph_capture_sizes parameter.

3.1 gemma-4-E2B-it, single logical NPU

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-E2B-it \
  --served-model-name gemma-4-E2B-it \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

3.2 gemma-4-E4B-it, single logical NPU

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-E4B-it \
  --served-model-name gemma-4-E4B-it \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

3.3 gemma-4-12B-it, two logical NPUs

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-12B-it \
  --served-model-name gemma-4-12B-it \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

3.4 gemma-4-26B-A4B-it, two logical NPUs

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1 \
HCCL_OP_EXPANSION_MODE=AIV \
HCCL_BUFFSIZE=256 \
vllm serve "$MODEL_ROOT"/gemma-4-26B-A4B-it \
  --served-model-name gemma-4-26B-A4B-it \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

3.5 gemma-4-31B-it, four logical NPUs

cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-31B-it \
  --served-model-name gemma-4-31B-it \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
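The five launch commands in sections 3.1 to 3.5 differ only in the visible devices, the tensor-parallel size, and the two MoE-specific settings for gemma-4-26B-A4B-it. The sketch below captures that mapping in one place and prints the resulting command for review. `print_serve_cmd` is a hypothetical helper name; it only prints text and does not start a server.

```shell
#!/bin/sh
# Hypothetical helper: print the launch command for one Gemma 4 model,
# using the per-model settings from sections 3.1-3.5 of this guide.
print_serve_cmd() {
  model=$1
  extra_env=""
  parallel_flags=""
  case $model in
    gemma-4-E2B-it|gemma-4-E4B-it)
      devices=0 ;;
    gemma-4-12B-it)
      devices=0,1; parallel_flags=" --tensor-parallel-size 2" ;;
    gemma-4-26B-A4B-it)
      devices=0,1; extra_env=" HCCL_BUFFSIZE=256"
      parallel_flags=" --tensor-parallel-size 2 --enable-expert-parallel" ;;
    gemma-4-31B-it)
      devices=0,1,2,3; parallel_flags=" --tensor-parallel-size 4" ;;
    *)
      echo "unknown model: $model" >&2; return 1 ;;
  esac
  # Flags shared by all five commands.
  common="--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 --enable-prefix-caching --limit-mm-per-prompt '{\"image\":2,\"audio\":1,\"video\":1}' --compilation-config '{\"cudagraph_mode\":\"FULL_DECODE_ONLY\"}'"
  printf '%s\n' "ASCEND_RT_VISIBLE_DEVICES=$devices HCCL_OP_EXPANSION_MODE=AIV$extra_env vllm serve \"\$MODEL_ROOT\"/$model --served-model-name $model$parallel_flags $common"
}

# Example: print_serve_cmd gemma-4-26B-A4B-it
```

Printing the command instead of executing it keeps the helper safe to run anywhere; the output can be pasted into a shell on the NPU host once it has been checked.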

4. Basic Verification

Service readiness check:

curl -sS http://127.0.0.1:8000/v1/models
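Model loading and graph capture can take a while, so a single request right after launch may fail simply because the server is not listening yet. The sketch below polls the same endpoint until it answers or a timeout expires. `wait_for_ready` is a hypothetical helper name, and the 600-second timeout in the example is an arbitrary assumption, not a measured startup time.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers successfully or a timeout expires.
# Usage: wait_for_ready <url> <timeout-seconds> [interval-seconds]
wait_for_ready() {
  url=$1; timeout=$2; interval=${3:-5}
  waited=0
  # -f makes curl return non-zero on HTTP errors as well as connection failures.
  while ! curl -s -f -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "not ready after ${waited}s: $url" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "ready: $url"
}

# Example (timeout value is an assumption):
# wait_for_ready http://127.0.0.1:8000/v1/models 600
```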

Text generation check (set "model" to the served model name of the running service):

curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gemma-4-12B-it",
    "messages":[
      {"role":"user","content":"Introduce the Gemma 4 model"}
    ],
    "temperature":1.0,
    "max_tokens":512
  }'

Thinking check:

curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gemma-4-12B-it",
    "messages":[
      {"role":"user","content":"Compute 1234 + 5678 and briefly explain the steps"}
    ],
    "temperature":1.0,
    "max_tokens":1024,
    "chat_template_kwargs":{"enable_thinking":true}
  }'
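To check the thinking response automatically instead of reading the JSON by eye, the output can be piped into a small filter. This is a sketch under assumptions: the name of the field that carries the reasoning text is not stated in this guide, so the helper accepts either `reasoning` or `reasoning_content`, and `has_reasoning` is a hypothetical name. It relies on `python3` for JSON parsing.

```shell
#!/bin/sh
# Hypothetical helper: read a chat-completions response on stdin and succeed
# only if the first choice carries a non-empty reasoning field.
has_reasoning() {
  python3 -c '
import json, sys
msg = json.load(sys.stdin)["choices"][0]["message"]
# The field name is an assumption: accept either spelling.
text = msg.get("reasoning") or msg.get("reasoning_content") or ""
sys.exit(0 if text.strip() else 1)
'
}

# Example: append "| has_reasoning && echo thinking-ok" to the curl command above.
```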

FULL_DECODE_ONLY log check (assumes the server output was saved to server.log):

grep -E "Capturing CUDA graphs \(decode, FULL\)|Graph capturing finished|Application startup complete" server.log
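A single alternation grep prints whichever lines match but does not say which marker is absent. The sketch below checks each marker separately and returns non-zero if any is missing. `check_startup_log` is a hypothetical helper name, and the marker strings are assumptions about the log format, with the capture marker taken to be the literal text `Capturing CUDA graphs (decode, FULL)`.

```shell
#!/bin/sh
# Hypothetical helper: report which startup markers are present in a log file.
# Usage: check_startup_log <log-file>
check_startup_log() {
  log=$1; missing=0
  for marker in \
    "Capturing CUDA graphs (decode, FULL)" \
    "Graph capturing finished" \
    "Application startup complete"
  do
    # -F treats the marker as a fixed string, so the parentheses need no escaping.
    if grep -qF "$marker" "$log"; then
      echo "found:   $marker"
    else
      echo "missing: $marker"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_startup_log server.log
```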

5. Notes

  • gemma-4-12B-it uses Gemma4UnifiedForConditionalGeneration and depends on transformers==5.10.1.
  • For gemma-4-26B-A4B-it on two NPUs, explicitly set HCCL_BUFFSIZE=256 and enable --enable-expert-parallel.