This directory contains the patch deliverables for this release:
0001-vllm.patch
0002-vllm-ascend.patch

Where:
0001-vllm.patch: vllm, gemma4_unified architecture support and related compatibility fixes
0002-vllm-ascend.patch: vllm-ascend

Notes (models covered):
gemma-4-E2B-it
gemma-4-E4B-it
gemma-4-12B-it
gemma-4-26B-A4B-it
gemma-4-31B-it

Base image:
quay.io/ascend/vllm-ascend:v0.20.2rc1

Environment (transformers is upgraded to 5.10.1):
vllm==0.20.2+empty
vllm_ascend==0.20.2rc1
transformers==5.10.1

Upgrade transformers:
python -m pip install --upgrade "transformers==5.10.1"
python - <<'PY'
import transformers
print(transformers.__version__)
PY

To confirm which code paths the current environment uses:
python -m pip show vllm vllm-ascend transformers

Set up variables:
PATCH_DIR=/path/to/patches
VLLM_REPO=/vllm-workspace/vllm
VLLM_ASCEND_REPO=/vllm-workspace/vllm-ascend
MODEL_ROOT=/models

Apply the vllm patch:
cd "$VLLM_REPO"
git apply --check "$PATCH_DIR/0001-vllm.patch"
git apply "$PATCH_DIR/0001-vllm.patch"

Apply the vllm-ascend patch:
cd "$VLLM_ASCEND_REPO"
git apply --check "$PATCH_DIR/0002-vllm-ascend.patch"
git apply "$PATCH_DIR/0002-vllm-ascend.patch"

To revert the applied patches, run:
cd "$VLLM_ASCEND_REPO"
git apply -R --check "$PATCH_DIR/0002-vllm-ascend.patch"
git apply -R "$PATCH_DIR/0002-vllm-ascend.patch"
cd "$VLLM_REPO"
git apply -R --check "$PATCH_DIR/0001-vllm.patch"
git apply -R "$PATCH_DIR/0001-vllm.patch"

All launch commands below enable FULL_DECODE_ONLY and do not pass the cudagraph_capture_sizes parameter.
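
The launch commands below differ only in the model name, the tensor-parallel size, and a few extra flags, so the shared part could be factored into a small shell function. The following is a minimal sketch, not part of the patches: the function name gemma4_serve and the VLLM_BIN override (used for dry runs) are hypothetical, and the environment variables such as ASCEND_RT_VISIBLE_DEVICES are assumed to be exported by the caller.

```shell
# Hypothetical wrapper around the shared launch flags (not part of the patches).
# Usage: gemma4_serve <model-name> [tensor-parallel-size] [extra vllm flags...]
gemma4_serve() {
    if [ "$#" -lt 1 ]; then
        echo "usage: gemma4_serve <model-name> [tp-size] [extra flags...]" >&2
        return 2
    fi
    model=$1
    tp=${2:-1}
    shift
    if [ "$#" -gt 0 ]; then shift; fi   # drop tp-size when it was given

    # Pass --tensor-parallel-size only for multi-NPU launches, matching
    # the commands in this document.
    if [ "$tp" -gt 1 ]; then
        set -- --tensor-parallel-size "$tp" "$@"
    fi

    # VLLM_BIN is an assumed override for dry runs; it defaults to vllm.
    "${VLLM_BIN:-vllm}" serve "${MODEL_ROOT:-/models}/$model" \
        --served-model-name "$model" \
        "$@" \
        --enable-auto-tool-choice \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4 \
        --enable-prefix-caching \
        --limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
        --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
}
```

For example, after exporting ASCEND_RT_VISIBLE_DEVICES=0,1, HCCL_OP_EXPANSION_MODE=AIV and HCCL_BUFFSIZE=256, running `gemma4_serve gemma-4-26B-A4B-it 2 --enable-expert-parallel` would mirror the 26B command below. Setting VLLM_BIN=echo prints the argument list instead of starting a server.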

gemma-4-E2B-it, one logical NPU
cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-E2B-it \
--served-model-name gemma-4-E2B-it \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-prefix-caching \
--limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

gemma-4-E4B-it, one logical NPU
cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-E4B-it \
--served-model-name gemma-4-E4B-it \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-prefix-caching \
--limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

gemma-4-12B-it, two logical NPUs
cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-12B-it \
--served-model-name gemma-4-12B-it \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-prefix-caching \
--limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

gemma-4-26B-A4B-it, two logical NPUs
cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1 \
HCCL_OP_EXPANSION_MODE=AIV \
HCCL_BUFFSIZE=256 \
vllm serve "$MODEL_ROOT"/gemma-4-26B-A4B-it \
--served-model-name gemma-4-26B-A4B-it \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-prefix-caching \
--limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

gemma-4-31B-it, four logical NPUs
cd /workspace
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve "$MODEL_ROOT"/gemma-4-31B-it \
--served-model-name gemma-4-31B-it \
--tensor-parallel-size 4 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--enable-prefix-caching \
--limit-mm-per-prompt '{"image":2,"audio":1,"video":1}' \
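--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Service readiness check:
curl -sS http://127.0.0.1:8000/v1/models

Text generation check (the examples use gemma-4-12B-it; set "model" to the name passed to --served-model-name):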
curl -sS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"gemma-4-12B-it",
"messages":[
{"role":"user","content":"Introduce the Gemma 4 model"}
],
"temperature":1.0,
"max_tokens":512
}'

Thinking check:
curl -sS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"gemma-4-12B-it",
"messages":[
{"role":"user","content":"Compute 1234 + 5678 and briefly explain the steps"}
],
"temperature":1.0,
"max_tokens":1024,
"chat_template_kwargs":{"enable_thinking":true}
}'

FULL_DECODE_ONLY log check (assumes the server output was saved to server.log):
grep -E "Capturing CUDA graphs \(decode, FULL\)|Graph capturing finished|Application startup complete" server.log

gemma-4-12B-it uses Gemma4UnifiedForConditionalGeneration and depends on transformers==5.10.1.
For gemma-4-26B-A4B-it on two NPUs, explicitly setting HCCL_BUFFSIZE=256 and enabling --enable-expert-parallel is recommended.
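
The readiness check and the log check above can be scripted so that a launch is verified without manual polling. The following is a minimal sketch under stated assumptions: the function names wait_for_server and check_startup_log are hypothetical, the server output is assumed to have been redirected to a log file, and the first marker is assumed to appear in the log literally as "Capturing CUDA graphs (decode, FULL)".

```shell
# Hypothetical helper: poll an endpoint until it answers or a timeout expires.
# Usage: wait_for_server <url> [timeout-seconds]
wait_for_server() {
    url=$1
    timeout=${2:-600}
    elapsed=0
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    until curl -sSf --max-time 5 -o /dev/null "$url" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "server not ready after ${timeout}s: $url" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "server ready: $url"
}

# Hypothetical helper: report each expected startup marker separately.
# Usage: check_startup_log <log-file>
check_startup_log() {
    log=$1
    missing=0
    for marker in \
        'Capturing CUDA graphs (decode, FULL)' \
        'Graph capturing finished' \
        'Application startup complete'
    do
        # -F treats the marker as a fixed string, so the parentheses
        # need no escaping.
        if grep -qF -- "$marker" "$log"; then
            echo "found:   $marker"
        else
            echo "missing: $marker"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical sequence would be `wait_for_server http://127.0.0.1:8000/v1/models && check_startup_log server.log`. Unlike the single grep, the per-marker report shows which stage is absent when graph capture did not run.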