gemma-4-31B-it-DFlash

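DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel. This is the draft model; it must be used together with google/gemma-4-31B-it.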
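
To make the idea of speculative decoding concrete, here is a minimal sketch of one greedy verification step. It is a generic illustration, not DFlash's actual implementation: the function name `verify_block` and the token IDs are hypothetical, and it assumes greedy decoding, where a drafted token is accepted only if it equals the target model's own choice at that position.

```python
from typing import List, Sequence


def verify_block(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification of one drafted block (illustrative only).

    draft:  tokens proposed by the draft model for the next k positions.
    target: the target model's greedy token at each of those positions,
            plus one extra position, so len(target) == len(draft) + 1.
    Returns the tokens emitted in this step: the longest prefix of the draft
    that matches the target, followed by one token from the target itself.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")

    emitted: List[int] = []
    for d, t in zip(draft, target):
        if d != t:
            break
        emitted.append(d)

    # The target's prediction at the first mismatch (or right after a fully
    # accepted block) was computed on a verified prefix, so it is emitted too.
    emitted.append(target[len(emitted)])
    return emitted


# Two drafted tokens match, the third does not: three tokens are emitted.
print(verify_block([5, 6, 7, 8], [5, 6, 9, 1, 2]))
```

Every step emits at least one token, so output under this scheme matches plain greedy decoding of the target model; the speedup comes from emitting several tokens per target forward pass when the draft is accurate.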

Quick Start

Installation

vLLM: Until Gemma4 DFlash support is merged, install vLLM from PR #41703:

uv pip install -U --torch-backend=auto \
  "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/41703/head"

SGLang:

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/23000/head#subdirectory=python"

Launch the Server

vLLM:

vllm serve google/gemma-4-31B-it \
  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-31B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
  --attention-backend triton_attn \
  --max-num-batched-tokens 32768 \
  --trust-remote-code
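
The JSON passed to `--speculative-config` is easy to mis-quote in a shell. As a sketch, the same command can be assembled in Python so that the JSON is serialized and quoted automatically. The flags and values are copied from the command above; the variable names are only illustrative.

```python
import json
import shlex

# Same speculative decoding settings as in the shell command above.
spec_config = {
    "method": "dflash",
    "model": "z-lab/gemma-4-31B-it-DFlash",
    "num_speculative_tokens": 15,
    "attention_backend": "flash_attn",
}

args = [
    "vllm", "serve", "google/gemma-4-31B-it",
    "--speculative-config", json.dumps(spec_config),
    "--attention-backend", "triton_attn",
    "--max-num-batched-tokens", "32768",
    "--trust-remote-code",
]

# shlex.join quotes each argument so the result can be pasted into a shell.
command = shlex.join(args)
print(command)
```

The `args` list can also be passed directly to `subprocess.run(args)`, which avoids shell quoting altogether.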

SGLang:

# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
  --model-path google/gemma-4-31B-it \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/gemma-4-31B-it-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend triton \
  --speculative-draft-attention-backend fa4 \
  --trust-remote-code

Usage

Use port 8000 for vLLM and port 30000 for SGLang.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
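
Since the two servers differ only in port, a small helper can pick the base URL for the client. This is a sketch: `base_url` and `DEFAULT_PORTS` are hypothetical names, and it assumes both servers run on their default ports and expose the OpenAI-compatible API under `/v1`, as in the example above.

```python
# Default ports named in this README (assumes the servers were started
# without overriding the port).
DEFAULT_PORTS = {"vllm": 8000, "sglang": 30000}


def base_url(backend: str, host: str = "localhost") -> str:
    """Return the OpenAI-compatible base URL for the given serving backend."""
    try:
        port = DEFAULT_PORTS[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return f"http://{host}:{port}/v1"


print(base_url("vllm"))
print(base_url("sglang"))
```

The result can be passed as `OpenAI(base_url=base_url("sglang"), api_key="EMPTY")`.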

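Benchmark Results

Setup: a single NVIDIA B300 GPU per server/run, vLLM, thinking mode enabled, maximum output length 4096, greedy decoding.

Throughput and Speedup

At concurrency 1, DFlash achieves a speedup of up to 5.8x.

Generated tokens/sec (speedup over the autoregressive baseline)

Block size = 16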

Task	Concurrency	Autoregressive (AR)	DFlash
Math500	1	77	447 (5.8x)
Math500	8	511	2650 (5.2x)
Math500	32	1308	4962 (3.8x)
GSM8K	1	78	408 (5.3x)
GSM8K	8	520	2321 (4.5x)
GSM8K	32	1382	4447 (3.2x)
HumanEval	1	76	420 (5.6x)
HumanEval	8	494	2389 (4.8x)
HumanEval	32	1145	4139 (3.6x)
MBPP	1	79	343 (4.4x)
MBPP	8	535	2036 (3.8x)
MBPP	32	1389	3636 (2.6x)
MT-Bench	1	79	236 (3.0x)
MT-Bench	8	503	1334 (2.7x)
MT-Bench	32	1177	2257 (1.9x)
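
The speedup in parentheses is the ratio of DFlash throughput to autoregressive throughput. The sketch below recomputes it for the Math500 rows; the function name `speedup` is only illustrative. Because the table shows rounded tokens/sec, a recomputed ratio can differ from the reported one by 0.1 for some rows (for example, 408 / 78 is about 5.23 for GSM8K at concurrency 1, while the table reports 5.3), presumably because the reported ratios came from unrounded measurements.

```python
def speedup(dflash_tps: float, ar_tps: float) -> float:
    """Ratio of DFlash throughput to the autoregressive baseline."""
    if ar_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return dflash_tps / ar_tps


# (task, concurrency, AR tokens/sec, DFlash tokens/sec) from the table above.
rows = [
    ("Math500", 1, 77, 447),
    ("Math500", 8, 511, 2650),
    ("Math500", 32, 1308, 4962),
]

for task, concurrency, ar_tps, dflash_tps in rows:
    print(f"{task} c{concurrency}: {speedup(dflash_tps, ar_tps):.1f}x")
```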

Acceptance Length

Task	c1	c8	c32
Math500	8.59	8.59	8.62
GSM8K	7.53	7.50	7.52
HumanEval	8.00	7.89	7.96
MBPP	6.13	6.13	6.14
MT-Bench	4.23	4.19	4.19
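
Acceptance length is commonly reported as the average number of tokens emitted per verification step. The sketch below shows that calculation under this assumed definition; the exact accounting used for the table above (for example, whether the extra token from the target model is counted) is not stated here, and the per-step counts are made-up values for illustration only.

```python
from statistics import mean
from typing import Sequence


def mean_acceptance_length(tokens_per_step: Sequence[int]) -> float:
    """Average number of tokens emitted per verification step."""
    if not tokens_per_step:
        raise ValueError("need at least one verification step")
    return mean(tokens_per_step)


# Hypothetical per-step counts, not measured data.
steps = [9, 8, 10, 7]
print(mean_acceptance_length(steps))
```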

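Acknowledgements

Special thanks to David Wang for outstanding engineering support on this project. We also thank Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request support for a new model, please fill out this form: DFlash Feedback.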

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
