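DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel. This is a draft model; it must be used together with Qwen/Qwen3.6-35B-A3B.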
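To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative verification, the generic scheme that speculative decoding methods build on. It is not DFlash's implementation: the function name `accept_draft` and the token lists are hypothetical, and the target model's greedy predictions are passed in as plain lists rather than computed by a real model.

```python
from typing import List


def accept_draft(draft: List[int], target: List[int]) -> List[int]:
    """Greedy verification for one speculative step (illustrative only).

    draft:  tokens proposed by the draft model for positions 0..k-1.
    target: the target model's greedy tokens for positions 0..k, where
            target[i] is its choice given the prefix plus draft[:i].
    Returns the tokens committed in this step.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            # First mismatch: keep the target's own token and stop.
            accepted.append(target[i])
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target's extra token comes for free.
    accepted.append(target[-1])
    return accepted


if __name__ == "__main__":
    print(accept_draft([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
    print(accept_draft([5, 7, 9], [5, 7, 9, 4]))  # [5, 7, 9, 4]
```

Under greedy decoding the committed tokens are exactly what the target model would have produced on its own; the gain comes from committing several tokens per target forward pass.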
Install vLLM (installation is temporarily changed to use the PR build below, which adds support for interleaved SWA and ensures target hidden states are handled correctly for best performance):

    uv pip install vllm
    uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"

Install SGLang:

    uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"

Launch with vLLM:

    vllm serve Qwen/Qwen3.6-35B-A3B \
      --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
      --attention-backend flash_attn \
      --max-num-batched-tokens 32768

Launch with SGLang:

    # Optional: enable schedule overlapping (experimental, may not be stable)
    # export SGLANG_ENABLE_SPEC_V2=1
    # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
    # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
    python -m sglang.launch_server \
      --model-path Qwen/Qwen3.6-35B-A3B \
      --speculative-algorithm DFLASH \
      --speculative-draft-model-path z-lab/Qwen3.6-35B-A3B-DFlash \
      --speculative-num-draft-tokens 16 \
      --tp-size 1 \
      --attention-backend fa3 \
      --mem-fraction-static 0.75 \
      --mamba-scheduler-strategy extra_buffer \
      --trust-remote-code

Tip: for long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the draft model.
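The `--speculative-config` value in the vLLM launch command is a JSON string, which is easy to mistype inside shell quotes. The sketch below builds the same command from a Python dict. The helper name `build_vllm_command` is hypothetical; the flags and values are copied from the vLLM command above, and nothing else about the vLLM CLI is assumed.

```python
import json
import shlex
from typing import List


def build_vllm_command(num_speculative_tokens: int = 15) -> List[str]:
    """Return the vLLM launch command as an argument list (illustrative helper)."""
    spec_config = {
        "method": "dflash",
        "model": "z-lab/Qwen3.6-35B-A3B-DFlash",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", "Qwen/Qwen3.6-35B-A3B",
        # json.dumps guarantees valid JSON with double-quoted keys and strings.
        "--speculative-config", json.dumps(spec_config),
        "--attention-backend", "flash_attn",
        "--max-num-batched-tokens", "32768",
    ]


if __name__ == "__main__":
    # shlex.join adds the shell quoting needed around the JSON argument.
    print(shlex.join(build_vllm_command()))
```

The argument list can be passed directly to `subprocess.run`, which avoids shell quoting altogether.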
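Once the server is running, query it through the OpenAI-compatible API:

    from openai import OpenAI

    # SGLang serves on port 30000 by default; for vLLM, use port 8000 unless --port is set.
    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Qwen/Qwen3.6-35B-A3B",
        messages=[{"role": "user", "content": "Write a quicksort in Python."}],
        max_tokens=4096,
        temperature=0.0,
    )
    print(response.choices[0].message.content)

Test environment: a single NVIDIA B200, SGLang, thinking mode enabled, maximum output length 4096. We report end-to-end throughput, including prefill time. See our GitHub repository for reproduction scripts.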
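End-to-end throughput here means generated tokens divided by total wall-clock time, prefill included. A minimal sketch of that calculation for a single request follows. `tokens_per_second` and `timed_call` are hypothetical helpers, the request is replaced by a stand-in function, and this is not the authors' benchmark script.

```python
import time
from typing import Callable, Tuple


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time (prefill included)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed_call(generate: Callable[[], int]) -> Tuple[int, float]:
    """Run generate() and return (completion_tokens, wall-clock seconds)."""
    start = time.perf_counter()
    n_tokens = generate()
    return n_tokens, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for a real request. With the OpenAI client, generate() would
    # send the request and return response.usage.completion_tokens.
    def fake_generate() -> int:
        time.sleep(0.01)
        return 64

    n, dt = timed_call(fake_generate)
    print(f"{tokens_per_second(n, dt):.1f} tokens/s")
```

Because the timer wraps the whole request, prompt processing counts against the result, which is why these numbers are lower than decode-only throughput would be.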
DFlash achieves up to a 2.9x speedup at concurrency 1.

Tokens per second (speedup over the autoregressive (AR) baseline in parentheses)

Block size = 16

| Task | Concurrency | AR | DFlash |
|---|---|---|---|
| Math500 | 1 | 234 | 682 (2.9x) |
| Math500 | 8 | 1266 | 3138 (2.5x) |
| Math500 | 16 | 1954 | 4813 (2.5x) |
| Math500 | 32 | 2755 | 6520 (2.4x) |
| GSM8K | 1 | 235 | 556 (2.4x) |
| GSM8K | 8 | 1236 | 2564 (2.1x) |
| GSM8K | 16 | 1886 | 3821 (2.0x) |
| GSM8K | 32 | 2699 | 5239 (1.9x) |
| HumanEval | 1 | 238 | 603 (2.5x) |
| HumanEval | 8 | 1255 | 2800 (2.2x) |
| HumanEval | 16 | 1944 | 4208 (2.2x) |
| HumanEval | 32 | 2767 | 5782 (2.1x) |
| MBPP | 1 | 235 | 559 (2.4x) |
| MBPP | 8 | 1224 | 2538 (2.1x) |
| MBPP | 16 | 1948 | 3816 (2.0x) |
| MBPP | 32 | 2780 | 5378 (1.9x) |
| MT-Bench | 1 | 233 | 442 (1.9x) |
| MT-Bench | 8 | 1238 | 2028 (1.6x) |
| MT-Bench | 16 | 1885 | 2997 (1.6x) |
| MT-Bench | 32 | 2633 | 4034 (1.5x) |
| Alpaca | 1 | 235 | 393 (1.7x) |
| Alpaca | 8 | 1221 | 1782 (1.5x) |
| Alpaca | 16 | 1844 | 2567 (1.4x) |
| Alpaca | 32 | 2579 | 3689 (1.4x) |

Block size = 8

| Task | Concurrency | AR | DFlash |
|---|---|---|---|
| Math500 | 1 | 234 | 617 (2.6x) |
| Math500 | 8 | 1266 | 2839 (2.2x) |
| Math500 | 16 | 1954 | 4465 (2.3x) |
| Math500 | 32 | 2755 | 6614 (2.4x) |
| GSM8K | 1 | 235 | 540 (2.3x) |
| GSM8K | 8 | 1236 | 2466 (2.0x) |
| GSM8K | 16 | 1886 | 3899 (2.1x) |
| GSM8K | 32 | 2699 | 5713 (2.1x) |
| HumanEval | 1 | 238 | 561 (2.4x) |
| HumanEval | 8 | 1255 | 2655 (2.1x) |
| HumanEval | 16 | 1944 | 4135 (2.1x) |
| HumanEval | 32 | 2767 | 6059 (2.2x) |
| MBPP | 1 | 235 | 497 (2.1x) |
| MBPP | 8 | 1224 | 2324 (1.9x) |
| MBPP | 16 | 1948 | 3636 (1.9x) |
| MBPP | 32 | 2780 | 4884 (1.8x) |
| MT-Bench | 1 | 233 | 438 (1.9x) |
| MT-Bench | 8 | 1238 | 2060 (1.7x) |
| MT-Bench | 16 | 1885 | 3182 (1.7x) |
| MT-Bench | 32 | 2633 | 4720 (1.8x) |
| Alpaca | 1 | 235 | 407 (1.7x) |
| Alpaca | 8 | 1221 | 1880 (1.5x) |
| Alpaca | 16 | 1844 | 2903 (1.6x) |
| Alpaca | 32 | 2579 | 4115 (1.6x) |
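
The speedups in parentheses are the DFlash throughput divided by the AR throughput at the same concurrency, rounded to one decimal place. A quick check against the Math500 rows of the block size 16 table (the function name `speedup` is ours, not part of any benchmark tooling):

```python
def speedup(dflash_tps: float, ar_tps: float) -> float:
    """Ratio of DFlash to autoregressive throughput, rounded like the tables."""
    if ar_tps <= 0:
        raise ValueError("ar_tps must be positive")
    return round(dflash_tps / ar_tps, 1)


# Math500, block size 16: concurrency -> (AR tokens/s, DFlash tokens/s)
math500_b16 = {1: (234, 682), 8: (1266, 3138), 16: (1954, 4813), 32: (2755, 6520)}

if __name__ == "__main__":
    for concurrency, (ar, dflash) in math500_b16.items():
        print(f"concurrency={concurrency}: {speedup(dflash, ar)}x")
```

The same function reproduces every parenthesized value in both tables, so the reported ratios are internally consistent with the raw throughput numbers.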

Acceptance length (B8 = block size 8, B16 = block size 16)

| Task | B8 | B16 |
|---|---|---|
| Math500 | 5.56 | 7.35 |
| GSM8K | 5.21 | 6.73 |
| HumanEval | 5.09 | 6.44 |
| MBPP | 4.78 | 5.83 |
| MT-Bench | 4.20 | 5.14 |
| Alpaca | 3.94 | 4.62 |

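
One way to read this table is to compare each value with its block size. The sketch below does that under two assumptions that the text does not state outright: that B8 and B16 denote block sizes 8 and 16, and that each value is the mean number of tokens committed per draft-and-verify step. The helper name `accepted_fraction` is hypothetical.

```python
from typing import Dict, Tuple

# Task -> (B8 value, B16 value), copied from the table above.
ACCEPTANCE: Dict[str, Tuple[float, float]] = {
    "Math500": (5.56, 7.35),
    "GSM8K": (5.21, 6.73),
    "HumanEval": (5.09, 6.44),
    "MBPP": (4.78, 5.83),
    "MT-Bench": (4.20, 5.14),
    "Alpaca": (3.94, 4.62),
}


def accepted_fraction(acceptance_length: float, block_size: int) -> float:
    """Share of a block that is committed per step (assumed interpretation)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return acceptance_length / block_size


if __name__ == "__main__":
    for task, (b8, b16) in ACCEPTANCE.items():
        print(f"{task}: B8 {accepted_fraction(b8, 8):.2f}, B16 {accepted_fraction(b16, 16):.2f}")
```

Under this reading, the larger block commits more tokens per step but uses a smaller share of each block. That would be consistent with block size 8 matching or beating block size 16 on most tasks at concurrency 32 in the throughput tables, where wasted draft work is more costly.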
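Special thanks to David Wang for his outstanding engineering support on this project. We also thank Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.
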
If you find DFlash useful, please cite our work. To share feedback on DFlash or request support for a new model, please fill out this form: DFlash Feedback.

    @article{chen2026dflash,
      title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
      author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
      journal = {arXiv preprint arXiv:2602.06036},
      year    = {2026}
    }