LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

[📑 Technical Report] [🌐 Code Repository] [🤗 FP8 Version]

AGI Research Center, Inclusion AI

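Model Capabilities

LLaDA2.0-Uni is a unified diffusion large language model (dLLM) built on a Mixture-of-Experts (MoE) architecture. It integrates multimodal understanding and generation in a single model and supports: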

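🖼️ Text-to-image generation: high-fidelity image synthesis with an optional thinking/reasoning pass.
🔍 Image understanding: visual question answering, image captioning, document understanding, and more.
✏️ Image editing: instruction-based editing with one or more reference images.
🎨 Interleaved generation and reasoning: preliminary support for interleaved generation, which enables more advanced interleaved reasoning.
⚡ Inference acceleration: faster inference through KV cache reuse and adaptive unmasking.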

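Model Architecture

Unified dLLM-MoE backbone: unifies multimodal understanding and generation under a single masked-token-prediction objective.
Discrete semantic tokenizer: uses SigLIP-VQ to convert visual inputs into discrete semantic tokens, which significantly improves multimodal understanding.
Efficient diffusion decoder: pairs the discrete tokens with a dedicated diffusion decoder for high-fidelity generation; a distilled variant supports fast 8-step inference.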

Evaluation Results

Quick Start

Note: full installation instructions and command-line scripts are available in the GitHub repository.

⚙️ Installation

1. Create a conda environment

git clone https://github.com/inclusionAI/LLaDA2.0-Uni && cd LLaDA2.0-Uni
conda create -n llada2_uni python=3.10 -y
conda activate llada2_uni

2. Install PyTorch (CUDA 12.4)

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

3. Install Flash Attention 2 (required for efficient inference)

pip install flash-attn --no-build-isolation

4. Install the remaining dependencies

pip install -r requirements.txt
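
Before loading the model, it can help to confirm that the packages installed above are importable. The following is a minimal sketch that uses only the Python standard library. The helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions and not part of the repository; in particular, `transformers` is assumed to be provided by requirements.txt.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of modules the quick-start snippets import (hypothetical, adjust as needed).
REQUIRED = ["torch", "torchvision", "flash_attn", "transformers"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only verifies that each module can be located; it does not confirm that the CUDA build of PyTorch matches the installed driver.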

🌟 Text-to-Image Generation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Generate image tokens
result = model.generate_image(
    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
    image_h=1024, image_w=1024,
    steps=8, cfg_scale=2.0,
)

# Decode to PIL image (default: 50-step ODE)
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
image.save("output.png")

[!Note] 💡 Faster decoding: the distilled decoder (decoder-turbo) decodes images about 10 times faster (8 steps instead of 50) with minimal quality loss:
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)

🌟 Text-to-Image Generation with Chain-of-Thought

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Generate image tokens with thinking process
result = model.generate_image(
    "A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
    image_h=1024, image_w=1024,
    mode="thinking",
    steps=8, cfg_scale=2.0,
    thinking_steps=32, thinking_gen_length=4096,
)

# Print thinking trace
print("Thinking:", result["thinking"])

# Decode to PIL image
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output_thinking.png")

🌟 Image Understanding

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.smart_img_process import smart_resize_images

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Encode image to discrete tokens
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]

# Understand the image
response = model.understand_image(
    image_tokens, h, w,
    question="Describe this image in detail.",
    steps=32, gen_length=2048,
)
print(response)
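
The snippet above adds `model.config.image_token_offset` to every id returned by the image tokenizer. A plausible reading is that the offset moves codebook indices into a dedicated range of the model's shared vocabulary so they do not collide with text token ids. The sketch below illustrates that shift with made-up numbers; the offset value, the helper names, and the inverse mapping are assumptions for illustration only and are not taken from the repository.

```python
def to_vocab_ids(codebook_ids, image_token_offset):
    """Shift VQ codebook indices into the (assumed) image range of the vocabulary."""
    return [i + image_token_offset for i in codebook_ids]


def to_codebook_ids(vocab_ids, image_token_offset):
    """Inverse shift, shown only to illustrate that the mapping is reversible."""
    return [i - image_token_offset for i in vocab_ids]


offset = 150_000            # hypothetical value; read the real one from model.config
codes = [0, 17, 4095]       # hypothetical codebook indices from the image tokenizer
shifted = to_vocab_ids(codes, offset)
print(shifted)              # [150000, 150017, 154095]
```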

🌟 Image Editing

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.utils import generate_crop_size_list, var_center_crop
from decoder import decode_vq_tokens
from PIL import Image

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Encode source image
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
crop_size_list = generate_crop_size_list((512 // 32) ** 2, 32)
pil_image = var_center_crop(Image.open("./assets/edit_example.png").convert("RGB"), crop_size_list=crop_size_list)
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]

# Edit the image
result = model.edit_image(
    image_tokens, h, w,
    instruction="Change the background to a beach.",
    steps=8, cfg_text_scale=4.0,
)

# Decode to PIL image
edited_image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
edited_image.save("edited.png")

🌟 SPRINT Acceleration

SPRINT accelerates inference by combining three techniques: KV cache reuse, adaptive unmasking, and threshold-based batch acceptance.

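KV cache reuse and pruning: the prefix KV cache is computed once during a warmup phase and can then be pruned by an importance score that combines KV attention importance with token confidence. Later denoising steps reuse the cached prefix directly, which substantially reduces computation. Modality-level keep ratios (image_keep_ratio, text_keep_ratio) give fine-grained control: for example, keeping all image and text tokens to preserve quality while still benefiting from cache reuse.
Adaptive unmasking: instead of unmasking a fixed number of tokens per step, SPRINT decides how many tokens to reveal at each step from the model's confidence. At each step it computes confidence scores with a strategy such as low_confidence, top_k_margin, or neg_entropy and commits the k most confident tokens, where k is set adaptively to ceil(remaining_masked / steps_left). Tokens at easy positions are resolved quickly, and computation is concentrated on the harder ones.
Batch acceptance: on top of the adaptive schedule, every token whose probability exceeds threshold is accepted in the same step, further reducing the number of denoising iterations.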
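
The adaptive unmasking and batch acceptance rules above can be sketched in a few lines of plain Python. This is a simplified illustration, not the model's implementation: the three scoring functions are assumed from their names (maximum probability, top-1 minus top-2 margin, and negative entropy), and `select_positions` with its arguments is a hypothetical helper.

```python
import math


def low_confidence(probs):
    """Assumed definition: confidence is the probability of the most likely token."""
    return max(probs)


def top_k_margin(probs):
    """Assumed definition: gap between the two most likely tokens."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def neg_entropy(probs):
    """Assumed definition: negative Shannon entropy (higher means more peaked)."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(dists, masked, steps_left, threshold=None, score=low_confidence):
    """Choose which masked positions to reveal at the current step.

    dists maps each position to its predicted probability distribution.
    """
    # Adaptive budget: spread the remaining masked tokens over the remaining steps.
    k = math.ceil(len(masked) / steps_left)
    ranked = sorted(masked, key=lambda pos: score(dists[pos]), reverse=True)
    chosen = set(ranked[:k])
    # Batch acceptance: also take every position whose top probability exceeds threshold.
    if threshold is not None:
        chosen |= {pos for pos in masked if max(dists[pos]) > threshold}
    return sorted(chosen)


dists = {
    0: [0.97, 0.02, 0.01],
    1: [0.50, 0.30, 0.20],
    2: [0.95, 0.03, 0.02],
    3: [0.40, 0.35, 0.25],
}
print(select_positions(dists, [0, 1, 2, 3], steps_left=4))                  # [0]
print(select_positions(dists, [0, 1, 2, 3], steps_left=4, threshold=0.93))  # [0, 2]
```

With four masked positions and four steps left, the budget is one token per step, so only position 0 is revealed. Setting the threshold to 0.93 additionally accepts position 2 in the same step, which is how batch acceptance shortens the schedule.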

Image understanding with SPRINT:

response = model.understand_image(
    image_tokens, h, w,
    question="Describe this image in detail.",
    steps=32, gen_length=4096,
    use_sprint=True,
    threshold=0.93,
    keep_ratio=0.5,
    cache_warmup_steps=1,
    image_keep_ratio=1.0,
    text_keep_ratio=1.0,
)
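
The modality-level keep ratios passed in the call above can be pictured as a per-modality top-k selection over the cached prefix. The sketch below assumes an importance score is already available for each prefix position and keeps the highest-scoring fraction separately for image and text tokens. The function `prune_prefix`, its inputs, and the ceiling rounding are assumptions for illustration; how the global `keep_ratio` interacts with the per-modality ratios is not specified here and is not modeled.

```python
import math


def prune_prefix(importance, modality, image_keep_ratio=1.0, text_keep_ratio=1.0):
    """Return the sorted indices of prefix positions to keep in the KV cache.

    importance: one score per prefix position (higher means more important).
    modality:   "image" or "text" for each prefix position.
    """
    ratios = {"image": image_keep_ratio, "text": text_keep_ratio}
    keep = []
    for kind, ratio in ratios.items():
        idx = [i for i, m in enumerate(modality) if m == kind]
        n_keep = math.ceil(ratio * len(idx))
        # Keep the n_keep most important positions of this modality.
        idx.sort(key=lambda i: importance[i], reverse=True)
        keep.extend(idx[:n_keep])
    return sorted(keep)


importance = [0.9, 0.1, 0.5, 0.7, 0.2, 0.3]
modality = ["image", "image", "image", "text", "text", "text"]

print(prune_prefix(importance, modality))                        # [0, 1, 2, 3, 4, 5]
print(prune_prefix(importance, modality, image_keep_ratio=0.5))  # [0, 2, 3, 4, 5]
```

With both ratios at 1.0, as in the call above, nothing is pruned and the speedup comes from cache reuse alone.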

Text-to-image generation with SPRINT:

result = model.generate_image(
    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
    image_h=1024, image_w=1024,
    cfg_scale=2.0,
    use_sprint=True,
    block_length=32,
    steps=8,
    keep_ratio=0.5,
    cache_warmup_steps=1,
)

[!Note] SPRINT supports the Simple CFG and no-CFG modes. With Editing CFG (three-way guidance via cfg_text_scale / cfg_image_scale), SPRINT automatically falls back to the baseline mode.
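
For readers unfamiliar with the two guidance modes named in the note, the sketch below shows the standard classifier-free guidance combination and one common three-way form used for instruction-based editing (as in InstructPix2Pix). The exact formulation used by LLaDA2.0-Uni is not given in this document, so both functions are illustrative assumptions that operate on plain lists of logits.

```python
def simple_cfg(cond, uncond, scale):
    """Standard classifier-free guidance: move from uncond toward cond by scale."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def editing_cfg(full, image_only, uncond, text_scale, image_scale):
    """One common three-way guidance form (assumed, InstructPix2Pix style).

    full:       logits conditioned on both the source image and the instruction
    image_only: logits conditioned on the source image only
    uncond:     logits with no conditioning
    """
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for f, i, u in zip(full, image_only, uncond)
    ]


print(simple_cfg([2.0, 0.0], [1.0, 0.0], scale=2.0))             # [3.0, 0.0]
print(editing_cfg([3.0], [2.0], [1.0], text_scale=4.0, image_scale=1.0))  # [6.0]
```

Simple guidance combines two predictions per step, while the three-way form combines three.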

Repository Structure

LLaDA2.0-Uni/
├── config.json                          # Model configuration
├── modeling_llada2uni_moe.py            # Model implementation (trust_remote_code)
├── configuration_llada2uni_moe.py       # Config class
├── tokenizer.json                       # Tokenizer
├── model-00001-of-00013.safetensors     # MoE backbone weights (sharded, bf16)
├── ...
├── model-00013-of-00013.safetensors
├── model.safetensors.index.json
├── image_tokenizer/
│   ├── config.json
│   ├── image_tokenizer.safetensors      # SigLIP-VQ encoder
│   ├── sigvq_embedding.pt               # SigVQ embedding + projector
│   └── preprocessor_config.json
├── decoder/
│   ├── config.json
│   └── decoder_model.safetensors        # Diffusion decoder (bf16, 12GB)
├── decoder-turbo/
│   ├── config.json
│   └── decoder_model.safetensors        # Distilled few-step decoder (bf16, 12GB)
└── vae/
    ├── config.json
    └── diffusion_pytorch_model.safetensors

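Hardware Requirements

Component	GPU Memory
MoE backbone (bf16, 16B total)	~32 GB
Diffusion decoder (bf16, 6.2B)	~12 GB
VAE + SigVQ + tokenizer	~3 GB
Total (generation/editing)	~47 GB
Total (understanding only)	~35 GB

💡 Although only about 1B parameters are active per token during inference, all 16B MoE parameters must be loaded into memory. The diffusion decoder is needed only for image generation and editing and is released afterward.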
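
The memory figures in the table follow from the parameter counts at 2 bytes per bf16 parameter. The short calculation below reproduces them as a sanity check. It counts weights only, uses decimal gigabytes, and takes the ~3 GB figure for the VAE, SigVQ, and tokenizer directly from the table; the helper name `weights_gb` is an illustrative assumption.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Approximate weight memory in decimal GB (bf16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


backbone = weights_gb(16e9)    # MoE backbone, 16B parameters
decoder = weights_gb(6.2e9)    # diffusion decoder, 6.2B parameters
other = 3.0                    # VAE + SigVQ + tokenizer, taken from the table

print(f"backbone: ~{backbone:.0f} GB")                                   # ~32 GB
print(f"decoder: ~{decoder:.0f} GB")                                     # ~12 GB
print(f"generation/editing total: ~{backbone + decoder + other:.0f} GB")  # ~47 GB
print(f"understanding only total: ~{backbone + other:.0f} GB")           # ~35 GB
```

Activations and the KV cache add to these totals, so the table is best read as a lower bound on the GPU memory to provision.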

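🚀 SGLang Support (Coming Soon)

We are working on SGLang integration for high-throughput serving and optimized inference. Stay tuned!

⚠️ License

This project is licensed under the Apache License 2.0.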

📖 BibTeX

@article{LLaDA2Uni,
title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
author = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
journal = {arXiv preprint arXiv:2604.20796},
year = {2026}
}