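LLaDA2.0-Uni is a unified diffusion large language model (dLLM) built on a Mixture-of-Experts (MoE) architecture. It integrates multimodal understanding and generation in a single model. Supported capabilities include text-to-image generation (with an optional thinking mode), image understanding, and image editing.
Note: full installation instructions and command-line scripts are available in the GitHub repository.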
git clone https://github.com/inclusionAI/LLaDA2.0-Uni && cd LLaDA2.0-Uni
conda create -n llada2_uni python=3.10 -y
conda activate llada2_uni
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn --no-build-isolation
pip install -r requirements.txt

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Generate image tokens
result = model.generate_image(
"A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
image_h=1024, image_w=1024,
steps=8, cfg_scale=2.0,
)
# Decode to PIL image (default: 50-step ODE)
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
image.save("output.png")

[!Note] 💡 Faster decoding: decoder-turbo (the distilled decoder) decodes images roughly 10x faster (8 steps instead of 50) with minimal quality loss:
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)

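To check the roughly 10x figure on your own hardware, one option is to time both decode modes. The helper below is a minimal sketch and not part of the repository: `timed` and `speedup` are hypothetical names, and the commented usage assumes `result` and `model_path` from the text-to-image example.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T], sync: Optional[Callable[[], None]] = None) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed_seconds).

    sync is an optional callable (for example torch.cuda.synchronize) that is
    called before and after fn so pending GPU work is included in the timing.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    out = fn()
    if sync is not None:
        sync()
    return out, time.perf_counter() - start


def speedup(baseline_s: float, fast_s: float) -> float:
    """Ratio of baseline time to the faster time."""
    if fast_s <= 0:
        raise ValueError("fast_s must be positive")
    return baseline_s / fast_s


# Hypothetical usage (requires the model weights and a GPU):
#   args = (result["token_ids"], result["h"], result["w"], model_path, "cuda")
#   _, slow = timed(lambda: decode_vq_tokens(*args), sync=torch.cuda.synchronize)
#   _, fast = timed(
#       lambda: decode_vq_tokens(*args, num_steps=8, decode_mode="decoder-turbo"),
#       sync=torch.cuda.synchronize,
#   )
#   print(f"speedup: {speedup(slow, fast):.1f}x")
```

A single run includes one-off costs such as loading decoder weights, so averaging over a few runs after a warm-up call gives a fairer comparison.
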
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Generate image tokens with thinking process
result = model.generate_image(
"A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
image_h=1024, image_w=1024,
mode="thinking",
steps=8, cfg_scale=2.0,
thinking_steps=32, thinking_gen_length=4096,
)
# Print thinking trace
print("Thinking:", result["thinking"])
# Decode to PIL image
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda", num_steps=8, decode_mode="decoder-turbo",)
image.save("output_thinking.png")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.smart_img_process import smart_resize_images
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Encode image to discrete tokens
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]
# Understand the image
response = model.understand_image(
image_tokens, h, w,
question="Describe this image in detail.",
steps=32, gen_length=2048,
)
print(response)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.utils import generate_crop_size_list, var_center_crop
from decoder import decode_vq_tokens
from PIL import Image
model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer
# Encode source image
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
crop_size_list = generate_crop_size_list((512 // 32) ** 2, 32)
pil_image = var_center_crop(Image.open("./assets/edit_example.png").convert("RGB"), crop_size_list=crop_size_list)
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]
# Edit the image
result = model.edit_image(
image_tokens, h, w,
instruction="Change the background to a beach.",
steps=8, cfg_text_scale=4.0,
)
# Decode to PIL image
edited_image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda", num_steps=8, decode_mode="decoder-turbo",)
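edited_image.save("edited.png")

SPRINT accelerates inference by combining three techniques: KV cache reuse, adaptive unmasking, and threshold-based batch acceptance.

- KV cache reuse: per-modality keep ratios (`image_keep_ratio`, `text_keep_ratio`) allow fine-grained control. For example, you can keep all image/text tokens to preserve quality and still benefit from cache reuse.
- Adaptive unmasking: confidence scores are computed with a strategy such as `low_confidence`, `top_k_margin`, or `neg_entropy`, and the top-k most confident tokens are committed, with k set adaptively to `ceil(remaining_masked / steps_left)`. Easy positions are resolved quickly, and compute is concentrated on the harder tokens.
- Threshold-based batch acceptance: tokens whose confidence exceeds `threshold` are accepted in a batch, further reducing the number of denoising iterations.

Image understanding with SPRINT: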
response = model.understand_image(
image_tokens, h, w,
question="Describe this image in detail.",
steps=32, gen_length=4096,
use_sprint=True,
threshold=0.93,
keep_ratio=0.5,
cache_warmup_steps=1,
image_keep_ratio=1.0,
text_keep_ratio=1.0,
)

Text-to-image with SPRINT:
result = model.generate_image(
"A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
image_h=1024, image_w=1024,
cfg_scale=2.0,
use_sprint=True,
block_length=32,
steps=8,
keep_ratio=0.5,
cache_warmup_steps=1,
)

[!Note] SPRINT supports the Simple CFG and no-CFG modes. With Editing CFG (three-way guidance via cfg_text_scale / cfg_image_scale), SPRINT automatically falls back to the baseline mode.

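The adaptive unmasking and threshold acceptance rules described for SPRINT can be illustrated with a small, framework-free sketch. This is not the model's implementation: the function names are hypothetical, the three scoring formulas are common definitions assumed to match the strategy names, and taking the union of the top-k set and the above-threshold set is also an assumption.

```python
import math
from typing import Dict, List, Sequence


def confidence(probs: Sequence[float], strategy: str = "low_confidence") -> float:
    """Score one masked position from its predicted token distribution.

    The formulas below are assumed definitions for the strategy names.
    """
    ranked = sorted(probs, reverse=True)
    if strategy == "low_confidence":
        return ranked[0]  # probability of the most likely token
    if strategy == "top_k_margin":
        # gap between the best and the runner-up token
        return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    if strategy == "neg_entropy":
        # negative entropy: closer to 0 means a more peaked distribution
        return sum(p * math.log(p) for p in probs if p > 0.0)
    raise ValueError(f"unknown strategy: {strategy}")


def select_to_unmask(scores: Dict[int, float], steps_left: int, threshold: float) -> List[int]:
    """Pick the masked positions to commit in this denoising step.

    scores maps each still-masked position to its confidence score.
    """
    if steps_left < 1:
        raise ValueError("steps_left must be at least 1")
    # Adaptive k: spread the remaining masked tokens over the remaining steps.
    k = math.ceil(len(scores) / steps_left)
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = ranked[:k]
    # Batch acceptance: also commit every position at or above the threshold.
    above = [pos for pos in ranked if scores[pos] >= threshold]
    return sorted(set(top_k) | set(above))


scores = {0: 0.99, 1: 0.95, 2: 0.40, 3: 0.10}
print(select_to_unmask(scores, steps_left=4, threshold=0.93))
```

With four masked positions and four steps left, k is 1, but positions 0 and 1 both clear the 0.93 threshold, so two tokens are committed in one step. Note that a threshold such as 0.93 only makes sense for probability-like scores; `neg_entropy` scores are non-positive and would need a different scale.
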
LLaDA2-Uni/
├── config.json # Model configuration
├── modeling_llada2uni_moe.py # Model implementation (trust_remote_code)
├── configuration_llada2uni_moe.py # Config class
├── tokenizer.json # Tokenizer
├── model-00001-of-00013.safetensors # MoE backbone weights (sharded, bf16)
├── ...
├── model-00013-of-00013.safetensors
├── model.safetensors.index.json
├── image_tokenizer/
│ ├── config.json
│ ├── image_tokenizer.safetensors # SigLIP-VQ encoder
│ ├── sigvq_embedding.pt # SigVQ embedding + projector
│ └── preprocessor_config.json
├── decoder/
│ ├── config.json
│ └── decoder_model.safetensors # Diffusion decoder (bf16, 12GB)
├── decoder-turbo/
│ ├── config.json
│ └── decoder_model.safetensors # Distilled few-step decoder (bf16, 12GB)
└── vae/
    ├── config.json
    └── diffusion_pytorch_model.safetensors

| Component | GPU memory |
|---|---|
| MoE backbone (bf16, 16B total) | ~32 GB |
| Diffusion decoder (bf16, 6.2B) | ~12 GB |
| VAE + SigVQ + tokenizer | ~3 GB |
| Total (generation/editing) | ~47 GB |
| Total (understanding only) | ~35 GB |
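💡 Although only about 1B parameters are active per token during inference, all 16B MoE parameters must be loaded into memory. The diffusion decoder is needed only for image generation/editing and is released afterwards.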
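The table's figures follow from bf16 storing each parameter in 2 bytes. The snippet below is a back-of-the-envelope check, assuming 1 GB = 1e9 bytes and taking the ~3 GB auxiliary figure directly from the table; `weight_gb` is a hypothetical helper, and activations and caches are not counted.

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB; bf16 uses 2 bytes per parameter."""
    return params_billion * bytes_per_param


backbone = weight_gb(16.0)  # MoE backbone, 16B parameters in total
decoder = weight_gb(6.2)    # diffusion decoder, 6.2B parameters
aux = 3.0                   # VAE + SigVQ + tokenizer, taken from the table

generation_total = backbone + decoder + aux
understanding_total = backbone + aux

print(f"backbone: ~{backbone:.0f} GB, decoder: ~{decoder:.0f} GB")
print(f"generation/editing: ~{generation_total:.0f} GB")
print(f"understanding only: ~{understanding_total:.0f} GB")
```

The number of active parameters per token affects compute, not this footprint, which is why the full 16B appears in the estimate.
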
We are working on SGLang integration for high-throughput serving and optimized inference. Stay tuned!
This project is licensed under the Apache License 2.0.
@article{LLaDA2Uni,
title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
author = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
journal = {arXiv preprint arXiv:2604.20796},
year = {2026}
}