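LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

FP8-Quantized Version of LLaDA2.0-Uni

AGI Research Center, Inclusion AI

Overview

This project is an FP8-quantized version of LLaDA2.0-Uni in which the MoE routed expert weights are quantized to FP8 block by block. This cuts GPU memory usage at model load time by about 48% (from 62.9 GB to 32.5 GB) while preserving output quality.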

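Quantization Details

Method: block-wise FP8 (float8_e4m3fn) with one scaling factor per block
Block size: 128×128
Quantized layers: MoE routed expert weights (gate_proj, up_proj, down_proj)
Kept in BF16: embeddings, lm_head, attention projections, shared experts, layer norms, router gates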
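
The block-wise scheme described above can be illustrated with a small NumPy sketch. This is not the project's quantization code: the helper names (`round_to_e4m3`, `quantize_blockwise`, `dequantize_blockwise`) are hypothetical, the FP8 rounding is simulated in floating point rather than using a real FP8 dtype, and choosing each scale as the block's absolute maximum divided by 448 (the largest finite float8_e4m3fn value) is an assumption about how the scales are derived.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn
BLOCK = 128           # block size from the quantization details


def round_to_e4m3(x):
    """Simulate rounding onto the float8_e4m3fn grid (3 mantissa bits)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Exponent of each value; magnitudes below 2**-6 share the subnormal spacing.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # gap between neighbouring representable values
    return (np.round(x / step) * step).astype(np.float32)


def quantize_blockwise(w, block=BLOCK):
    """Return (q, scales): simulated FP8 values plus one scale per block."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0, "sketch assumes divisible shapes"
    # View the matrix as a grid of (block x block) tiles.
    tiles = w.astype(np.float64).reshape(rows // block, block, cols // block, block)
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    # Assumed scale rule: map each tile's largest magnitude to 448.
    scales = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    q = round_to_e4m3(tiles / scales).reshape(rows, cols)
    return q, scales.reshape(rows // block, cols // block).astype(np.float32)


def dequantize_blockwise(q, scales, block=BLOCK):
    """Expand each per-block scale over its tile and rescale."""
    full = np.repeat(np.repeat(scales, block, axis=0), block, axis=1)
    return q * full


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 384)).astype(np.float32)
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales)
print(scales.shape)  # (2, 3): one scale per 128x128 block
print(float(np.abs(w - w_hat).max() / np.abs(w).max()))  # worst-case relative error
```

Keeping one scale per 128×128 block rather than one per tensor means an outlier weight only coarsens the quantization grid of its own block, which is the usual motivation for block-wise schemes.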

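Memory Comparison

Version	Model load	Text-to-image peak	Understanding peak	Editing peak
BF16	62.9 GB	35.3 GB	33.2 GB	41.7 GB
FP8	32.5 GB	35.3 GB	33.3 GB	41.8 GB

Note: FP8 roughly halves the memory taken by static model weights (saving about 30 GB at load time). Peak memory during inference is nearly unchanged because activations dominate during generation.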
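
The load-time figures can be checked with a few lines of arithmetic. The sketch below uses only the model-load numbers from the table. The second half is a back-of-the-envelope estimate, not a reported result: it assumes every quantized weight shrinks from 2 bytes (BF16) to 1 byte (FP8) and that each 128×128 block stores one 4-byte float32 scale, and the scale dtype is an assumption.

```python
# Model-load figures from the memory comparison table (GB).
BF16_LOAD_GB = 62.9
FP8_LOAD_GB = 32.5

saved_gb = BF16_LOAD_GB - FP8_LOAD_GB
reduction = saved_gb / BF16_LOAD_GB
print(f"saved {saved_gb:.1f} GB ({reduction:.1%})")  # saved 30.4 GB (48.3%)

# Assumption: one float32 scale (4 bytes) per 128x128 block of FP8 weights.
BLOCK_ELEMS = 128 * 128
scale_overhead = 4 / BLOCK_ELEMS  # extra bytes per quantized weight
bytes_saved_per_weight = 2 - (1 + scale_overhead)

# Implied BF16 footprint of the quantized (routed expert) weights, under the
# assumptions above; the remainder would be the layers kept in BF16.
quantized_bf16_gb = saved_gb * 2 / bytes_saved_per_weight
kept_bf16_gb = BF16_LOAD_GB - quantized_bf16_gb
print(f"scale overhead per weight: {scale_overhead:.6f} bytes")
print(f"implied routed-expert share: {quantized_bf16_gb / BF16_LOAD_GB:.1%}")
print(f"implied BF16 remainder: {kept_bf16_gb:.1f} GB")
```

Under these assumptions the per-block scales add well under 0.1% to the quantized weights, so the load-time saving is driven almost entirely by how much of the model sits in the routed experts.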

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "inclusionAI/LLaDA2.0-Uni-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Text-to-Image Generation
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)

# Decode VQ tokens to image
from decoder import decode_vq_tokens
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")

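Model Capabilities

Same as the base model LLaDA2.0-Uni:

🖼️ Text-to-image generation
🔍 Image understanding
✏️ Image editing
⚡ Accelerated inference

⚠️ License

This project is licensed under the Apache License 2.0.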

📖 BibTeX

@article{LLaDA2Uni,
title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
author = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
journal = {arXiv preprint arXiv:2604.20796},
year = {2026}
}
