AGI Research Center, Inclusion AI
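This project is the FP8-quantized version of LLaDA2.0-Uni, with block-wise FP8 quantization applied to the MoE expert weights. This reduces GPU memory usage at model load by about 48% while preserving output quality.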
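To make "block-wise FP8 quantization" concrete, here is a minimal pure-Python sketch of the general technique: the weight matrix is split into tiles, each tile gets its own scale so that its largest magnitude maps to the FP8 maximum, and values are rounded to the nearest FP8-representable number. Several things here are assumptions rather than details from this project: the E4M3 format (maximum 448), the tiny 2x2 block size, and the helper names `round_to_e4m3`, `quantize_blockwise`, and `dequantize_blockwise`. The block size and FP8 variant actually used for this checkpoint are not stated here, and real implementations store true FP8 tensors rather than emulating them with Python floats.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def round_to_e4m3(x: float) -> float:
    """Emulate rounding a float to the nearest FP8 E4M3 value."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, -6)      # -6 is the smallest normal exponent in E4M3
    step = 2.0 ** (exp - 3)   # 3 mantissa bits -> 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_blockwise(weight, block=2):
    """Quantize a 2-D list of floats with one scale per (block x block) tile."""
    rows, cols = len(weight), len(weight[0])
    q = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tile = [(r, c)
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            amax = max(abs(weight[r][c]) for r, c in tile)
            # Map the tile's largest magnitude onto the FP8 maximum.
            scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
            scales[(r0 // block, c0 // block)] = scale
            for r, c in tile:
                q[r][c] = round_to_e4m3(weight[r][c] / scale)
    return q, scales


def dequantize_blockwise(q, scales, block=2):
    """Recover approximate weights by multiplying each value by its tile scale."""
    return [[q[r][c] * scales[(r // block, c // block)]
             for c in range(len(q[0]))]
            for r in range(len(q))]


# Illustrative weights: the two tiles have very different magnitudes.
weight = [
    [0.13, -0.20, 8.0, 4.7],
    [0.05, 0.40, -2.0, 1.0],
]
q, scales = quantize_blockwise(weight, block=2)
restored = dequantize_blockwise(q, scales, block=2)
```

Because each tile has its own scale, the small values in the left tile are not crushed by the large values in the right tile, which is the main reason to quantize per block rather than per tensor.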
| Version | Model load | Text-to-image peak | Understanding peak | Editing peak |
|---|---|---|---|---|
| BF16 | 62.9 GB | 35.3 GB | 33.2 GB | 41.7 GB |
| FP8 | 32.5 GB | 35.3 GB | 33.3 GB | 41.8 GB |
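Note: FP8 roughly halves the memory taken by static model weights (saving about 30 GB at load time). Peak inference memory is nearly identical for both versions because activations dominate during generation.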
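The headline figures can be checked directly from the table. The short sketch below recomputes the load-time saving and the per-task peak differences; the dictionary keys are hypothetical labels chosen for illustration, while the numbers are copied from the table above.

```python
# Memory figures in GB, copied from the table above.
memory_gb = {
    "BF16": {"load": 62.9, "t2i_peak": 35.3, "understanding_peak": 33.2, "editing_peak": 41.7},
    "FP8": {"load": 32.5, "t2i_peak": 35.3, "understanding_peak": 33.3, "editing_peak": 41.8},
}

# Load-time saving in absolute and relative terms.
load_saved = memory_gb["BF16"]["load"] - memory_gb["FP8"]["load"]
load_saved_pct = 100 * load_saved / memory_gb["BF16"]["load"]
print(f"Load-time saving: {load_saved:.1f} GB ({load_saved_pct:.1f}%)")
# prints "Load-time saving: 30.4 GB (48.3%)"

# Difference in peak memory (FP8 minus BF16) for each task.
peak_delta = {
    task: round(memory_gb["FP8"][task] - memory_gb["BF16"][task], 1)
    for task in memory_gb["BF16"]
    if task != "load"
}
print(peak_delta)
# prints {'t2i_peak': 0.0, 'understanding_peak': 0.1, 'editing_peak': 0.1}
```

BF16 uses 2 bytes per value and FP8 uses 1, so a fully quantized model would save 50% at load. The observed 48.3% is consistent with the MoE expert weights making up most, but not all, of the load-time footprint.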
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "inclusionAI/LLaDA2.0-Uni-FP8"

# Load the tokenizer and the FP8 model; the model relies on custom code from the repo
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Text-to-Image Generation
result = model.generate_image(
    "A cat sitting on a windowsill at sunset",
    image_h=1024, image_w=1024,
    steps=16, cfg_scale=4.0,
)

# Decode VQ tokens to image
# Note: `decoder` is not part of transformers and must be importable separately
from decoder import decode_vq_tokens

image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"],
    model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output.png")
```

Same as the base model LLaDA2.0-Uni:
This project is licensed under the Apache License 2.0.
@article{LLaDA2Uni,
title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
author = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},
journal = {arXiv preprint arXiv:2604.20796},
year = {2026}
}