Zhipu AI/GLM-Image

GLM-Image

👋 Join our WeChat and Discord communities
📖 Read the GLM-Image technical blog and GitHub repository
📍 Use the GLM-Image API

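[Image: show_case]

Introduction

GLM-Image is an image generation model built on a hybrid architecture that pairs an autoregressive generator with a diffusion decoder. In general image generation quality it is on par with mainstream latent diffusion approaches, but it shows significant advantages in text rendering and knowledge-intensive generation. It performs especially well on tasks that require precise semantic understanding and complex information expression, while retaining strong high-fidelity, fine-grained detail generation. Beyond text-to-image generation, GLM-Image also supports a rich set of image-to-image tasks, including image editing, style transfer, identity-preserving generation, and multi-subject consistency.

Model architecture: a hybrid autoregressive + diffusion decoder design.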

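[Image: architecture_1]

  • Autoregressive generator: a 9B-parameter model initialized from GLM-4-9B-0414, with the vocabulary expanded to include visual tokens. The model first generates a compact encoding of about 256 tokens, then expands it to 1K–4K tokens, corresponding to high-resolution image outputs of 1K–2K.
  • Diffusion decoder: a 7B-parameter decoder based on a single-stream DiT architecture that decodes images in latent space. It is equipped with a Glyph Encoder text module, which significantly improves the rendering of accurate text within images.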
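
The relationship between output resolution and token count described above can be sketched with a little arithmetic. The snippet below assumes one visual token per 32×32 pixel block. That patch size is an assumption, not something the model card states; it is chosen only because it is consistent with the 1K–4K token range quoted for 1K–2K images. The function name `visual_token_count` is hypothetical.

```python
def visual_token_count(height: int, width: int, patch: int = 32) -> int:
    """Count visual tokens, ASSUMING one token per patch x patch pixel block."""
    if height % patch or width % patch:
        raise ValueError(f"height and width must be multiples of {patch}")
    return (height // patch) * (width // patch)


# Under this assumption, a 1K image maps to ~1K tokens and a 2K image to ~4K tokens.
for side in (1024, 2048):
    print(f"{side}x{side}: {visual_token_count(side, side)} tokens")
# prints "1024x1024: 1024 tokens" and "2048x2048: 4096 tokens"
```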

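[Image: architecture_2]

Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy based on the GRPO algorithm, which substantially improves semantic understanding and visual detail quality.

  • Autoregressive module: receives low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
  • Decoder module: receives high-frequency feedback targeting detail fidelity and text accuracy, producing highly realistic textures and more precise text rendering.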
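
As a rough illustration of the idea behind GRPO, the sketch below computes group-relative advantages: each sample's reward is normalized against the mean and standard deviation of its own group of samples. This is the generic GRPO normalization step, not the GLM-Image training code. The reward values and the split into two separate reward streams are hypothetical and only mirror the decoupled feedback described above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four samples generated from the same prompt.
# The two modules are scored separately, each with its own reward signal.
ar_rewards = [0.62, 0.70, 0.55, 0.81]        # e.g. aesthetics / semantic alignment
decoder_rewards = [0.90, 0.40, 0.75, 0.95]   # e.g. detail fidelity / text accuracy

ar_advantages = group_relative_advantages(ar_rewards)
decoder_advantages = group_relative_advantages(decoder_rewards)
print(ar_advantages)
print(decoder_advantages)
```

Samples that score above their group mean get a positive advantage and are reinforced; samples below the mean get a negative one.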

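GLM-Image supports both text-to-image and image-to-image generation within a single model.

  • Text-to-image: generates highly detailed images from text descriptions, with particularly strong results in information-dense scenarios.
  • Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.

The full GLM-Image model implementation is available in the transformers and diffusers libraries.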

Showcase

Text-to-Image with Dense Text and Knowledge

[Image: show_case_t2i]

Image-to-Image

[Image: show_case_i2i]

Quick Start

transformers + diffusers Pipeline

Install transformers and diffusers from source:

pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
  • Text-to-image generation
import torch
from diffusers.pipelines.glm_image import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy."
image = pipe(
    prompt=prompt,
    height=32 * 32,  # 1024 px; must be divisible by 32
    width=36 * 32,  # 1152 px; must be divisible by 32
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_t2i.png")
  • Image-to-image generation
import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
from PIL import Image

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
image_path = "cond.jpg"
prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."
image = Image.open(image_path).convert("RGB")
image = pipe(
    prompt=prompt,
    image=[image],  # pass several images, e.g. [image, image1], for multi-image-to-image generation
    height=33 * 32,  # height must be set even if it is the same as the input image
    width=32 * 32,  # width must be set even if it is the same as the input image
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_i2i.png")
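
Because height and width must always be passed explicitly and must be divisible by 32, it can help to derive them from the input image size. The helper below is a minimal sketch of that step; `snap_to_multiple` and `target_size` are hypothetical names and not part of diffusers.

```python
def snap_to_multiple(value: int, base: int = 32) -> int:
    """Round value to the nearest multiple of base (ties round up, minimum one base)."""
    return max(base, (value + base // 2) // base * base)


def target_size(input_width: int, input_height: int, base: int = 32):
    """Return (height, width) snapped to multiples of base, ready to pass to the pipeline."""
    return snap_to_multiple(input_height, base), snap_to_multiple(input_width, base)


# Example: a 1024x1050 (width x height) input image.
height, width = target_size(1024, 1050)
print(height, width)  # prints "1056 1024", i.e. 33 * 32 and 32 * 32
```

With a PIL image, the inputs would come from `image.size`, which returns `(width, height)`.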

SGLang Pipeline

Install SGLang, transformers, and diffusers from source:

pip install "sglang[diffusion] @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
  • Text-to-image generation
sglang serve --model-path zai-org/GLM-Image

curl http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-Image",
    "prompt": "a beautiful girl with glasses.",
    "n": 1,
    "response_format": "b64_json",
    "size": "1024x1024"
  }' |  python3 -c "import sys, json, base64; open('output_t2i.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"
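
The same request can be issued from Python instead of curl. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on localhost:30000 and returns the same JSON shape that the curl pipeline decodes (`data[0].b64_json`). The helper names are hypothetical.

```python
import base64
import json
import urllib.request


def build_payload(prompt, size="1024x1024", model="zai-org/GLM-Image"):
    """Build the JSON body used by the curl example above."""
    return {
        "model": model,
        "prompt": prompt,
        "n": 1,
        "response_format": "b64_json",
        "size": size,
    }


def decode_first_image(response: dict) -> bytes:
    """Extract and base64-decode the first image from the response body."""
    return base64.b64decode(response["data"][0]["b64_json"])


def generate(prompt, url="http://localhost:30000/v1/images/generations",
             out_path="output_t2i.png"):
    """POST the prompt to the server and write the returned image to out_path."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    with open(out_path, "wb") as f:
        f.write(decode_first_image(body))
    return out_path
```

Calling `generate("a beautiful girl with glasses.")` with the server running would write `output_t2i.png`, mirroring the curl command.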
  • Image-to-image generation
sglang serve --model-path zai-org/GLM-Image

curl -s -X POST "http://localhost:30000/v1/images/edits" \
-F "model=zai-org/GLM-Image" \
-F "image=@cond.jpg" \
-F "prompt=Replace the background of the snow forest with an underground station featuring an automatic escalator." \
-F "response_format=b64_json"  | python3 -c "import sys, json, base64; open('output_i2i.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"

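Notes

  • Make sure every piece of text to be rendered in the image is enclosed in quotation marks in the model input. We strongly recommend using GLM-4.7 to refine prompts for higher image quality. For details, see the script in our GitHub repository.
  • The AR model used in GLM-Image is configured by default with do_sample=True, temperature=0.9, and top_p=0.75. A higher temperature yields more diverse and richer outputs but can also reduce output stability.
  • The target image resolution must be divisible by 32; otherwise an error is raised.
  • Because inference optimizations for this architecture are currently limited, the runtime cost is still relatively high. You can set enable_model_cpu_offload=True to run with about 23GB of GPU memory, at the cost of slower inference.
  • Integration with vLLM-Omni and SGLang (with AR acceleration support) is in progress; stay tuned. For inference cost figures, see our GitHub repository.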
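
To make the sampling defaults in the notes above more concrete, the sketch below shows how temperature scaling and top-p (nucleus) filtering reshape a next-token distribution. It is a generic illustration of the two parameters, not the GLM-Image decoding code. The token names and logits are hypothetical.

```python
import math


def top_p_distribution(logits, temperature=0.9, top_p=0.75):
    """Return {token: prob} after temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Hypothetical logits over four candidate tokens.
example_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
filtered = top_p_distribution(example_logits)
print(sorted(filtered))  # prints "['a', 'b']": the low-probability tail is dropped
```

Raising the temperature spreads probability mass toward the tail, so more tokens survive the top-p cut, which matches the trade-off between diversity and stability described in the notes.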

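Model Performance

Text Rendering

| Model | Open Source | CVTG-2K Word Accuracy | CVTG-2K NED | CVTG-2K CLIPScore | LongText-Bench AVG | LongText-Bench EN | LongText-Bench ZH |
|---|---|---|---|---|---|---|---|
| Seedream 4.5 | ✗ | 0.8990 | 0.9483 | 0.8069 | 0.988 | 0.989 | 0.987 |
| Seedream 4.0 | ✗ | 0.8451 | 0.9224 | 0.7975 | 0.924 | 0.921 | 0.926 |
| Nano Banana 2.0 | ✗ | 0.7788 | 0.8754 | 0.7372 | 0.965 | 0.981 | 0.949 |
| GPT Image 1 [High] | ✗ | 0.8569 | 0.9478 | 0.7982 | 0.788 | 0.956 | 0.619 |
| Qwen-Image | ✓ | 0.8288 | 0.9116 | 0.8017 | 0.945 | 0.943 | 0.946 |
| Qwen-Image-2512 | ✓ | 0.8604 | 0.9290 | 0.7819 | 0.961 | 0.956 | 0.965 |
| Z-Image | ✓ | 0.8671 | 0.9367 | 0.7969 | 0.936 | 0.935 | 0.936 |
| Z-Image-Turbo | ✓ | 0.8585 | 0.9281 | 0.8048 | 0.922 | 0.917 | 0.926 |
| GLM-Image | ✓ | 0.9116 | 0.9557 | 0.7877 | 0.966 | 0.952 | 0.979 |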

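Text-to-Image

| Model | Open Source | OneIG-Bench EN | OneIG-Bench ZH | TIIF-Bench Short | TIIF-Bench Long | DPG-Bench |
|---|---|---|---|---|---|---|
| Seedream 4.5 | ✗ | 0.576 | 0.551 | 90.49 | 88.52 | 88.63 |
| Seedream 4.0 | ✗ | 0.576 | 0.553 | 90.45 | 88.08 | 88.54 |
| Nano Banana 2.0 | ✗ | 0.578 | 0.567 | 91.00 | 88.26 | 87.16 |
| GPT Image 1 [High] | ✗ | 0.533 | 0.474 | 89.15 | 88.29 | 85.15 |
| DALL-E 3 | ✗ | - | - | 74.96 | 70.81 | 83.50 |
| Qwen-Image | ✓ | 0.539 | 0.548 | 86.14 | 86.83 | 88.32 |
| Qwen-Image-2512 | ✓ | 0.530 | 0.515 | 83.24 | 84.93 | 87.20 |
| Z-Image | ✓ | 0.546 | 0.535 | 80.20 | 83.01 | 88.14 |
| Z-Image-Turbo | ✓ | 0.528 | 0.507 | 77.73 | 80.05 | 84.86 |
| FLUX.1 [Dev] | ✓ | 0.434 | - | 71.09 | 71.78 | 83.52 |
| SD3 Medium | ✓ | - | - | 67.46 | 66.09 | 84.08 |
| SD XL | ✓ | 0.316 | - | 54.96 | 42.13 | 74.65 |
| BAGEL | ✓ | 0.361 | 0.370 | 71.50 | 71.70 | - |
| Janus-Pro | ✓ | 0.267 | 0.240 | 66.50 | 65.01 | 84.19 |
| Show-o2 | ✓ | 0.308 | - | 59.72 | 58.86 | - |
| GLM-Image | ✓ | 0.528 | 0.511 | 81.01 | 81.02 | 84.78 |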

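License

The GLM-Image model as a whole is released under the MIT License.

This project incorporates the VQ tokenizer weights and ViT weights from X-Omni/X-Omni-En, which are licensed under the Apache License, Version 2.0.

The VQ tokenizer and ViT weights remain subject to the original Apache-2.0 terms. Users should comply with the corresponding license when using these components.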