👋 Join our WeChat and Discord communities
📖 Check out the GLM-Image technical blog and GitHub repository
📍 Use the GLM-Image API
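GLM-Image is an image generation model built on a hybrid autoregressive + diffusion decoder architecture. In general image generation quality, GLM-Image is on par with mainstream latent diffusion approaches, but it shows significant advantages in text rendering and knowledge-intensive generation. It performs especially well on tasks that require precise semantic understanding and the expression of complex information, while retaining strong high-fidelity, fine-grained detail generation. Beyond text-to-image generation, GLM-Image also supports a rich set of image-to-image tasks, including image editing, style transfer, identity-preserving generation, and multi-subject consistency.
Model architecture: a hybrid autoregressive + diffusion decoder design.
Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy based on the GRPO algorithm, which substantially improves semantic understanding and visual detail quality.
GLM-Image supports both text-to-image and image-to-image generation within a single model.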
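
To make the GRPO reference above more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally built on: several samples are drawn for the same prompt, each one is scored, and each reward is normalized against its own group. This is not GLM-Image's training code. The function name, the example reward values, and the `eps` constant are illustrative assumptions, and the decoupled, modular feedback signals mentioned above are not modeled.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the sampled group
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward scores for four images sampled from one prompt.
example_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(example_rewards)
```

Samples that score above their group's mean receive a positive advantage and are reinforced, while below-average samples receive a negative one, so no separate value network is needed in this formulation.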
You can find the full GLM-Image implementation in the transformers and diffusers libraries.
Install transformers and diffusers from source:
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git

import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy."
image = pipe(
    prompt=prompt,
    height=32 * 32,  # 1024 px
    width=36 * 32,  # 1152 px
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),  # fixed seed for reproducibility
).images[0]
image.save("output_t2i.png")

import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
from PIL import Image
pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
image_path = "cond.jpg"
prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."
image = Image.open(image_path).convert("RGB")
image = pipe(
    prompt=prompt,
    image=[image],  # pass several images, e.g. [image, image1], for multi-image-to-image generation
    height=33 * 32,  # height must be set even if it matches the input image
    width=32 * 32,  # width must be set even if it matches the input image
    num_inference_steps=50,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("output_i2i.png")

Install SGLang, transformers, and diffusers from source:
pip install "sglang[diffusion] @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git

sglang serve --model-path zai-org/GLM-Image
curl http://localhost:30000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-Image",
"prompt": "a beautiful girl with glasses.",
"n": 1,
"response_format": "b64_json",
"size": "1024x1024"
}' | python3 -c "import sys, json, base64; open('output_t2i.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"

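
The same text-to-image request can also be issued from Python instead of curl. The sketch below uses only the standard library and mirrors the endpoint, payload fields, and `b64_json` decoding from the command above; the helper names (`build_generation_request`, `decode_first_image`, `generate_image`) are illustrative and not part of SGLang.

```python
import base64
import json
import urllib.request

# Endpoint and payload fields are copied from the curl example above.
GENERATIONS_URL = "http://localhost:30000/v1/images/generations"


def build_generation_request(prompt, size="1024x1024", model="zai-org/GLM-Image"):
    """Build the POST request without sending it."""
    payload = {
        "model": model,
        "prompt": prompt,
        "n": 1,
        "response_format": "b64_json",
        "size": size,
    }
    return urllib.request.Request(
        GENERATIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_first_image(response_body):
    """Return the raw image bytes from a parsed b64_json response."""
    return base64.b64decode(response_body["data"][0]["b64_json"])


def generate_image(prompt, out_path="output_t2i.png"):
    """Send the request to a running server and save the first image."""
    request = build_generation_request(prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    with open(out_path, "wb") as f:
        f.write(decode_first_image(body))
    return out_path
```

Calling `generate_image("a beautiful girl with glasses.")` requires the `sglang serve` process to be listening on port 30000. Building the request and decoding the response are kept as separate functions so they can be checked without a server.
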
curl -s -X POST "http://localhost:30000/v1/images/edits" \
-F "model=zai-org/GLM-Image" \
-F "image=@cond.jpg" \
-F "prompt=Replace the background of the snow forest with an underground station featuring an automatic escalator." \
-F "response_format=b64_json" | python3 -c "import sys, json, base64; open('output_i2i.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"

Sampling uses do_sample=True, a temperature of 0.9, and a topp of 0.75. A higher temperature makes the output more diverse and richer, but it can also make the output less stable.
With enable_model_cpu_offload=True, the model runs in about 23 GB of GPU memory, at the cost of slower inference.

| Model | Open source | CVTG-2K Word Accuracy | CVTG-2K NED | CVTG-2K CLIPScore | LongText-Bench Avg | LongText-Bench EN | LongText-Bench ZH |
|---|---|---|---|---|---|---|---|
| Seedream 4.5 | ✗ | 0.8990 | 0.9483 | 0.8069 | 0.988 | 0.989 | 0.987 |
| Seedream 4.0 | ✗ | 0.8451 | 0.9224 | 0.7975 | 0.924 | 0.921 | 0.926 |
| Nano Banana 2.0 | ✗ | 0.7788 | 0.8754 | 0.7372 | 0.965 | 0.981 | 0.949 |
| GPT Image 1 [High] | ✗ | 0.8569 | 0.9478 | 0.7982 | 0.788 | 0.956 | 0.619 |
| Qwen-Image | ✓ | 0.8288 | 0.9116 | 0.8017 | 0.945 | 0.943 | 0.946 |
| Qwen-Image-2512 | ✓ | 0.8604 | 0.9290 | 0.7819 | 0.961 | 0.956 | 0.965 |
| Z-Image | ✓ | 0.8671 | 0.9367 | 0.7969 | 0.936 | 0.935 | 0.936 |
| Z-Image-Turbo | ✓ | 0.8585 | 0.9281 | 0.8048 | 0.922 | 0.917 | 0.926 |
| GLM-Image | ✓ | 0.9116 | 0.9557 | 0.7877 | 0.966 | 0.952 | 0.979 |
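
For readers unfamiliar with the NED column, the sketch below computes a normalized edit-distance similarity between the text requested in a prompt and the text read back from a generated image. It assumes NED is reported as one minus the Levenshtein distance divided by the longer string's length, so that 1.0 is a perfect match. The exact normalization and text-recognition pipeline used by CVTG-2K are not specified in this document, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ned_similarity(target, rendered):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(target), len(rendered))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(target, rendered) / longest


# Hypothetical example: one character dropped from the rendered title.
score = ned_similarity("Raspberry Mousse Cake", "Raspberry Mouse Cake")
```

Unlike word accuracy, which counts a word as wrong after a single bad character, a similarity of this kind gives partial credit for near misses, which is why the two columns can rank models differently.
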

| Model | Open source | OneIG-Bench EN | OneIG-Bench ZH | TIIF-Bench Short | TIIF-Bench Long | DPG-Bench |
|---|---|---|---|---|---|---|
| Seedream 4.5 | ✗ | 0.576 | 0.551 | 90.49 | 88.52 | 88.63 |
| Seedream 4.0 | ✗ | 0.576 | 0.553 | 90.45 | 88.08 | 88.54 |
| Nano Banana 2.0 | ✗ | 0.578 | 0.567 | 91.00 | 88.26 | 87.16 |
| GPT Image 1 [High] | ✗ | 0.533 | 0.474 | 89.15 | 88.29 | 85.15 |
| DALL-E 3 | ✗ | - | - | 74.96 | 70.81 | 83.50 |
| Qwen-Image | ✓ | 0.539 | 0.548 | 86.14 | 86.83 | 88.32 |
| Qwen-Image-2512 | ✓ | 0.530 | 0.515 | 83.24 | 84.93 | 87.20 |
| Z-Image | ✓ | 0.546 | 0.535 | 80.20 | 83.01 | 88.14 |
| Z-Image-Turbo | ✓ | 0.528 | 0.507 | 77.73 | 80.05 | 84.86 |
| FLUX.1 [Dev] | ✓ | 0.434 | - | 71.09 | 71.78 | 83.52 |
| SD3 Medium | ✓ | - | - | 67.46 | 66.09 | 84.08 |
| SD XL | ✓ | 0.316 | - | 54.96 | 42.13 | 74.65 |
| BAGEL | ✓ | 0.361 | 0.370 | 71.50 | 71.70 | - |
| Janus-Pro | ✓ | 0.267 | 0.240 | 66.50 | 65.01 | 84.19 |
| Show-o2 | ✓ | 0.308 | - | 59.72 | 58.86 | - |
| GLM-Image | ✓ | 0.528 | 0.511 | 81.01 | 81.02 | 84.78 |
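
The GLM-Image model as a whole is released under the MIT License.
This project incorporates the VQ tokenizer weights and ViT weights from X-Omni/X-Omni-En, which are licensed under the Apache License, Version 2.0.
The VQ tokenizer and ViT weights remain subject to the original Apache-2.0 terms. Users should comply with the respective license when using this component.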