Z-Image-Turbo-SDA on Ascend NPU

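1. Introduction

This document records the quick deployment and verification results for Z-Image-Turbo-SDA on Ascend NPU (Atlas 800 A2/A3).

Z-Image-Turbo-SDA is a LoKr (Low-Rank Kronecker Product) adapter for Tongyi-MAI/Z-Image-Turbo. It uses Semantic Directional Alignment (SDA) to address the "Diversity Collapse" problem of few-step distilled models. While keeping the 8-step inference speed, the adapter recovers 70.2% of the original teacher model's compositional diversity, as measured by LPIPS.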
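
To make the "Low-Rank Kronecker Product" idea concrete, here is a minimal pure-Python sketch of a LoKr-style update: the weight delta is the Kronecker product of two small factor matrices, scaled by the adapter weight. The matrices, shapes, and the helper names `kron` and `apply_lokr` are illustrative assumptions. The factorization actually stored in the SDA adapter file may differ (for example, one factor may itself be low-rank).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a_val * b[i][j] for a_val in a_row for j in range(cols_b)]
        for a_row in a
        for i in range(rows_b)
    ]


def apply_lokr(weight, a, b, scale=1.0):
    """Return weight + scale * kron(a, b); scale plays the role of the adapter weight."""
    delta = kron(a, b)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy factors (not real adapter values): two 2x2 matrices describe a 4x4 update,
# so 8 stored numbers stand in for 16.
a = [[1.0, 0.0], [0.0, 2.0]]
b = [[0.5, -0.5], [1.0, 0.0]]
base = [[0.0] * 4 for _ in range(4)]
adapted = apply_lokr(base, a, b, scale=0.5)
```

In this toy setting, lowering `scale` shrinks the update linearly, which is the intuition behind the `--adapter_weight` option described later.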

Model components:

Base Model: Tongyi-MAI/Z-Image-Turbo (8-Step Flow Matching, 6B parameters)
Text Encoder: Qwen3ForCausalLM (4B)
VAE: AutoencoderKL
Scheduler: FlowMatchEulerDiscreteScheduler
Adapter: LoKr (SDA Diversity Recovery, ~162MB)

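Key features:

Restores the generation diversity of the distilled model, so different seeds produce different compositions
Keeps the fast 8-step inference speed
Compatible with other Z-Image LoRA adapters and ControlNet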

2. Verification Environment

| Component | Version |
| --- | --- |
| `torch` | `2.9.0+cpu` |
| `torch-npu` | `2.9.0.post1+gitee7ba04` |
| `diffusers` | `0.38.0` |
| `transformers` | `4.57.6` |
| `peft` | `0.19.1` |

NPU: Ascend910B4
Base model path: /tmp/models/Z-Image-Turbo
Adapter path: /opt/atomgit/z_image_turbo_sda/model/adapter/zit_sda_v1.safetensors
Supported resolutions: 512×512 to 2048×2048

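3. Startup

3.1 Environment Configuration

# Select the NPU device
export ASCEND_RT_VISIBLE_DEVICES=0

# Enable the memory allocation optimization
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# Set the HCCL buffer size (in MB)
export HCCL_BUFFSIZE=512

# Configure OpenMP thread binding and thread count
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1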
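
The same settings can also be applied from Python, provided this happens before `torch_npu` is imported. The sketch below works under that assumption. The helper name `configure_npu_env` is hypothetical, and `setdefault` is used so that values already exported in the shell are not overwritten.

```python
import os

# Values copied from the export commands above.
NPU_ENV_DEFAULTS = {
    "ASCEND_RT_VISIBLE_DEVICES": "0",
    "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:True",
    "HCCL_BUFFSIZE": "512",
    "OMP_PROC_BIND": "false",
    "OMP_NUM_THREADS": "1",
}


def configure_npu_env(overrides=None):
    """Set each variable unless the shell already defines it; return the effective values."""
    settings = {**NPU_ENV_DEFAULTS, **(overrides or {})}
    for key, value in settings.items():
        os.environ.setdefault(key, str(value))
    return {key: os.environ[key] for key in settings}


# Call this at the top of a script, before `import torch_npu`.
effective = configure_npu_env()
```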

3.2 Model Download

# Download the base model
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download \
    --resume-download Tongyi-MAI/Z-Image-Turbo \
    --local-dir /tmp/models/Z-Image-Turbo

# Download the SDA LoKr adapter
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download \
    --resume-download F16/z-image-turbo-sda \
    --local-dir /opt/atomgit/z_image_turbo_sda/model
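
After downloading, a quick sanity check can catch an incomplete transfer before the model is loaded. The sketch below assumes the base model directory contains a `model_index.json` (as diffusers pipeline repositories normally do) and that the adapter sits at the path used elsewhere in this document. The function name `check_download` is hypothetical.

```python
from pathlib import Path


def check_download(base_model_dir, adapter_file):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    index_file = Path(base_model_dir) / "model_index.json"
    adapter = Path(adapter_file)
    if not index_file.is_file():
        problems.append(f"missing pipeline index: {index_file}")
    if not adapter.is_file():
        problems.append(f"missing adapter file: {adapter}")
    elif adapter.stat().st_size == 0:
        problems.append(f"adapter file is empty: {adapter}")
    return problems


if __name__ == "__main__":
    issues = check_download(
        "/tmp/models/Z-Image-Turbo",
        "/opt/atomgit/z_image_turbo_sda/model/adapter/zit_sda_v1.safetensors",
    )
    print("\n".join(issues) if issues else "Download looks complete.")
```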

3.3 Inference Script

Basic inference command:

python inference.py \
    --base_model_path /tmp/models/Z-Image-Turbo \
    --adapter_path /opt/atomgit/z_image_turbo_sda/model/adapter/zit_sda_v1.safetensors \
    --prompt "A young woman standing on a sunny coastline, white dress fluttering in the sea breeze." \
    --height 1024 \
    --width 1024 \
    --num_inference_steps 8 \
    --guidance_scale 1.0 \
    --adapter_weight 1.0 \
    --dtype bfloat16 \
    --device npu:0 \
    --output output

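3.4 Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `--base_model_path` | `/tmp/models/Z-Image-Turbo` | Base model path |
| `--adapter_path` | `model/adapter/zit_sda_v1.safetensors` | LoKr adapter path |
| `--adapter_weight` | `1.0` | Adapter weight (0.5 to 1.0 recommended) |
| `--prompt` | - | Text prompt |
| `--height` | `1024` | Image height |
| `--width` | `1024` | Image width |
| `--num_inference_steps` | `8` | Number of inference steps |
| `--guidance_scale` | `1.0` | CFG guidance strength |
| `--device` | `npu:0` | Device |
| `--warmup_runs` | `1` | Number of warmup runs |
| `--benchmark_runs` | `3` | Number of benchmark runs |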
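
For readers who want to script around these options, the sketch below shows how an `argparse` parser with the same names and defaults could look. It illustrates the table only and does not reproduce the real `inference.py`, whose internals are not shown in this document. Options that appear in the example command but not in the table (`--dtype`, `--output`) are left out because their defaults are not documented here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative, not the real script)."""
    parser = argparse.ArgumentParser(description="Z-Image-Turbo-SDA inference (sketch)")
    parser.add_argument("--base_model_path", default="/tmp/models/Z-Image-Turbo")
    parser.add_argument("--adapter_path", default="model/adapter/zit_sda_v1.safetensors")
    parser.add_argument("--adapter_weight", type=float, default=1.0)
    # The table lists no default for the prompt, so None is an assumption.
    parser.add_argument("--prompt", default=None)
    parser.add_argument("--height", type=int, default=1024)
    parser.add_argument("--width", type=int, default=1024)
    parser.add_argument("--num_inference_steps", type=int, default=8)
    parser.add_argument("--guidance_scale", type=float, default=1.0)
    parser.add_argument("--device", default="npu:0")
    parser.add_argument("--warmup_runs", type=int, default=1)
    parser.add_argument("--benchmark_runs", type=int, default=3)
    return parser


# Example: override only the options that differ from the defaults.
args = build_parser().parse_args(
    ["--height", "512", "--width", "512", "--num_inference_steps", "4"]
)
```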

4. Smoke Verification

4.1 Basic Functionality Check

# Check that the model and adapter load
python -c "
import torch
import torch_npu  # registers the npu device with torch
from diffusers import ZImagePipeline
from safetensors.torch import load_file

# Load the base pipeline in bfloat16 and move it to the first NPU
pipe = ZImagePipeline.from_pretrained('/tmp/models/Z-Image-Turbo', torch_dtype=torch.bfloat16)
pipe = pipe.to('npu:0')

# Load the SDA LoKr adapter weights and activate them at full strength
adapter = load_file('/opt/atomgit/z_image_turbo_sda/model/adapter/zit_sda_v1.safetensors')
pipe.load_lora_weights(adapter, adapter_name='sda_diversity')
pipe.set_adapters(['sda_diversity'], adapter_weights=[1.0])
print('Model loaded successfully!')
print(f'Device: {pipe.device}')
print(f'Dtype: {pipe.dtype}')
"

4.2 Inference Test

# Quick inference test (512x512, 4 steps)
python inference.py \
    --height 512 \
    --width 512 \
    --num_inference_steps 4 \
    --warmup_runs 1 \
    --benchmark_runs 1 \
    --output output/smoke_test

Verification results:

Model loads successfully
LoKr adapter loads correctly
Inference runs without errors
Output image size is correct
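
The image-size check can be automated without extra dependencies. The sketch below reads the width and height from a PNG header using only the standard library. It assumes the script writes PNG files; the output format and file names are not specified in this document, and the helper names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    # Width and height are two big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def check_output_size(path, expected_width, expected_height):
    """True if the PNG at `path` has exactly the expected dimensions."""
    return png_size(path) == (expected_width, expected_height)
```

For the smoke test above, the expected size would be 512 by 512, for example `check_output_size("output/smoke_test/<image>.png", 512, 512)` with the actual file name substituted.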

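5. Performance Reference

5.1 Test Conditions

| Parameter | Value |
| --- | --- |
| Resolution | `1024×1024` |
| Inference steps | `8` |
| Guidance Scale | `1.0` |
| Adapter Weight | `1.0` |
| Data type | `bfloat16` |
| Device | `Ascend910B4` |

5.2 Performance Metrics

Run the performance evaluation script: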

python perf_eval.py \
    --height 1024 \
    --width 1024 \
    --num_inference_steps 8 \
    --num_warmup 2 \
    --num_iterations 10 \
    --output perf_results.json

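Reference performance data (taken from the same-architecture Z-Image-Fun-Lora-Distill and measured at 4 steps, so not directly comparable to the 8-step test conditions in 5.1):

| Metric | Reference value |
| --- | --- |
| 512×512, 4 steps | ~0.76 s, ~1.31 img/s |
| 1024×1024, 4 steps | ~2.19 s, ~0.46 img/s |
| Device memory allocated | ~20 GB |
| Device memory reserved | ~24 GB |

Note: actual performance may vary with hardware configuration and system load.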
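
As a rough guide to how such latency and throughput figures are obtained, the sketch below times a callable after a few warmup runs. It is a generic stand-in and does not show the contents of `perf_eval.py`. The name `benchmark` and the returned keys are hypothetical, and one image per call is assumed, so throughput is the reciprocal of latency (for example, 1 / 2.19 s ≈ 0.46 img/s).

```python
import statistics
import time


def benchmark(run_once, num_warmup=2, num_iterations=10):
    """Time `run_once` after warmup; assumes each call produces exactly one image."""
    for _ in range(num_warmup):
        run_once()  # warmup runs are not timed
    latencies = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        # On an NPU, run_once should block until the device has finished,
        # e.g. by synchronizing inside the callable (an assumption of this sketch).
        run_once()
        latencies.append(time.perf_counter() - start)
    mean_latency = statistics.mean(latencies)
    return {
        "mean_latency_s": mean_latency,
        "images_per_s": 1.0 / mean_latency if mean_latency > 0 else float("inf"),
        "num_iterations": num_iterations,
    }


def throughput(latency_s):
    """Images per second for a single-image latency in seconds."""
    return 1.0 / latency_s
```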

7. Notes

7.1 Memory Requirements

Minimum device memory: 20 GB
Recommended device memory: 24 GB or more
Supported resolution range: 512×512 to 2048×2048

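7.2 Recommended Inference Parameters

| Scenario | Steps | CFG | Adapter Weight | Notes |
| --- | --- | --- | --- | --- |
| Maximum diversity | 8 | 1.0 | 1.0 | Full SDA effect |
| Balanced | 8 | 1.0 | 0.7 | Moderate diversity |
| Combined with other LoRAs | 8 | 1.0 | 0.5 | Lower risk of conflicts |
| Quick preview | 4 | 1.0 | 1.0 | Speed first |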
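
The recommendations above can be kept as named presets and expanded into command-line arguments. This is a small convenience sketch: the preset names and the helper `preset_to_cli_args` are hypothetical, while the option names and values come from the tables in this document.

```python
# Values copied from the recommendation table; the preset names are made up here.
PRESETS = {
    "max_diversity": {"num_inference_steps": 8, "guidance_scale": 1.0, "adapter_weight": 1.0},
    "balanced": {"num_inference_steps": 8, "guidance_scale": 1.0, "adapter_weight": 0.7},
    "with_other_lora": {"num_inference_steps": 8, "guidance_scale": 1.0, "adapter_weight": 0.5},
    "quick_preview": {"num_inference_steps": 4, "guidance_scale": 1.0, "adapter_weight": 1.0},
}


def preset_to_cli_args(name):
    """Expand a preset into a flat list of '--option value' arguments."""
    args = []
    for key, value in PRESETS[name].items():
        args += [f"--{key}", str(value)]
    return args


# Example: build the argument list for the balanced preset.
balanced_args = preset_to_cli_args("balanced")
```

The resulting list can be appended to a `python inference.py ...` invocation, for example through `subprocess.run`.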

7.3 Common Issues

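Out of device memory: lower the resolution. Lowering the adapter weight only scales the adapter's contribution and is unlikely to reduce memory use.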
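Conflicts with other LoRAs: lower the SDA adapter weight to 0.5~0.7
Diversity is more visible with simple prompts: this is expected, because complex prompts impose more compositional constraints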

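7.4 NPU-Specific Optimizations

Use the bfloat16 data type for best performance
Enable PYTORCH_NPU_ALLOC_CONF=expandable_segments:True to optimize memory allocation
Set HCCL_BUFFSIZE=512 to tune the communication buffer
Create the Generator with device="cpu" so that seeds are reproducible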
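
The last point can be illustrated with plain PyTorch on CPU. The sketch below shows that a CPU generator seeded with the same value yields identical noise, which is what makes a seed reproducible regardless of the accelerator. The tensor shape and the helper name `make_generator` are arbitrary choices for illustration.

```python
import torch


def make_generator(seed):
    """Create a CPU generator so the random stream does not depend on the NPU."""
    return torch.Generator(device="cpu").manual_seed(seed)


# Same seed, same noise; a different seed gives different noise.
noise_a = torch.randn(1, 4, 8, 8, generator=make_generator(42))
noise_b = torch.randn(1, 4, 8, 8, generator=make_generator(42))
noise_c = torch.randn(1, 4, 8, 8, generator=make_generator(43))

print("same seed matches:", torch.equal(noise_a, noise_b))
print("different seed matches:", torch.equal(noise_a, noise_c))
```

When calling the pipeline, a generator built this way would be passed through the `generator` argument that diffusers pipelines accept.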

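8. Files

| File | Description |
| --- | --- |
| `inference.py` | Inference script (smoke verification + benchmark) |
| `perf_eval.py` | Performance evaluation script |
| `accuracy_eval.py` | Accuracy evaluation script |
| `model/adapter/zit_sda_v1.safetensors` | SDA LoKr adapter weights |
| `readme.md` | Deployment documentation |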

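Adaptation Info

Adaptation date: 2026-05-10
Target device: Huawei Ascend 910B4 NPU
Status: verification completed
Main changes: the base model is loaded with diffusers.ZImagePipeline and the LoKr adapter is loaded through peft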

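Ascend NPU Accuracy Evaluation

NPU vs CPU accuracy comparison (CPU is the baseline, NPU is the target under verification):

| Metric | Value |
| --- | --- |
| Number of test cases | Pending |
| Maximum logits difference | Pending |
| Prediction consistency | Pending |
| Accuracy requirement | NPU vs CPU maximum logits error < 1% |
| Accuracy conclusion | Pending |

The accuracy evaluation source code and logs are in the eval/ directory.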
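
Since the results are still pending, the sketch below only illustrates one way the "< 1%" criterion could be checked once CPU and NPU outputs are available. The error definition (maximum absolute difference divided by the largest baseline magnitude) is an assumption, as this document does not define it. The function names are hypothetical and the numbers are made-up toy values, not measurements.

```python
def max_relative_error(baseline, candidate):
    """Max absolute difference, normalized by the largest baseline magnitude."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must have the same length")
    scale = max(abs(x) for x in baseline)
    if scale == 0.0:
        raise ValueError("baseline is all zeros; relative error is undefined")
    return max(abs(b - c) for b, c in zip(baseline, candidate)) / scale


def passes_accuracy_check(baseline, candidate, tolerance=0.01):
    """True if the normalized error stays below the tolerance (1% by default)."""
    return max_relative_error(baseline, candidate) < tolerance


# Toy values for illustration only; real inputs would be flattened CPU and NPU outputs.
cpu_values = [0.50, -1.20, 2.00, 0.05]
npu_values = [0.501, -1.195, 2.004, 0.049]
ok = passes_accuracy_check(cpu_values, npu_values)
```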