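ERNIE-Image is a text-to-image diffusion model developed by Baidu. It is built on a Transformer architecture and supports high-quality text-to-image generation.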
Transformer Config:
- hidden_size: 4096
- num_layers: 36
- num_attention_heads: 32
- ffn_hidden_size: 12288
- in_channels: 128 (latent)
- out_channels: 128
- patch_size: 1
- text_in_dim: 3072
- eps: 1e-06
- qk_layernorm: True
- rope_axes_dim: [32, 48, 48]
- rope_theta: 256

Mixed sequence: the model processes image tokens and text tokens together in one sequence.
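As a quick sanity check on the Transformer configuration listed above, the values can be verified for internal consistency. This is a minimal sketch that assumes the usual convention `head_dim = hidden_size / num_attention_heads`; the source does not state the head dimension explicitly.

```python
# Values copied from the Transformer Config list.
config = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "ffn_hidden_size": 12288,
    "rope_axes_dim": [32, 48, 48],
}

# Assumption: heads split hidden_size evenly (standard multi-head attention).
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The per-axis rotary dimensions should together cover one attention head.
rope_dim_total = sum(config["rope_axes_dim"])

# Ratio of feed-forward width to model width.
ffn_ratio = config["ffn_hidden_size"] / config["hidden_size"]

print(head_dim, rope_dim_total, ffn_ratio)  # 128 128 3.0
```

Under that assumption the head dimension is 128, which matches the sum of `rope_axes_dim` and also sits exactly at the `MAX_DIM = 128` limit used by the laser attention check in the diffusers patch.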
Sequence Parallel challenges:
NPU hardware:
| Package | Version | Description |
|---|---|---|
| cann | 8.5.1 | Ascend base library |
| torch | 2.9.0+cpu | PyTorch base library |
| torch_npu | 2.9.0 | Huawei NPU adapter for PyTorch |
| diffusers | 0.38.0 | HuggingFace diffusion model library (modified) |
| transformers | 5.5.3 | Text encoder dependency |
| vllm | 0.19.1+empty | vLLM inference framework |
| vllm_ascend | 0.19.1rc1 | NPU backend |
| vllm-omni | 0.19.0rc1+npu | Multimodal extension framework (modified) |
| modelscope | 1.36.3 | Model download tool |
| accelerate | 1.12.0 | Distributed inference support |
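The versions in the table can be compared against the running environment with the standard library. The sketch below is illustrative only: the distribution names are assumed to match the table's package names (they may differ in your image, for example `torch-npu` versus `torch_npu`), and `cann` is omitted because it is not a Python package.

```python
from importlib import metadata

# Expected versions copied from the table (distribution name -> version).
# Assumption: these names match the installed distribution names.
EXPECTED = {
    "torch": "2.9.0+cpu",
    "torch_npu": "2.9.0",
    "diffusers": "0.38.0",
    "transformers": "5.5.3",
    "vllm": "0.19.1+empty",
    "vllm_ascend": "0.19.1rc1",
    "vllm-omni": "0.19.0rc1+npu",
    "modelscope": "1.36.3",
    "accelerate": "1.12.0",
}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # not installed under this distribution name
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {have}")
```

An empty result means every listed package is installed at the expected version.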
SP=4 configuration:
parallel_config:
  sequence_parallel_size: 4  # run across 4 NPUs
  ulysses_degree: 4          # Ulysses SP algorithm
  ring_degree: 1
  tensor_parallel_size: 1
  pipeline_parallel_size: 1
  data_parallel_size: 1

Download with the ModelScope SDK:
# Install modelscope
pip install modelscope
# Download the ERNIE-Image model
python3 << 'EOF'
from modelscope import snapshot_download
model_dir = snapshot_download(
    'PaddlePaddle/ERNIE-Image',
    cache_dir='/opt/data/modelscope/hub'
)
print(f"Model downloaded to: {model_dir}")
EOF

Weight directory layout:
/opt/data/modelscope/hub/models/PaddlePaddle/ERNIE-Image/
├── transformer/
│ ├── config.json
│ ├── diffusion_pytorch_model.safetensors
│ └── model.safetensors.index.json
├── vae/
│ ├── config.json
│ └── diffusion_pytorch_model.safetensors
├── text_encoder/
│ ├── config.json
│ ├── model.safetensors
│ └── tokenizer_config.json
├── scheduler/
│ └── scheduler_config.json

docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.19.1rc1-openeuler

docker run -it -u root -d --net=host \
--privileged --ipc=host \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /opt/data/modelscope:/opt/data/modelscope --name ernie-image \
m.daocloud.io/quay.io/ascend/vllm-ascend:v0.19.1rc1-openeuler /bin/bash

Deploying ERNIE-Image correctly requires modifying two core libraries:
Location: /usr/local/python3.11.14/lib/python3.11/site-packages/diffusers/models/transformers/transformer_ernie_image.py
Change: add an NPU-optimized rotary embedding in ErnieImageSingleStreamAttnProcessor.__call__
Patch: see patches/diffusers_transformer_ernie_image.patch
Key code:
## add by wei (Line 123)
def apply_rotary_emb_npu(x_in: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    rot_dim = freqs_cis.shape[-1]
    x, x_pass = x_in[..., :rot_dim], x_in[..., rot_dim:]
    cos_ = torch.cos(freqs_cis).to(x.dtype)
    sin_ = torch.sin(freqs_cis).to(x.dtype)
    # Use NPU optimized rotary embedding
    out = rotary_position_embedding(x, cos_, sin_, rotated_mode='rotated_half')
    return torch.cat((out, x_pass), dim=-1)

# Auto-select NPU or CPU implementation
if freqs_cis is not None:
    if is_torch_npu_available():
        query = apply_rotary_emb_npu(query, freqs_cis)
        key = apply_rotary_emb_npu(key, freqs_cis)
    else:
        query = apply_rotary_emb(query, freqs_cis)
        key = apply_rotary_emb(key, freqs_cis)

Location: /usr/local/python3.11.14/lib/python3.11/site-packages/diffusers/models/attention_dispatch.py
Change: add a laser attention eligibility check and an NPU attention-mask broadcast function
Patch: see patches/diffusers_attention_dispatch.patch
Key code:
## add by wei (Line 3311)
def is_supported_laser_attention(head_dim, q_seqlen, kv_seqlen):
    MAX_DIM = 128
    MIN_SEQLEN_SELF = 4000
    MIN_SEQLEN_CROSS = 118404
    MAX_SEQLEN_CROSS = 119056
    if head_dim > MAX_DIM:
        return False
    if q_seqlen == kv_seqlen:
        return q_seqlen >= MIN_SEQLEN_SELF
    else:
        return (MIN_SEQLEN_CROSS <= q_seqlen <= MAX_SEQLEN_CROSS) and \
            (MIN_SEQLEN_CROSS <= kv_seqlen <= MAX_SEQLEN_CROSS)

def _broadcast_attn_mask_npu(query, key, attn_mask):
    if attn_mask is not None:
        if attn_mask.ndim == 2 and attn_mask.shape[0] == query.shape[0] and attn_mask.shape[1] == key.shape[1]:
            batch_size, seq_len_q, seq_len_kv = attn_mask.shape[0], query.shape[1], key.shape[1]
            attn_mask = attn_mask.unsqueeze(1).expand(batch_size, seq_len_q, seq_len_kv).unsqueeze(1).contiguous()
        elif attn_mask.ndim == 4 and attn_mask.shape[1:3] == (1, 1):
            attn_mask = attn_mask.expand(-1, -1, query.shape[1], -1).contiguous()
    return attn_mask

Location: /vllm-workspace/vllm-omni/vllm_omni/diffusion/models/ernie_image/pipeline_ernie_image.py
Main changes:
Patch: see patches/vllm_omni_pipeline_ernie_image.patch
Location: /vllm-workspace/vllm-omni/vllm_omni/diffusion/models/ernie_image/ernie_image_transformer.py
Main changes:
_sp_plan["x_embedder"] configuration replaced by manual sharding
Patch: see patches/vllm_omni_ernie_image_transformer.patch
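Manual sharding means each SP rank takes its own contiguous slice of the token sequence, instead of relying on an automatic plan. The sketch below illustrates the idea on a plain Python list, including tail padding for lengths that are not divisible by the SP size. The helper names `shard_sequence` and `gather_sequence` are hypothetical and are not taken from the patch.

```python
def shard_sequence(tokens, sp_size, rank, pad_value=0):
    """Return (local_slice, pad_count) for one rank.

    The sequence is padded at the tail so that its length is divisible
    by sp_size, then split into sp_size contiguous, equal-length slices.
    """
    pad = (-len(tokens)) % sp_size
    padded = list(tokens) + [pad_value] * pad
    local_len = len(padded) // sp_size
    start = rank * local_len
    return padded[start:start + local_len], pad


def gather_sequence(shards, original_len):
    """Concatenate per-rank slices in rank order and drop the tail padding."""
    full = [token for shard in shards for token in shard]
    return full[:original_len]


tokens = list(range(10))  # toy sequence; 10 is not divisible by 4
shards = [shard_sequence(tokens, sp_size=4, rank=r)[0] for r in range(4)]
print([len(s) for s in shards])                       # [3, 3, 3, 3]
print(gather_sequence(shards, len(tokens)) == tokens)  # True
```

Padding at the tail keeps every rank's slice the same length, and trimming to the original length after the gather restores the unpadded sequence.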
Location: /vllm-workspace/vllm-omni/vllm_omni/diffusion/models/ernie_image/ulysses_attention.py
Main changes (critical bug fix):
[B, S, C] (rather than a heuristic check)
Core fix logic:
# Step 1: Gather hidden_states
B, S_local, C = hidden_states.shape
# Convert to sequence-first for gather
hidden_states = hidden_states.transpose(0, 1) # [B, S, C] -> [S, B, C]
hidden_states_full = sp_gather(hidden_states, dim=0) # Gather along sequence
hidden_states_full = hidden_states_full.transpose(0, 1) # Back to [B, S_full, C]
# Step 2: Gather rotary embedding (same process)
cos = cos.transpose(0, 1)
cos_full = sp_gather(cos, dim=0)
cos_full = cos_full.transpose(0, 1)
# Step 3: Create full attention mask
mask_full = torch.ones((B, 1, S_full, S_full), ...)
# Step 4: Compute attention on full sequence
output_full = processor(attn, hidden_states_full, mask_full, rotary_full)
# Step 5: Scatter back
output_full = output_full.transpose(0, 1)
output_local = sp_shard(output_full, dim=0)
output_local = output_local.transpose(0, 1)

Patch: see patches/vllm_omni_ulysses_attention.patch
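The layout issue behind this fix can be reproduced without any NPU or PyTorch. The sketch below uses nested Python lists, and `sp_gather_dim0` is a hypothetical stand-in for an all-gather that concatenates along dim 0; the channel axis is omitted, with each token represented by a `(batch, global_position)` tag.

```python
def transpose01(x):
    """Swap the first two axes of a nested list: [A, B] -> [B, A]."""
    return [list(row) for row in zip(*x)]


def sp_gather_dim0(shards):
    """Stand-in for an all-gather that concatenates rank-local data along dim 0."""
    return [item for shard in shards for item in shard]


B, S_local, sp_size = 2, 3, 4

# Rank r holds a batch-first [B, S_local] block of tagged tokens.
shards = [
    [[(b, r * S_local + s) for s in range(S_local)] for b in range(B)]
    for r in range(sp_size)
]

# Wrong: gathering a batch-first tensor along dim 0 stacks batches, not sequence.
wrong = sp_gather_dim0(shards)
print(len(wrong), len(wrong[0]))  # 8 3, i.e. [B * sp_size, S_local]

# Fixed: [B, S] -> [S, B], gather along the sequence axis, then transpose back.
seq_first = [transpose01(x) for x in shards]
full = transpose01(sp_gather_dim0(seq_first))
print(len(full), len(full[0]))  # 2 12, i.e. [B, S_full]
```

In the fixed path every batch row ends up with its tokens in global order, which is what full-sequence attention expects.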
# 1. Patch diffusers (requires root)
sudo cp patches/diffusers_transformer_ernie_image.patch \
    /usr/local/python3.11.14/lib/python3.11/site-packages/diffusers/models/transformers/
sudo patch -p0 < patches/diffusers_transformer_ernie_image.patch
sudo cp patches/diffusers_attention_dispatch.patch \
    /usr/local/python3.11.14/lib/python3.11/site-packages/diffusers/models/
sudo patch -p0 < patches/diffusers_attention_dispatch.patch
# 2. Patch vllm-omni
cd /vllm-workspace/vllm-omni/vllm_omni/diffusion/models/ernie_image/
patch -p0 < patches/vllm_omni_pipeline_ernie_image.patch
patch -p0 < patches/vllm_omni_ernie_image_transformer.patch
patch -p0 < patches/vllm_omni_ulysses_attention.patch

Single-card launch:
cd /vllm-workspace/ernie-image/bin
# Single-card config file (you need to create it yourself)
vllm serve /opt/data/modelscope/hub/models/PaddlePaddle/ERNIE-Image \
    --config config/ernie_stage_single.yaml \
    --port 8000 \
    --dtype bfloat16

SP=4 launch:
cd /vllm-workspace/ernie-image/bin
# Launch with the SP=4 configuration
bash start_ernie.sh

Contents of start_ernie.sh:
#!/bin/bash
echo "Starting ERNIE-Image (v0.19.0rc1 + diffusers adapter)..."
echo "Model: /opt/data/modelscope/hub/models/PaddlePaddle/ERNIE-Image"
echo "Port: 8000"
echo "Config: /vllm-workspace/ernie-image/config/ernie_stage_sp4_custom.yaml"
vllm serve /opt/data/modelscope/hub/models/PaddlePaddle/ERNIE-Image \
--config /vllm-workspace/ernie-image/config/ernie_stage_sp4_custom.yaml \
--port 8000 \
--dtype bfloat16

Service verification:
# Check service status
curl http://localhost:8000/v1/models
# Test image generation
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A beautiful sunset over the ocean",
    "size": "1024x1024",
    "num_inference_steps": 50,
    "guidance_scale": 4.0
  }'

Optimization result: 1024x1024 offline inference drops from 79s (single card) to 28s (4 cards).
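The two timings can be turned into a speedup and a scaling figure with simple arithmetic. This is a worked example only; the 28s figure may also reflect optimizations other than sequence parallelism, so the efficiency value should be read as an overall ratio against ideal linear scaling rather than a pure SP measurement.

```python
single_card_s = 79.0  # 1024x1024 offline inference on 1 card (from the text)
sp4_s = 28.0          # same workload on 4 cards (from the text)
num_cards = 4

speedup = single_card_s / sp4_s
efficiency = speedup / num_cards  # 1.0 would be ideal linear scaling

print(f"speedup: {speedup:.2f}x, efficiency vs. linear scaling: {efficiency:.1%}")
# speedup: 2.82x, efficiency vs. linear scaling: 70.5%
```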
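Optimization: use the NPU-specific rotary_position_embedding function in place of the generic implementation
Reason:
Implementation:
# Detect NPU availability and select the optimized version automatically
if is_torch_npu_available():
    query = apply_rotary_emb_npu(query, freqs_cis)
    key = apply_rotary_emb_npu(key, freqs_cis)
else:
    query = apply_rotary_emb(query, freqs_cis)  # CPU fallback
    key = apply_rotary_emb(key, freqs_cis)

Optimization: add a laser attention eligibility check and optimized NPU mask broadcasting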
Reason:
Applicable conditions:
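The thresholds in `is_supported_laser_attention` from the diffusers patch excerpt can be exercised directly to see which shapes qualify. The function body below is copied from that excerpt; the sample shapes are illustrative and are not measurements from the source.

```python
def is_supported_laser_attention(head_dim, q_seqlen, kv_seqlen):
    """Eligibility check, with thresholds copied from the patch excerpt."""
    MAX_DIM = 128
    MIN_SEQLEN_SELF = 4000
    MIN_SEQLEN_CROSS = 118404
    MAX_SEQLEN_CROSS = 119056
    if head_dim > MAX_DIM:
        return False
    if q_seqlen == kv_seqlen:
        return q_seqlen >= MIN_SEQLEN_SELF
    return (MIN_SEQLEN_CROSS <= q_seqlen <= MAX_SEQLEN_CROSS
            and MIN_SEQLEN_CROSS <= kv_seqlen <= MAX_SEQLEN_CROSS)


# (head_dim, q_seqlen, kv_seqlen): illustrative shapes only.
cases = [
    (128, 4096, 4096),      # equal lengths, long enough
    (128, 1024, 1024),      # equal lengths, too short
    (256, 8192, 8192),      # head_dim above the limit
    (128, 118500, 119000),  # unequal lengths, both inside the allowed window
    (128, 4096, 512),       # unequal lengths, outside the allowed window
]
for head_dim, q_len, kv_len in cases:
    ok = is_supported_laser_attention(head_dim, q_len, kv_len)
    print(f"head_dim={head_dim} q={q_len} kv={kv_len} -> {ok}")
```

In short, the check accepts head dimensions up to 128, equal query and key lengths of at least 4000, and unequal lengths only when both fall within 118404 to 119056.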
Optimization: implement SP with the Ulysses algorithm (rather than simple data splitting)
Principle:
Advantages:
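For orientation, the standard Ulysses scheme trades a sequence-sharded layout for a head-sharded one via an all-to-all, so that each rank attends over the full sequence for a subset of heads. The shape-only sketch below illustrates that general scheme; it is not the patched implementation from this document. It assumes `sequence_parallel_size = ulysses_degree * ring_degree`, a head dimension of 128, and a hypothetical sequence length of 4096.

```python
def ulysses_shapes(batch, seq_len, num_heads, head_dim, ulysses_degree):
    """Per-rank tensor shapes before and after the Ulysses all-to-all."""
    if seq_len % ulysses_degree or num_heads % ulysses_degree:
        raise ValueError("seq_len and num_heads must be divisible by ulysses_degree")
    # Before: each rank holds a slice of the sequence with all heads.
    before = (batch, seq_len // ulysses_degree, num_heads, head_dim)
    # After: each rank holds the full sequence with a slice of the heads.
    after = (batch, seq_len, num_heads // ulysses_degree, head_dim)
    return before, after


# Degrees from the parallel_config in this document.
sequence_parallel_size, ulysses_degree, ring_degree = 4, 4, 1
assert ulysses_degree * ring_degree == sequence_parallel_size

before, after = ulysses_shapes(
    batch=1, seq_len=4096, num_heads=32, head_dim=128, ulysses_degree=ulysses_degree
)
print(before, after)  # (1, 1024, 32, 128) (1, 4096, 8, 128)
```

With 32 attention heads and a Ulysses degree of 4, each rank would handle 8 heads over the full sequence.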
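Problem: the original implementation misidentified [B, S, C] as [S, B, C]
Fix:
[B, S, C] (diffusers source, Line 286)

Optimization: synchronize latents to all ranks at key points
Timing:
Reason:
Implementation:
if sp_size > 1:
    # Broadcast from rank 0 to all ranks
    dist.broadcast(latents, src=0, group=get_sp_group().device_group)

Problem: sizes such as 1328x1328 make N_img not divisible by 4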
Optimization:
Implementation:
if N_img % sp_size != 0:
    pad_size = sp_size - (N_img % sp_size)
    # Pad latent, position_ids, attention_mask
    img_sbh = torch.cat([img_sbh, padding], dim=0)
# After gather, remove padding
x_img_full = sp_gather(x_img_local, dim=0)
if N_img_padded != N_img:
    x_img_full = x_img_full[:N_img]  # Keep only original

Config file: config/ernie_stage_sp4_custom.yaml
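cache_backend: "cache_dit"
cache_config:
  Fn_compute_blocks: 1             # number of leading (first-n) blocks always computed
  Bn_compute_blocks: 1             # number of trailing (last-n) blocks always computed
  max_warmup_steps: 5              # warmup steps
  max_cached_steps: 20             # maximum number of cached steps
  max_continuous_cached_steps: 10  # maximum number of consecutive cached steps
  residual_diff_threshold: 0.05    # residual threshold (validated as the best setting)
  enable_cache_dit_summary: true   # enable summary logging

Core idea: cache intermediate Transformer layer results to avoid recomputation.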
ERNIE-Image adaptation (cache_dit_backend.py:676-747):
def enable_cache_for_ernie_image(pipeline, cache_config):
    """Enable cache-dit for ErnieImagePipeline."""
    # 1. Wrap the transformer with BlockAdapter
    cache_dit.enable_cache(
        BlockAdapter(
            transformer=transformer,
            blocks=transformer.layers,  # 36 transformer blocks
            forward_pattern=ForwardPattern.Pattern_3,  # single tensor in, single tensor out
            params_modifiers=[modifier],
            check_forward_pattern=False,  # ERNIE-Image has a non-standard forward signature
        ),
        cache_config=db_cache_config,
    )
    # 2. Return a refresh function (updates the step count dynamically)
    def refresh_cache_context(pipeline, num_inference_steps):
        cache_dit.refresh_context(transformer, num_inference_steps=num_inference_steps)
    return refresh_cache_context

Key parameter notes:
ForwardPattern.Pattern_3:
hidden_states -> hidden_states
check_forward_pattern=False:
residual_diff_threshold=0.05:
Performance gain:
Caching mechanism:
Cache-DiT Summary:
- Total steps: 50
- Cached steps: 35 (70%)
- Computed steps: 15 (30%)
- Cache hit rate: 0.85

Key design: cache-dit caches at the transformer layer and is independent of SP.
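To make the role of `residual_diff_threshold`, `max_warmup_steps`, and `max_continuous_cached_steps` more concrete, the sketch below models a residual-difference cache decision on plain Python lists. It is a simplified illustration under stated assumptions, not cache-dit's actual implementation: the relative L1 metric, the helper names, and the omission of `max_cached_steps` are all choices made for this example.

```python
def should_reuse_cache(prev_residual, curr_residual, threshold):
    """True when the relative L1 change of the probe residual is below threshold."""
    diff = sum(abs(c - p) for c, p in zip(curr_residual, prev_residual))
    norm = sum(abs(p) for p in prev_residual)
    return norm > 0 and diff / norm < threshold


def plan_steps(residuals, threshold=0.05, max_warmup_steps=5,
               max_continuous_cached_steps=10):
    """Return a per-step list of 'compute' / 'cache' decisions."""
    decisions, prev, streak = [], None, 0
    for step, res in enumerate(residuals):
        reuse = (
            step >= max_warmup_steps                    # never cache during warmup
            and prev is not None
            and streak < max_continuous_cached_steps    # cap consecutive cache hits
            and should_reuse_cache(prev, res, threshold)
        )
        if reuse:
            streak += 1
            decisions.append("cache")
        else:
            streak = 0
            prev = res  # refresh the reference residual on every full computation
            decisions.append("compute")
    return decisions


# Toy residuals that drift slowly from step to step.
residuals = [[1.0 + 0.01 * i, 2.0] for i in range(8)]
print(plan_steps(residuals))
```

With slowly drifting residuals, the warmup steps are always computed and later steps are served from cache until the drift exceeds the threshold or the consecutive-cache cap is reached.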
Enable verbose logging:
enable_cache_dit_summary: true
What to check:
Common issues:
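Problem description:
Root cause:
[B, S, C] identified as [S, B, C]
Solution:
[B, S, C]
[B, S, C] → transpose → [S, B, C] → gather → [S_full, B, C] → transpose → [B, S_full, C]
Verification result:

Problem description:
Solution:
Verification result:

Problem description:
Cause:
Solution:
_sp_plan["x_embedder"] automatic shard
Verification result:

Problem description:
[B, 1, S_local, S_local]
[B, S_full, C]
Solution:
[B, 1, S_full, S_full]
Verification result:

Problem description:
_sp_plan["x_embedder"]["split_output"] automatic shard
sp_shard
Solution:
_sp_plan["x_embedder"] and ["final_linear"]
Verification result: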
See /vllm-workspace/ernie-image/benchmark_ernie_image.py
Test content:
Single-card config: config/ernie_stage_single.yaml (must be created)
SP=4 config: config/ernie_stage_sp4_custom.yaml
patches/
├── diffusers_transformer_ernie_image.patch
├── diffusers_attention_dispatch.patch
├── vllm_omni_pipeline_ernie_image.patch
├── vllm_omni_ernie_image_transformer.patch
└── vllm_omni_ulysses_attention.patch