HunyuanVideo Inference: rainfusion Operator Adaptation and Optimization

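The Ascend ports of the HunyuanVideo model currently available on the Modelers (魔乐) community and on Gitee do not support the rainfusion sparse optimization method. This project builds on the Ascend migration of HunyuanVideo hosted on Gitee and adds support for that sparse algorithm.

Gitee project: https://gitee.com/ascend/ModelZoo-PyTorch/blob/master/MindIE/MultiModal/HunyuanVideo/README.md

Note: rainfusion is a lossy optimization algorithm and cannot be combined with other optimization operators.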

Modified files

/hyvideo/modules/attention.py
/hyvideo/modules/attn_layer.py
/hyvideo/modules/models.py

Environment setup

Component	Version	Setup guide
Python	3.11.6	-
torch	2.1.0	-
torch_npu	2.1.0.post13	-
diffusers	0.35.1	-
MindIE	2.1.RC1	-
CANN	8.2.0	-

rainfusion sparsity is supported only in MindIE 2.0.T18 and later.

Install the dependencies listed in the project's requirements.txt:

pip install -r requirements.txt

Running inference

Script for parallel inference on 4 cards

source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/llm_model/set_env.sh
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export TASK_QUEUE_ENABLE=2
export CPU_AFFINITY_CONF=1
export TOKENIZERS_PARALLELISM=false
export ALGO=3
torchrun --nproc_per_node=4  sample_video.py \
      --model-base /path/to/model/weight \
      --dit-weight /path/to/transformers/weight \
      --vae-path /path/to/vae/weight \
      --text-encoder-path /path/to/text_encoder/weight \
      --text-encoder-2-path /path/to/vit/weight \
      --model-resolution "720p" \
      --video-size 720 1280 \
      --video-length 129 \
      --infer-steps 50 \
      --prompt ./prompts2.txt \
      --seed 42 \
      --ulysses-degree 4 \
      --ring-degree 1 \
      --flow-reverse \
      --embedded-cfg-scale 6.0 \
      --flow-shift 7.0 \
      --save-path ./results_0708/4_1_129_720p_rainfusion_cache_sparse_0.8_new_test

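ALGO: 0 selects the default FA operator; 1 selects the high-performance FA operator; 2 selects the LA operator; 3 selects the rainfusion operator
video-size: height and width of the generated video
video-length: number of frames in the generated video
infer-steps: number of diffusion steps
prompt: text prompt
seed: random seed
ulysses-degree: Ulysses parallel degree
ring-degree: ring parallel degree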
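
The settings described above can be checked before launching a long run. The following is a minimal sketch, not code from the project: the function names, the label strings, and the fallback to 0 when ALGO is unset are illustrative assumptions. It also assumes that the product of ulysses-degree and ring-degree should equal the number of launched processes, which holds for the example script (4 x 1 = 4) but is not stated explicitly in this README.

```python
import os

# Labels paraphrase the ALGO description in this README; they are
# illustrative strings, not identifiers from the project source.
ALGO_LABELS = {
    0: "default FA operator",
    1: "high-performance FA operator",
    2: "LA operator",
    3: "rainfusion operator",
}


def resolve_algo(env=None):
    """Return (value, label) for the ALGO environment variable.

    Assumption: an unset ALGO is treated as 0 (the default FA operator).
    """
    env = os.environ if env is None else env
    raw = env.get("ALGO", "0")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"ALGO must be an integer, got {raw!r}") from None
    if value not in ALGO_LABELS:
        raise ValueError(f"ALGO must be one of {sorted(ALGO_LABELS)}, got {value}")
    return value, ALGO_LABELS[value]


def check_parallel_degrees(nproc_per_node, ulysses_degree, ring_degree):
    """Check that the two sequence-parallel degrees cover all processes.

    Assumption: ulysses_degree * ring_degree == nproc_per_node, as in the
    example script (4 * 1 == 4).
    """
    if ulysses_degree < 1 or ring_degree < 1:
        raise ValueError("parallel degrees must be positive integers")
    product = ulysses_degree * ring_degree
    if product != nproc_per_node:
        raise ValueError(
            f"ulysses-degree * ring-degree = {product}, "
            f"but {nproc_per_node} processes are launched"
        )
    return product


if __name__ == "__main__":
    value, label = resolve_algo()
    print(f"ALGO={value}: {label}")
    check_parallel_degrees(nproc_per_node=4, ulysses_degree=4, ring_degree=1)
    print("parallel degrees are consistent")
```

Running this after the `export` lines of the launch script reports which operator ALGO selects and fails early on an out-of-range value or mismatched degrees.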

Results

910B2, 4-card sequence parallelism, resolution 720*1280, 50 steps, 129 frames

sparsity	Time per inference
0.8	390s
0.7	403s
0.64	417s
0.5	443s
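
To make the trend in the table easier to read, the timings can be expressed relative to one another. The sketch below uses only the four measurements listed above. No dense (non-sparse) baseline is reported here, so the comparison is made against the sparsity 0.5 run, and the function name is an illustrative choice.

```python
# Measurements copied from the results table: sparsity -> seconds per inference.
timings = {0.8: 390, 0.7: 403, 0.64: 417, 0.5: 443}


def relative_to(timings, reference_sparsity):
    """Compare every run against the run at reference_sparsity."""
    ref = timings[reference_sparsity]
    return {
        sparsity: {
            "speedup": ref / seconds,
            "time_saved_pct": 100.0 * (ref - seconds) / ref,
        }
        for sparsity, seconds in timings.items()
    }


summary = relative_to(timings, reference_sparsity=0.5)
for sparsity in sorted(summary, reverse=True):
    row = summary[sparsity]
    print(
        f"sparsity {sparsity}: {row['speedup']:.3f}x, "
        f"{row['time_saved_pct']:.1f}% less time than sparsity 0.5"
    )
```

Because rainfusion is lossy, these relative timings only describe speed; output quality at each sparsity level has to be judged separately.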
