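Wan2.2-Fun-Reward-LoRAs is a collection of reward LoRA weights for video generation released by the Alibaba PAI team. The LoRAs are trained with Reward Backpropagation, using human-preference reward models such as HPS v2.1 and MPS to improve the quality of videos generated by Wan2.2-Fun.

This repository adapts and verifies the model on Huawei Ascend 910 NPUs. It confirms that the LoRA weights load and merge correctly on the NPU and that numerical precision is consistent with CPU/GPU (error < 1%).

The model consists of 4 LoRA weight files: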
| Name | Base model | Reward model | Description |
|---|---|---|---|
| Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors | Wan2.2-Fun-A14B-InP (high noise) | HPS v2.1 | HPS v2.1 reward LoRA for the high-noise model; rank=128, alpha=64, trained for 5000 steps with batch_size=8 |
| Wan2.2-Fun-A14B-InP-high-noise-MPS.safetensors | Wan2.2-Fun-A14B-InP (high noise) | MPS | MPS reward LoRA for the high-noise model; rank=128, alpha=64, trained for 5000 steps with batch_size=8 |
| Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors | Wan2.2-Fun-A14B-InP (low noise) | HPS v2.1 | HPS v2.1 reward LoRA for the low-noise model; rank=128, alpha=64, trained for 2700 steps with batch_size=8 |
| Wan2.2-Fun-A14B-InP-low-noise-MPS.safetensors | Wan2.2-Fun-A14B-InP (low noise) | MPS | MPS reward LoRA for the low-noise model; rank=128, alpha=64, trained for 4500 steps with batch_size=8 |
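Note: the official recommendation is to use the HPSv2.1 reward LoRA for the low-noise model, because the MPS LoRA converges more slowly on the low-noise model.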
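The table and the recommendation above can be captured in a small lookup. This is only an illustrative sketch: `LORA_FILES` and `pick_reward_lora` are hypothetical names, not part of this repository, and the default reward model simply follows the HPSv2.1 recommendation.

```python
# Hypothetical helper: maps (noise level, reward model) to the file names listed above.
LORA_FILES = {
    ("high", "HPS2.1"): "Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors",
    ("high", "MPS"): "Wan2.2-Fun-A14B-InP-high-noise-MPS.safetensors",
    ("low", "HPS2.1"): "Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors",
    ("low", "MPS"): "Wan2.2-Fun-A14B-InP-low-noise-MPS.safetensors",
}


def pick_reward_lora(noise_level: str, reward_model: str = "HPS2.1") -> str:
    """Return the LoRA file name for a noise level ("high" or "low").

    The default reward model is HPS2.1, following the note above.
    """
    key = (noise_level, reward_model)
    if key not in LORA_FILES:
        raise ValueError(f"No LoRA file for noise_level={noise_level!r}, reward_model={reward_model!r}")
    return LORA_FILES[key]


print(pick_reward_lora("low"))
print(pick_reward_lora("high", "MPS"))
```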
| Component | Version/Model |
|---|---|
| NPU | Ascend 910 (2x) |
| PyTorch | 2.9.0+cpu |
| torch_npu | 2.9.0.post1+gitee7ba04 |
| CANN | 8.5.1 |
| transformers | 4.57.6 |
| diffusers | 0.38.0 |
| peft | 0.19.1 |
| safetensors | 0.8.0rc0 |
| Python | 3.11.14 |
# Install dependencies
pip install diffusers peft accelerate safetensors -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Install git-lfs for downloading large files
# See https://git-lfs.github.com/
# Set NPU environment variables
export TASK_QUEUE_ENABLE=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

Download the LoRA weights and base model from the following addresses:
======================================================================
Wan2.2-Fun-Reward-LoRAs NPU Verification
======================================================================
--- Environment ---
PyTorch: 2.9.0+cpu
torch_npu: 2.9.0.post1+gitee7ba04
NPU available: True
NPU count: 2
NPU 0 name: Ascend910_9362
============================================================
Verifying: Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors
============================================================
[1/5] File Integrity Check...
PASS: File size: 818.7 MB
[2/5] Loading on CPU...
Loaded 978 tensors in 0.030s
[3/5] Structure Verification...
Layers: 978
Total tensors: 978
Rank: 128, Alpha: 64
PASS
[4/5] CPU-NPU Transfer Precision...
Tensors compared: 978
Max diff: 0.00000000
Mean diff: 0.00000000
Precision match (<0.001): PASS
[5/5] LoRA Merge Test on NPU...
Test layer: lora_unet__blocks_0_cross_attn_k
Merge max diff: 0.00000001
Merge mean diff: 0.00000000
Precision pass (<0.01): PASS
Weight Statistics:
Total parameters: 429,171,014
lora_up mean: -0.000000, std: 0.001245
lora_down mean: 0.000000, std: 0.008388
============================================================
Verifying: Wan2.2-Fun-A14B-InP-high-noise-MPS.safetensors
============================================================
[1/5] File Integrity Check... PASS
[2/5] Loading on CPU... PASS (978 tensors)
[3/5] Structure Verification... PASS (rank=128, alpha=64)
[4/5] CPU-NPU Transfer... PASS (max_diff=0.00000000)
[5/5] LoRA Merge on NPU... PASS (max_diff=0.00000001)
============================================================
Verifying: Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors
============================================================
[1/5] File Integrity Check... PASS
[2/5] Loading on CPU... PASS (978 tensors)
[3/5] Structure Verification... PASS (rank=128, alpha=64)
[4/5] CPU-NPU Transfer... PASS (max_diff=0.00000000)
[5/5] LoRA Merge on NPU... PASS (max_diff=0.00000000)
============================================================
Verifying: Wan2.2-Fun-A14B-InP-low-noise-MPS.safetensors
============================================================
[1/5] File Integrity Check... PASS
[2/5] Loading on CPU... PASS (978 tensors)
[3/5] Structure Verification... PASS (rank=128, alpha=64)
[4/5] CPU-NPU Transfer... PASS (max_diff=0.00000000)
[5/5] LoRA Merge on NPU... PASS (max_diff=0.00000003)

| LoRA file | Tensors compared | Max transfer error | Mean transfer error | Max merge error | Conclusion |
|---|---|---|---|---|---|
| high-noise-HPS2.1 | 978 | 0.00000000 | 0.00000000 | 0.00000001 | Precision matches |
| high-noise-MPS | 978 | 0.00000000 | 0.00000000 | 0.00000001 | Precision matches |
| low-noise-HPS2.1 | 978 | 0.00000000 | 0.00000000 | 0.00000000 | Precision matches |
| low-noise-MPS | 978 | 0.00000000 | 0.00000000 | 0.00000003 | Precision matches |
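The "max" and "mean" error columns can be read as element-wise absolute differences between a reference tensor and its counterpart after a device round trip. The following is a minimal pure-Python sketch of that metric. It assumes the verification script computes absolute differences; the function name `diff_stats` and the toy values are illustrative only, and the 0.001 threshold is the one shown in the log.

```python
from typing import Sequence


def diff_stats(reference: Sequence[float], candidate: Sequence[float]) -> tuple[float, float]:
    """Return (max, mean) of the element-wise absolute differences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 0.0, 0.0
    diffs = [abs(a - b) for a, b in zip(reference, candidate)]
    return max(diffs), sum(diffs) / len(diffs)


# Toy stand-ins for a tensor on the CPU and the same tensor copied to a device and back.
cpu_values = [0.5, -0.25, 0.125]
roundtrip_values = [0.5, -0.25, 0.125]

max_diff, mean_diff = diff_stats(cpu_values, roundtrip_values)
print(f"Max diff: {max_diff:.8f}")
print(f"Mean diff: {mean_diff:.8f}")
print(f"Precision match (<0.001): {'PASS' if max_diff < 1e-3 else 'FAIL'}")
```

A plain host-to-device copy does not change the stored values, which is consistent with the zero transfer errors in the table.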
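Quantified error conclusions:

Note: the current verification environment has no GPU device, so a direct end-to-end NPU vs. GPU inference comparison is not possible. However, based on the following analysis:

torch.mm), bfloat16 matrix multiplication results on the NPU are consistent with the CPU

Therefore, when the NPU and GPU run inference with the same LoRA weights, the expected precision difference is < 0.01%, well below the 1% threshold requirement.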
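For reference, a LoRA merge of the kind exercised in step [5/5] can be sketched in pure Python. This assumes the common convention W' = W + multiplier × (alpha / rank) × (up · down). The source does not spell out the formula used by `verify_lora_npu.py`, so the scaling, the function names, and the toy matrices below are all assumptions.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def merge_lora(weight: Matrix, lora_up: Matrix, lora_down: Matrix,
               alpha: float, multiplier: float = 1.0) -> Matrix:
    """Return weight + multiplier * (alpha / rank) * (lora_up @ lora_down)."""
    rank = len(lora_down)  # lora_down has shape (rank x in_features)
    scale = multiplier * alpha / rank
    delta = matmul(lora_up, lora_down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example with rank 1 and alpha 0.5, i.e. the same alpha/rank ratio as 64/128.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_up = [[1.0], [2.0]]      # (out_features x rank)
lora_down = [[0.5, -0.5]]     # (rank x in_features)

merged = merge_lora(weight, lora_up, lora_down, alpha=0.5, multiplier=0.5)
print(merged)
```

Under this convention, rank=128 and alpha=64 give a base scale of 0.5, which the user-facing LoRA weight multiplier then scales further.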
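A web search found no public GPU precision benchmarks for Wan2.2-Fun-Reward-LoRAs. Because the model is a collection of LoRA weights, its "precision" is mainly reflected in:

The evaluations in the original paper and repository are qualitative comparisons of video quality (see the Demo in the original README), not quantitative precision metrics.

The full output of the verification script (Section 5.2 above) shows that all checks pass: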
Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors:
file_integrity: PASS
cpu_load: PASS
structure: PASS
npu_precision: PASS
npu_merge: PASS
weight_stats: PASS
Wan2.2-Fun-A14B-InP-high-noise-MPS.safetensors:
file_integrity: PASS / cpu_load: PASS / structure: PASS
npu_precision: PASS / npu_merge: PASS / weight_stats: PASS
Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors:
file_integrity: PASS / cpu_load: PASS / structure: PASS
npu_precision: PASS / npu_merge: PASS / weight_stats: PASS
Wan2.2-Fun-A14B-InP-low-noise-MPS.safetensors:
file_integrity: PASS / cpu_load: PASS / structure: PASS
npu_precision: PASS / npu_merge: PASS / weight_stats: PASS

The verification results are saved to npu_verification_results.json.
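A saved results file can be re-checked programmatically. The sketch below assumes the JSON maps each LoRA file name to a dictionary of check names and "PASS"/"FAIL" strings, mirroring the summary above. The actual schema of `npu_verification_results.json` is not shown in this document, so treat the structure and the helper name `failed_checks` as assumptions.

```python
import json


def failed_checks(results: dict) -> list[str]:
    """List every check whose status is not "PASS" (assumed schema: file -> check -> status)."""
    failures = []
    for file_name, checks in results.items():
        for check_name, status in checks.items():
            if status != "PASS":
                failures.append(f"{file_name}: {check_name}={status}")
    return failures


# Inline sample in the assumed shape; for a real run, load the file instead:
#   with open("npu_verification_results.json", encoding="utf-8") as f:
#       results = json.load(f)
sample = json.loads("""
{
  "Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors": {
    "file_integrity": "PASS", "cpu_load": "PASS", "structure": "PASS",
    "npu_precision": "PASS", "npu_merge": "PASS", "weight_stats": "PASS"
  }
}
""")

failures = failed_checks(sample)
print("all checks passed" if not failures else "\n".join(failures))
```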
.
├── README.md                          # This document
├── predict_t2v_npu.py                 # NPU-adapted inference script (full pipeline)
├── verify_lora_npu.py                 # NPU LoRA weight verification script
├── npu_verification_results.json      # Verification result data
├── config/
│   └── wan2.2/
│       └── wan_civitai_i2v.yaml       # Model configuration file
└── LICENSE.txt

# Install dependencies
pip install diffusers peft accelerate safetensors omegaconf -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Install VideoX-Fun (provides the model components and pipeline)
pip install videox_fun -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Or install from source:
# git clone https://github.com/aigc-apps/VideoX-Fun.git
# cd VideoX-Fun && pip install -e .
# Set NPU environment variables
export TASK_QUEUE_ENABLE=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# Download the base model (about 28 GB, from ModelScope)
# modelscope download --model PAI/Wan2.2-Fun-A14B-InP --local_dir models/Diffusion_Transformer/Wan2.2-Fun-A14B-InP
# Download the LoRA weights (from HuggingFace or the gitcode mirror)
# git lfs install
# git clone https://huggingface.co/alibaba-pai/Wan2.2-Fun-Reward-LoRAs

python predict_t2v_npu.py

Key configuration options in the script (edit them at the top of the script):
# Model path
model_name = "models/Diffusion_Transformer/Wan2.2-Fun-A14B-InP"
# Reward LoRA weight paths
lora_path = "path/to/Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors"  # low-noise LoRA
lora_high_path = "path/to/Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors"  # high-noise LoRA
lora_weight = 0.55  # LoRA weight multiplier (0.5 to 0.55 recommended)
lora_high_weight = 0.55
# Generation parameters
prompt = "A brown dog shaking its head..."
sample_size = [480, 832]
video_length = 81
num_inference_steps = 50
guidance_scale = 6.0

Key adaptation points of predict_t2v_npu.py relative to the original VideoX-Fun predict_t2v.py:
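| Adaptation | Original (GPU) | NPU adaptation |
|---|---|---|
| Device initialization | torch.cuda / set_multi_gpus_devices | Adds a get_npu_device() function that prefers the NPU and automatically falls back to CUDA/CPU |
| Environment variables | None | TASK_QUEUE_ENABLE=1, PYTORCH_NPU_ALLOC_CONF=expandable_segments:True |
| Random number generator | torch.Generator(device=device) | Adapts to the NPU device string and adds a fallback to CPU |
| LoRA merge/unmerge | from videox_fun.utils.lora_utils import merge_lora | Inlined implementation, avoiding import problems in NPU-only environments |
| Memory mode | Supports GPU offload modes | Same semantics, with device pointing to the NPU |
| Device detection | torch.cuda.is_available() | torch.npu.is_available() takes priority |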
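The device-selection and environment-variable rows above can be illustrated with a small sketch. This is not the repository's `get_npu_device()`: the name `pick_device`, the returned device strings, and the broad exception handling are assumptions made here so the snippet also runs on machines without torch or torch_npu.

```python
import importlib.util
import os

# Set the NPU-related variables before torch_npu is imported (values taken from this README).
os.environ.setdefault("TASK_QUEUE_ENABLE", "1")
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")


def pick_device() -> str:
    """Prefer an Ascend NPU, then CUDA, then fall back to the CPU."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch is not installed at all
    import torch

    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (importing it registers torch.npu)
            if hasattr(torch, "npu") and torch.npu.is_available():
                return "npu:0"
        except Exception:
            pass  # e.g. the CANN runtime is missing; fall through to CUDA/CPU
    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Using `setdefault` leaves any value already exported in the shell untouched, so the script and the `export` lines shown earlier do not conflict.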
torch_npu and the CANN driver; see the official Huawei Ascend documentation