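Customized video generation aims to produce videos that feature specific subjects under flexible, user-defined conditions. Existing methods, however, often fall short in subject consistency and in the range of input modalities they accept. This paper presents HunyuanCustom, a multimodal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built on HunyuanVideo, HunyuanCustom is first optimized for image-text conditioned generation: it introduces a LLaVA-based text-image fusion module to strengthen multimodal understanding, and an image ID enhancement module that uses temporal concatenation to reinforce subject features across frames. To support audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module achieves hierarchical alignment through spatial cross-attention, and a video-driven injection module integrates the latent-compressed conditioning video through a patchify-based feature alignment network. Extensive experiments in single- and multi-subject scenarios show that HunyuanCustom significantly outperforms state-of-the-art open-source and closed-source methods in ID consistency, realism, and text-video alignment. We also validate its robustness on downstream tasks, including audio-driven and video-driven customized video generation. These results indicate that multimodal conditioning and subject-preservation strategies are effective in advancing controllable video generation.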
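As a rough illustration of the temporal concatenation mentioned above, the following sketch tracks tensor shapes only. The compression factors (4x in time, 8x in space), the 16 latent channels, and the 1x2x2 patch size are assumptions based on public descriptions of HunyuanVideo and are not stated in this README; `latent_shape`, `concat_identity_latent`, and `token_count` are hypothetical helper names, not part of the released code.

```python
# Shape-only sketch. Every constant below is an assumption, not a value from this README.
TIME_FACTOR = 4       # assumed temporal compression of the video VAE
SPACE_FACTOR = 8      # assumed spatial compression of the video VAE
LATENT_CHANNELS = 16  # assumed number of latent channels
PATCH = (1, 2, 2)     # assumed (time, height, width) patch size


def latent_shape(frames, height, width):
    """Return (channels, latent_frames, latent_h, latent_w) for a video."""
    latent_frames = (frames - 1) // TIME_FACTOR + 1
    return (LATENT_CHANNELS, latent_frames,
            height // SPACE_FACTOR, width // SPACE_FACTOR)


def concat_identity_latent(video_shape):
    """Add one identity-image latent frame along the time axis."""
    channels, frames, height, width = video_shape
    return (channels, frames + 1, height, width)


def token_count(shape):
    """Number of tokens after patchifying a latent of the given shape."""
    _, frames, height, width = shape
    pt, ph, pw = PATCH
    return (frames // pt) * (height // ph) * (width // pw)


video = latent_shape(frames=129, height=720, width=1280)
with_identity = concat_identity_latent(video)
print(video, with_identity, token_count(with_identity))
```

Under these assumptions, the identity image costs one extra latent frame, so the subject features are visible to attention at every video frame without changing the spatial layout.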

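We propose HunyuanCustom, a multimodal, conditionally controllable generation model centered on subject consistency and built on the HunyuanVideo generation framework. It generates subject-consistent videos conditioned on text, image, audio, and video inputs.
Specifically, it accepts one or more images to generate customized videos of a single subject or of multiple subjects.
It can also take an additional audio input and drive the subject to speak the corresponding audio.
Finally, it accepts a video input and replaces a specified object in the video with the subject from a given image.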

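HunyuanCustom's multimodal capabilities enable many downstream tasks.
For example, with multiple input images it can produce virtual human advertisements and virtual try-on videos.
With an image and an audio clip, it can create a singing avatar.
With an image and a video, it supports video editing by replacing the subject in the video with the subject from the image.
More applications await your exploration!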
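Each input combination above corresponds to a checkpoint and a few flags in the inference commands given later in this README. The sketch below collects that mapping in one place. `select_task` and `build_args` are hypothetical helpers, not part of the repository; the flag names and checkpoint paths are copied from the README commands, and the example prompt is made up for illustration.

```python
import shlex

# Checkpoint sub-paths copied from the inference commands in this README.
CHECKPOINTS = {
    "subject": "hunyuancustom_720P/mp_rank_00_model_states.pt",
    "video": "hunyuancustom_editing_720P/mp_rank_00_model_states.pt",
    "audio": "hunyuancustom_audio_720P/mp_rank_00_model_states.pt",
}


def select_task(audio=None, video=None):
    """Pick a task name from the optional inputs (hypothetical helper)."""
    if audio and video:
        # Assumption: the README shows no command that combines both conditions.
        raise ValueError("audio and video together are not covered here")
    if audio:
        return "audio"
    if video:
        return "video"
    return "subject"


def build_args(ref_image, prompt, model_base="./models",
               audio=None, video=None, mask=None):
    """Assemble the task-specific arguments for the sampling script."""
    task = select_task(audio, video)
    args = ["--ref-image", ref_image, "--pos-prompt", prompt,
            "--ckpt", f"{model_base}/{CHECKPOINTS[task]}"]
    if task == "audio":
        args += ["--input-audio", audio, "--audio-condition"]
    elif task == "video":
        if mask is None:
            raise ValueError("the video-driven command also passes --mask-video")
        args += ["--input-video", video, "--mask-video", mask,
                 "--video-condition"]
    return args


# Illustrative call; the prompt text is invented for this example.
print(shlex.join(build_args("./assets/images/seg_man_01.png",
                            "Realistic, High-quality. A man is talking.",
                            audio="./assets/audios/milk_man.mp3")))
```

The remaining options (seed, steps, video size, save path) are left out on purpose, since they vary per command in the README.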

To evaluate HunyuanCustom, we compared it with current mainstream video customization methods, including VACE, Skyreels, Pika, Vidu, Keling, and Hailuo. The comparison focuses on face/subject consistency, video-text alignment, and overall video quality.

| Model | Face-Sim | CLIP-B-T | DINO-Sim | Temp-Consis | DD |
|---|---|---|---|---|---|
| VACE-1.3B | 0.204 | 0.308 | 0.569 | 0.967 | 0.53 |
| Skyreels | 0.402 | 0.295 | 0.579 | 0.942 | 0.72 |
| Pika | 0.363 | 0.305 | 0.485 | 0.928 | 0.89 |
| Vidu2.0 | 0.424 | 0.300 | 0.537 | 0.961 | 0.43 |
| Keling1.6 | 0.505 | 0.285 | 0.580 | 0.914 | 0.78 |
| Hailuo | 0.526 | 0.314 | 0.433 | 0.937 | 0.94 |
| HunyuanCustom (ours) | 0.627 | 0.306 | 0.593 | 0.958 | 0.71 |
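To make the table easier to read, the following sketch ranks the models per metric. It assumes that a higher value is better in every column, which this README does not state explicitly; `best_per_metric` and `rank_of` are hypothetical helper names.

```python
# Values copied from the comparison table above.
METRICS = ("Face-Sim", "CLIP-B-T", "DINO-Sim", "Temp-Consis", "DD")
ROWS = {
    "VACE-1.3B": (0.204, 0.308, 0.569, 0.967, 0.53),
    "Skyreels": (0.402, 0.295, 0.579, 0.942, 0.72),
    "Pika": (0.363, 0.305, 0.485, 0.928, 0.89),
    "Vidu2.0": (0.424, 0.300, 0.537, 0.961, 0.43),
    "Keling1.6": (0.505, 0.285, 0.580, 0.914, 0.78),
    "Hailuo": (0.526, 0.314, 0.433, 0.937, 0.94),
    "HunyuanCustom": (0.627, 0.306, 0.593, 0.958, 0.71),
}


def best_per_metric(rows):
    """Map each metric to the model with the highest value (assumed best)."""
    return {metric: max(rows, key=lambda model: rows[model][i])
            for i, metric in enumerate(METRICS)}


def rank_of(rows, model, metric):
    """1-based rank of a model on one metric, highest value first."""
    i = METRICS.index(metric)
    ordered = sorted(rows, key=lambda name: rows[name][i], reverse=True)
    return ordered.index(model) + 1


print(best_per_metric(ROWS))
print({m: rank_of(ROWS, "HunyuanCustom", m) for m in METRICS})
```

Under that assumption, HunyuanCustom ranks first on Face-Sim and DINO-Sim, while other models lead the CLIP-B-T, Temp-Consis, and DD columns.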

The following table lists the requirements for generating videos with HunyuanCustom (batch size = 1):

| Model | Setting (height/width/frames) | GPU peak memory |
|---|---|---|
| HunyuanCustom | 720px1280px129f | 80GB |
| HunyuanCustom | 512px896px129f | 60GB |
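A small sketch of how the table could be used to choose a resolution for a given GPU. `pick_setting` and `video_size_args` are hypothetical helpers; the `--video-size` and `--sample-n-frames` flags are the ones used in the commands in this README. The table covers only the two listed settings, so the FP8 and CPU-offload commands shown later are outside what this sketch models.

```python
# (height, width, frames, peak_memory_gb) copied from the table above.
SETTINGS = [
    (720, 1280, 129, 80),
    (512, 896, 129, 60),
]


def pick_setting(available_gb):
    """Return the largest listed setting whose peak memory fits, or None."""
    fitting = [s for s in SETTINGS if s[3] <= available_gb]
    return max(fitting, key=lambda s: s[0] * s[1], default=None)


def video_size_args(setting):
    """Format a setting as command-line arguments for the sampling script."""
    height, width, frames, _ = setting
    return ["--video-size", str(height), str(width),
            "--sample-n-frames", str(frames)]


for memory_gb in (80, 64, 24):
    chosen = pick_setting(memory_gb)
    print(memory_gb, video_size_args(chosen) if chosen else "no listed setting fits")
```

The available memory is passed in as a plain number here; with PyTorch installed it could be read from `torch.cuda.get_device_properties(0).total_memory`, which reports bytes.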
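
First, clone the repository:

git clone https://github.com/Tencent/HunyuanCustom.git
cd HunyuanCustom

For a manual installation, CUDA 12.4 or 11.8 is recommended.
Conda installation instructions are available in the Conda documentation.
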
# 1. Create conda environment
conda create -n HunyuanCustom python==3.10.9
# 2. Activate the environment
conda activate HunyuanCustom
# 3. Install PyTorch and other dependencies using conda
# For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
# 4. Install pip dependencies
python -m pip install -r requirements.txt
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

If you encounter a floating point exception (core dump) on specific GPU types, you can try the following solutions:

# Option 1: Make sure you have installed CUDA 12.4, CUBLAS>=12.4.5.8, and CUDNN>=9.00 (or simply use our CUDA 12 docker image).
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
# Option 2: Explicitly use the CUDA 11.8 build of PyTorch and reinstall all other packages
pip uninstall -r requirements.txt # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install ninja
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Alternatively, you can use the HunyuanVideo Docker image. Use the following commands to pull and run it.

# For CUDA 12.4 (updated to avoid the floating point exception)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

For details on downloading the pretrained models, see the model download instructions.
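After installation, a quick check of the pinned versions can catch a mismatched environment before a long inference run. The sketch below uses only the standard library. The pins are the ones that appear in the install commands above; `installed_version` and `check_pins` are hypothetical helper names, and ignoring local build tags such as `+cu118` is a choice made for this sketch.

```python
from importlib import metadata

# Version pins copied from the install commands in this README.
PINS = {
    "torch": "2.4.0",
    "torchvision": "0.19.0",
    "torchaudio": "2.4.0",
    "gradio": "3.39.0",
    "diffusers": "0.33.0",
    "transformers": "4.41.2",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: found_version} for every package that misses its pin."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Local build tags such as "+cu118" are ignored in the comparison.
        if found is None or found.split("+")[0] != wanted:
            problems[name] = found
    return problems


mismatches = check_pins(PINS)
print("all pins satisfied" if not mismatches else mismatches)
```

Passing a custom `lookup` makes the check easy to exercise without touching the real environment.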

For example, to generate a video with 8 GPUs, run the following commands:

# Single-Subject Video Customization
cd HunyuanCustom
export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
--ref-image './assets/images/seg_woman_01.png' \
--pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
--video-size 720 1280 \
--seed 1024 \
--sample-n-frames 129 \
--infer-steps 30 \
--flow-shift-eval-video 13.0 \
--save-path './results/sp_720p'

# Video-Driven Video Customization
cd HunyuanCustom
export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
--ref-image './assets/images/sed_red_panda.png' \
--input-video './assets/input_videos/001_bg.mp4' \
--mask-video './assets/input_videos/001_mask.mp4' \
--expand-scale 5 \
--video-condition \
--pos-prompt "Realistic, High-quality. A red panda is walking on a stone road." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_editing_720P/mp_rank_00_model_states.pt" \
--seed 1024 \
--infer-steps 50 \
--flow-shift-eval-video 5.0 \
--save-path './results/sp_editing_720p'
# --pose-enhance # Enable for human videos to improve pose generation quality.

# Audio-Driven Video Customization
cd HunyuanCustom
export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
--ref-image './assets/images/seg_man_01.png' \
--input-audio './assets/audios/milk_man.mp3' \
--audio-strength 0.8 \
--audio-condition \
--pos-prompt "Realistic, High-quality. In the study, a man sits at a table featuring a bottle of milk while delivering a product presentation." \
--neg-prompt "Two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_audio_720P/mp_rank_00_model_states.pt" \
--seed 1026 \
--video-size 720 1280 \
--sample-n-frames 129 \
--cfg-scale 7.5 \
--infer-steps 30 \
--use-deepcache 1 \
--flow-shift-eval-video 13.0 \
--save-path './results/sp_audio_720p'

For example, to generate a video with a single GPU, run the following command:

cd HunyuanCustom
export MODEL_BASE="./models"
export DISABLE_SP=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
--ref-image './assets/images/seg_woman_01.png' \
--pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt" \
--video-size 512 896 \
--seed 1024 \
--sample-n-frames 129 \
--infer-steps 30 \
--flow-shift-eval-video 13.0 \
--save-path './results/1gpu_540p' \
--use-fp8

# Single GPU with CPU offload
cd HunyuanCustom
export MODEL_BASE="./models"
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
--ref-image './assets/images/seg_woman_01.png' \
--pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt" \
--video-size 720 1280 \
--seed 1024 \
--sample-n-frames 129 \
--infer-steps 30 \
--flow-shift-eval-video 13.0 \
--save-path './results/cpu_720p' \
--use-fp8 \
--cpu-offload

To start the Gradio demo, run one of the following:

cd HunyuanCustom
# Single-Subject Video Customization
bash ./scripts/run_gradio.sh
# Video-Driven Video Customization
bash ./scripts/run_gradio.sh --video
# Audio-Driven Video Customization
bash ./scripts/run_gradio.sh --audio

If you find HunyuanCustom useful for your research and applications, please cite it using the following BibTeX:

@misc{hu2025hunyuancustom,
title={HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation},
author={Teng Hu and Zhentao Yu and Zhengguang Zhou and Sen Liang and Yuan Zhou and Qin Lin and Qinglin Lu},
year={2025},
eprint={2505.04512},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.04512},
}

We thank the contributors of HunyuanVideo, HunyuanVideo-Avatar, MimicMotion, SD3, FLUX, Llama, LLaVA, Xtuner, diffusers, and HuggingFace for their open research and exploration.