PrismAudio

ICLR 2026


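PrismAudio is the first framework to integrate reinforcement learning into video-to-audio (V2A) generation, and it comes with a dedicated Chain-of-Thought (CoT) planning mechanism. Building on ThinkSound's pioneering CoT-based V2A framework, PrismAudio decomposes single-step reasoning into four specialized CoT modules (semantic, temporal, aesthetic, and spatial). Each module has its own targeted reward function, which enables multi-dimensional reinforcement learning optimization that improves reasoning across all perceptual dimensions at once.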

Quick Start

For full training and inference details, see the prismaudio branch of ThinkSound.

git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

conda create -n prismaudio python=3.10
conda activate prismaudio
chmod +x scripts/PrismAudio/setup/build_env.sh
./scripts/PrismAudio/setup/build_env.sh

# Download pretrained weights to ckpts/
# From Hugging Face:  https://huggingface.co/FunAudioLLM/PrismAudio
# From ModelScope:    https://www.modelscope.cn/models/iic/PrismAudio
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts
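
If Git LFS is not set up correctly, the clone above can leave small text pointer files in ckpts/ instead of the real weights. The following is a minimal sketch of a check for that case. The helper name `find_lfs_pointers` is hypothetical and not part of the PrismAudio repository; it only relies on the standard Git LFS pointer header and on `find`, `head`, and `grep` being available.

```shell
# Hypothetical helper (not part of the PrismAudio repository):
# print every file under a directory that is still a Git LFS pointer stub.
find_lfs_pointers() {
  dir="${1:-ckpts}"   # directory to scan; defaults to ckpts
  # Skip Git's own metadata, then inspect the first bytes of each file.
  find "$dir" -type f ! -path '*/.git/*' -exec sh -c '
    for f in "$@"; do
      # An LFS pointer file starts with this version line.
      if head -c 64 "$f" | LC_ALL=C grep -q "^version https://git-lfs.github.com/spec/v1"; then
        printf "%s\n" "$f"
      fi
    done
  ' sh {} +
}
```

Running `find_lfs_pointers ckpts` after the clone should print nothing. If it lists files, running `git lfs pull` inside ckpts normally fetches the actual weights.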

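License

This project is released under the MIT License.

Note: The code, model weights, and datasets are intended for research and educational purposes only. Commercial use is not permitted without explicit authorization from the authors.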


Citation

If you find PrismAudio useful in your research, please consider citing our papers:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, 
      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
      year={2025},
      eprint={2506.21448},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.21448}, 
}

@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
        title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation}, 
        author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
        year={2025},
        eprint={2511.18833},
        archivePrefix={arXiv},
        primaryClass={cs.SD},
        url={https://arxiv.org/abs/2511.18833}, 
}

Contact

If you have any questions or suggestions, feel free to open an issue.