PrismAudio

ICLR 2026

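PrismAudio is the first framework to integrate reinforcement learning into video-to-audio (V2A) generation, and it comes with a dedicated chain-of-thought (CoT) planning mechanism. Building on ThinkSound's pioneering CoT-based V2A framework, PrismAudio decomposes the single-pass reasoning into four specialized CoT modules (semantic, temporal, aesthetic, and spatial), each with its own targeted reward function. This enables multi-dimensional reinforcement learning optimization that improves reasoning across all perceptual dimensions at once.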

Quick Start

For complete training and inference details, see the prismaudio branch of ThinkSound.

git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

conda create -n prismaudio python=3.10
conda activate prismaudio
chmod +x scripts/PrismAudio/setup/build_env.sh
./scripts/PrismAudio/setup/build_env.sh

# Download pretrained weights to ckpts/
# From Hugging Face:  https://huggingface.co/FunAudioLLM/PrismAudio
# From ModelScope:    https://www.modelscope.cn/models/iic/PrismAudio
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts
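
If `git lfs` was not active when the weights were cloned, large files in ckpts/ can be left as small text stubs that begin with the line `version https://git-lfs.github.com/spec/v1`. The following is a minimal sketch of a check for that case. The function name `find_lfs_pointers` is hypothetical and is not part of the ThinkSound repository.

```shell
# Sketch: list files under a directory that are still Git LFS pointer stubs.
# `find_lfs_pointers` is a hypothetical helper, not something the repository ships.
find_lfs_pointers() {
    dir="${1:-ckpts}"
    # Pointer stubs are tiny text files, so only inspect files of at most 1 KiB
    # and skip the repository's own .git metadata.
    find "$dir" -type f -size -2k ! -path '*/.git/*' -exec sh -c '
        for f do
            # A stub starts with the Git LFS spec line; real weights do not.
            if head -c 64 "$f" | grep -q "^version https://git-lfs.github.com/spec/v1"; then
                printf "%s\n" "$f"
            fi
        done' sh {} +
}
```

Running `find_lfs_pointers ckpts` from the ThinkSound directory should print nothing when every weight file was fetched. If it lists any paths, running `git lfs pull` inside ckpts/ should download the missing files.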

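License

This project is released under the MIT License.

Note: The code, model weights, and datasets are for research and educational purposes only. Commercial use is not permitted without explicit authorization from the authors.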

Citation

If you find PrismAudio useful in your research, please consider citing our papers:

@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, 
      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
      year={2025},
      eprint={2506.21448},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.21448}, 
}

@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
        title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation}, 
        author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
        year={2025},
        eprint={2511.18833},
        archivePrefix={arXiv},
        primaryClass={cs.SD},
        url={https://arxiv.org/abs/2511.18833}, 
}

Contact

If you have any questions or suggestions, feel free to open an issue.
