DeepSeek-Prover-V1.5：利用证明助手反馈进行强化学习和蒙特卡洛树搜索

1. 引言

我们介绍了DeepSeek-Prover-V1.5，这是一个开源的语言模型，专为Lean 4的定理证明设计，通过优化训练和推理过程来增强DeepSeek-Prover-V1。该模型在DeepSeekMath-Base上进行预训练，专注于形式数学语言，然后使用从DeepSeek-Prover-V1派生的增强形式定理证明数据集进行监督微调。通过从证明助手反馈中学习强化学习（RLPAF），模型得到进一步改进。除了DeepSeek-Prover-V1的单次完整证明生成方法外，我们还提出了RMaxTS，这是一种蒙特卡洛树搜索的变体，采用内在奖励驱动的探索策略来生成多样化的证明路径。DeepSeek-Prover-V1.5在高中级别miniF2F基准测试集（63.5%）和本科级别ProofNet基准测试集（25.3%）上取得了新的最先进结果。

2. 评估结果

	miniF2F-test	ProofNet
ReProver	26.5%	13.8%
GPT-f	36.6%	-
Hypertree Proof Search	41.0%	-
InternLM2-StepProver	54.5%	18.1%
DeepSeek-Prover-V1	50.0%	-
DeepSeek-Prover-V1.5-Base	42.2%	13.2%
DeepSeek-Prover-V1.5-SFT	57.4%	22.9%
DeepSeek-Prover-V1.5-RL	60.2%	22.6%
DeepSeek-Prover-V1.5-RL + RMaxTS	63.5%	25.3%

3. 模型下载

我们向公众发布了具有70亿参数的DeepSeek-Prover-V1.5，包括基础模型、SFT模型和RL模型。

模型	下载
DeepSeek-Prover-V1.5-Base	🤗 HuggingFace
DeepSeek-Prover-V1.5-SFT	🤗 HuggingFace
DeepSeek-Prover-V1.5-RL	🤗 HuggingFace

4. 许可

此代码库遵循MIT许可。DeepSeekMath模型的使用受模型许可的约束。DeepSeekMath支持商业用途。

查看LICENSE-CODE和LICENSE-MODEL了解更多详情。

5. 引用

@article{xin2024deepseekproverv15harnessingproofassistant,
      title={DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search}, 
      author={Huajian Xin and Z. Z. Ren and Junxiao Song and Zhihong Shao and Wanjia Zhao and Haocheng Wang and Bo Liu and Liyue Zhang and Xuan Lu and Qiushi Du and Wenjun Gao and Qihao Zhu and Dejian Yang and Zhibin Gou and Z. F. Wu and Fuli Luo and Chong Ruan},
      year={2024},
      eprint={2408.08152},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.08152}, 
}

6. 联系我们

如果您有任何疑问，请创建一个反馈问题或通过 service@deepseek.com 与我们联系。

1. 引言

2. 评估结果

	miniF2F-test	ProofNet
ReProver	26.5%	13.8%
GPT-f	36.6%	-
Hypertree Proof Search	41.0%	-
InternLM2-StepProver	54.5%	18.1%
DeepSeek-Prover-V1	50.0%	-
DeepSeek-Prover-V1.5-Base	42.2%	13.2%
DeepSeek-Prover-V1.5-SFT	57.4%	22.9%
DeepSeek-Prover-V1.5-RL	60.2%	22.6%
DeepSeek-Prover-V1.5-RL + RMaxTS	63.5%	25.3%

模型

下载

DeepSeek-Prover-V1.5-Base

🤗 HuggingFace

DeepSeek-Prover-V1.5-SFT

🤗 HuggingFace

DeepSeek-Prover-V1.5-RL

🤗 HuggingFace

5. 引用

@article{xin2024deepseekproverv15harnessingproofassistant,
      title={DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search}, 
      author={Huajian Xin and Z. Z. Ren and Junxiao Song and Zhihong Shao and Wanjia Zhao and Haocheng Wang and Bo Liu and Liyue Zhang and Xuan Lu and Qiushi Du and Wenjun Gao and Qihao Zhu and Dejian Yang and Zhibin Gou and Z. F. Wu and Fuli Luo and Chong Ruan},
      year={2024},
      eprint={2408.08152},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.08152}, 
}