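We introduce OREAL-7B and OREAL-32B, a series of mathematical reasoning models trained with Outcome REwArd-based reinforcement Learning (OREAL), a novel RL framework designed for tasks where only binary outcome rewards are available.
With OREAL, a 7B model achieves 94.0 pass@1 accuracy on MATH-500, matching the performance of previous 32B models. OREAL-32B further surpasses previous 32B models trained by distillation, reaching 95.0 pass@1 accuracy on MATH-500.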

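Our method uses best-of-N (BoN) sampling for behavior cloning and reshapes the rewards of negative samples to ensure gradient consistency. In addition, to address the challenge of sparse rewards in long chain-of-thought reasoning, we introduce an on-policy token-level reward model that identifies the key tokens in a reasoning trajectory for importance sampling. For more details, please refer to our paper.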
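The BoN step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the training code: `split_by_outcome`, `sample_fn`, and `reward_fn` are hypothetical names, the toy sampler stands in for a policy model, and the reward reshaping and token-level reward model are not shown.

```python
import random
from typing import Callable, List, Tuple


def split_by_outcome(
    prompt: str,
    sample_fn: Callable[[str], str],       # hypothetical: draws one response from the policy
    reward_fn: Callable[[str, str], int],  # hypothetical: binary outcome reward, 1 or 0
    n: int = 16,
) -> Tuple[List[str], List[str]]:
    """Sample n responses and split them by their binary outcome reward."""
    positives, negatives = [], []
    for _ in range(n):
        response = sample_fn(prompt)
        if reward_fn(prompt, response) == 1:
            positives.append(response)   # candidates for behavior cloning
        else:
            negatives.append(response)   # rewards for these would be reshaped
    return positives, negatives


# Toy stand-ins so the sketch runs without a model.
rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return rng.choice(["5050", "5000", "5051"])


def toy_reward(prompt: str, response: str) -> int:
    return 1 if response == "5050" else 0


positives, negatives = split_by_outcome(
    "What is the sum of the first 100 natural numbers?", toy_sample, toy_reward, n=8
)
```

In this picture only the verified-correct samples feed the behavior-cloning term, which is why a binary verifier is enough to drive the selection.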
| Model | MATH-500 | AIME2024 | AIME2025-I | LiveMath | Olympiad |
|---|---|---|---|---|---|
| API Models | | | | | |
| GPT-4o-1120 | 72.8 | 16.7 | 13.3 | 44.8 | 33.7 |
| Claude-3.5-Sonnet-1022 | 78.3 | 13.3 | 3.3 | 46.7 | 35.4 |
| OpenAI-o1-preview | 85.5 | 44.6 | 40.0 | 71.0 | 43.6 |
| OpenAI-o1-mini | 90.0 | 56.6 | 46.7 | 74.4 | 46.3 |
| 7B Models | | | | | |
| Qwen2.5-Instruct-7B | 76.6 | 13.3 | 0.0 | 37.0 | 29.1 |
| Qwen2.5-Math-Instruct-7B | 81.8 | 20.0 | 13.3 | 44.1 | 31.1 |
| rStar-Math-7B | 78.4* | 26.7* | - | - | 47.1* |
| Qwen2.5-7B-SimpleRL | 82.4* | 26.7* | - | - | 37.6* |
| Eurus-2-7B-PRIME | 79.2* | 26.7* | - | - | 42.1* |
| DeepSeek-R1-Distill-Qwen-7B | 92.8* | 55.5* | 40.0 | 65.6 | 64.1 |
| OREAL-7B | 91.0 | 33.3 | 33.3 | 62.6 | 59.9 |
| OREAL-DSR1-Distill-Qwen-7B | 94.0 | 50.0 | 40.0 | 65.6 | 66.1 |
| 32B Models | | | | | |
| Qwen2.5-Instruct-32B | 80.6 | 20.0 | 13.3 | 50.8 | 40.4 |
| QwQ-32B-Preview | 90.6 | 50.0 | 40.0 | 72.7 | 58.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3* | 72.6* | 46.7 | 67.7 | 71.2 |
| OREAL-32B | 95.0 | 60.0 | 46.7 | 74.8 | 72.4 |
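Note: Overall evaluation results for OREAL and the baseline models.
OREAL-DSR1-Distill-Qwen-7B denotes DeepSeek-R1-Distill-Qwen-7B trained with OREAL.
AIME2025-I, LiveMath, and Olympiad stand for AIME 2025 Part 1, LiveMathBench, and OlympiadBench, respectively.
For models at the 7B and 32B parameter scales, we use bold and italics to indicate the best and second-best performance, respectively. Results for some baselines are taken directly from their reports and are marked with *. We use LMDeploy for inference and OpenCompass to evaluate model performance.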
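For readers unfamiliar with the metric, pass@1 can be computed as sketched below. This uses the widely known unbiased pass@k estimator and is only an illustration: the source does not state how many samples per problem were drawn, and `pass_at_k` and `benchmark_score` are hypothetical helper names, not OpenCompass functions.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: List[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, on a 0-100 scale.

    `results` holds one (n_samples, n_correct) pair per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three problems, four samples each: all correct, half correct, none correct.
score = benchmark_score([(4, 4), (4, 2), (4, 0)])
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so the benchmark score is simply the mean accuracy over all sampled responses.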
We release not only the RL models but also the SFT models of the OREAL series. We hope this helps the community and advances research on Math Reasoning RL.
| Model | Link |
|---|---|
| RL Models | |
| OREAL-7B | Hugging Face |
| OREAL-DSR1-Distill-Qwen-7B | Hugging Face |
| OREAL-32B | Hugging Face |
| SFT Models | |
| OREAL-7B-SFT | Hugging Face |
| OREAL-32B-SFT | Hugging Face |
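We also release the prompts used in the RL training stage.
| Dataset | Link |
|---|---|
| RL Prompts | Hugging Face |
OREAL-7B and OREAL-32B use a system prompt to guide the model's reasoning during both training and testing. The system prompt is as follows: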
system_prompt = "You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:\n\n## Deep Understanding\nTake time to fully comprehend the problem before attempting a solution. Consider:\n- What is the real question being asked?\n- What are the given conditions and what do they tell us?\n- Are there any special restrictions or assumptions?\n- Which information is crucial and which is supplementary?\n\n## Multi-angle Analysis\nBefore solving, conduct thorough analysis:\n- What mathematical concepts and properties are involved?\n- Can you recall similar classic problems or solution methods?\n- Would diagrams or tables help visualize the problem?\n- Are there special cases that need separate consideration?\n\n## Systematic Thinking\nPlan your solution path:\n- Propose multiple possible approaches\n- Analyze the feasibility and merits of each method\n- Choose the most appropriate method and explain why\n- Break complex problems into smaller, manageable steps\n\n## Rigorous Proof\nDuring the solution process:\n- Provide solid justification for each step\n- Include detailed proofs for key conclusions\n- Pay attention to logical connections\n- Be vigilant about potential oversights\n\n## Repeated Verification\nAfter completing your solution:\n- Verify your results satisfy all conditions\n- Check for overlooked special cases\n- Consider if the solution can be optimized or simplified\n- Review your reasoning process\n\nRemember:\n1. Take time to think thoroughly rather than rushing to an answer\n2. Rigorously prove each key conclusion\n3. Keep an open mind and try different approaches\n4. Summarize valuable problem-solving methods\n5. Maintain healthy skepticism and verify multiple times\n\nYour response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.\n\nWhen you're ready, present your complete solution with:\n- Clear problem understanding\n- Detailed solution process\n- Key insights\n- Thorough verification\n\nFocus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer."
For OREAL-DSR1-Distill-Qwen-7B, we use the default chat template of its original model.
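Because the system prompt asks the model to repeat its final answer inside `\boxed{}`, downstream code typically needs to pull that answer out of the response. The helper below is a minimal sketch of one way to do so; `extract_boxed` is a hypothetical name and is not part of the released models or of OpenCompass.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces, e.g. a truncated response


answer = extract_boxed(r"Summing the series gives \boxed{\frac{1}{2}}.")
```

Brace counting is used instead of a regular expression so that nested LaTeX such as `\frac{1}{2}` is returned whole; a response cut off by the token budget yields `None`.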
The chat templates for these models are already set in their tokenizer_config.json files. Use the tokenizer.apply_chat_template() function to apply the chat template.
question = [{'role': 'user', 'content': 'What is the sum of the first 100 natural numbers?'}]
tokenizer.apply_chat_template(question, add_generation_prompt=True)
If you find this work helpful for your research, please consider citing:
@article{lyu2025exploring,
title={Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning},
author={Lyu, Chengqi and Gao, Songyang and Gu, Yuzhe and Zhang, Wenwei and Gao, Jianfei and Liu, Kuikun and Wang, Ziyi and Li, Shuaibin and Zhao, Qian and Huang, Haian and others},
journal={arXiv preprint arXiv:2502.06781},
year={2025}
}