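OLMo-2-0325-32B-Instruct: a model for chat, math problem solving, reasoning evaluation, and other tasks. It is the instruction-tuned version of the OLMo-2 32B model, trained with supervised fine-tuning on the Tülu 3 dataset followed by DPO and RLVR, and its intermediate training checkpoints are openly available.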

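OLMo 2 32B Instruct March 2025 is a post-trained variant of the OLMo-2 32B March 2025 model. It underwent supervised fine-tuning on an OLMo-specific variant of the Tülu 3 dataset, followed by DPO training on this dataset and a final stage of RLVR training on this dataset. Tülu 3 is designed for state-of-the-art performance on a diverse set of tasks beyond chat, such as MATH, GSM8K, and IFEval. For more details, see the OLMo 2 paper or the Tülu 3 paper.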

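OLMo is a series of Open Language Models designed to enable the science of language models. These models are trained on the Dolma dataset. We are releasing all code, checkpoints, logs, and associated training details.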

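Model description

Model type: A model trained on a mix of publicly available, synthetic, and human-created datasets.
Language(s) (NLP): Primarily English
License: Apache 2.0
Finetuned from model: allenai/OLMo-2-0325-32B-DPO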

Model sources

Project page: https://allenai.org/olmo
Repositories:
- Core repo (training, inference, fine-tuning, etc.): https://github.com/allenai/OLMo-core
- Evaluation code: https://github.com/allenai/olmes
- Further fine-tuning code: https://github.com/allenai/open-instruct
Paper: https://arxiv.org/abs/2501.00656
Demo: https://playground.allenai.org/

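Installation

OLMo 2 will be supported in the next version of Transformers. Until then, install Transformers from the main branch with the following command: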

pip install --upgrade git+https://github.com/huggingface/transformers.git

Using the model

Loading with HuggingFace

To load the model with HuggingFace, use the following snippet:

from transformers import AutoModelForCausalLM

olmo_model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B-Instruct")

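Chat template

NOTE: Because of a minor configuration change, this template differs from those of previous OLMo 2 and Tülu 3 models. It does NOT have the bos token before the rest of the content. Our other models have <|endoftext|> at the beginning of the chat template.

The chat template for our models is formatted as: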

<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>

Or with new lines expanded:

<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>

It is also embedded in the tokenizer, for use with tokenizer.apply_chat_template.
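To make the layout above concrete, here is a minimal hand-rolled sketch that renders user and assistant turns in the same shape, without a leading bos token. The helper name `format_chat`, its `add_generation_prompt` argument, and the newline placed between a finished assistant turn and the next turn are assumptions for illustration; the template embedded in the tokenizer remains the authoritative version.

```python
EOS_TOKEN = "<|endoftext|>"  # end-of-turn marker shown in the example above


def format_chat(messages, add_generation_prompt=False):
    """Render user/assistant turns in the documented layout (hypothetical helper)."""
    parts = []
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        is_last = i == len(messages) - 1
        if role == "user":
            parts.append(f"<|user|>\n{content}\n")
        elif role == "assistant":
            # Assumption: a newline separates a finished assistant turn
            # from whatever follows it; the last turn ends at the EOS token.
            parts.append(f"<|assistant|>\n{content}{EOS_TOKEN}" + ("" if is_last else "\n"))
        else:
            raise ValueError(f"unsupported role in this sketch: {role!r}")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example = format_chat([
    {"role": "user", "content": "How are you doing?"},
    {
        "role": "assistant",
        "content": "I'm just a computer program, so I don't have feelings, "
                   "but I'm functioning as expected. How can I assist you today?",
    },
])
print(example)
```

Note that the rendered string starts directly with `<|user|>`, which is the difference from the other models called out in the note above.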

System prompt

In Ai2 demos, we use this system prompt by default:

You are OLMo 2, a helpful and harmless AI Assistant built by the Allen Institute for AI.

The model has not been trained with a specific system prompt in mind.

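Intermediate checkpoints

To facilitate research on RL fine-tuning, we have released the intermediate checkpoints from the model's RLVR training. The model weights are saved every 20 training steps and are available as revisions of the HuggingFace repository. For example, you can load one with: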

olmo_model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B-Instruct", revision="step_200")
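As a rough sketch of which revision names to expect, the snippet below enumerates one name per 20 steps. It assumes the `step_N` pattern from the `step_200` example holds for every saved step, and that the last saved step is 320 (the step reported as the final checkpoint in the learning curves section). Both are assumptions; the revision list of the HuggingFace repository is the authoritative source.

```python
SAVE_EVERY = 20   # "saved every 20 training steps"
FINAL_STEP = 320  # assumption: step 320 is the last saved step


def checkpoint_revisions(save_every=SAVE_EVERY, final_step=FINAL_STEP):
    """Return the assumed revision names, e.g. 'step_20', 'step_40', ... (hypothetical helper)."""
    return [f"step_{step}" for step in range(save_every, final_step + 1, save_every)]


revisions = checkpoint_revisions()
print(len(revisions), revisions[0], revisions[-1])
```

Any of these names could then be passed as the `revision` argument of `from_pretrained`, provided the corresponding revision actually exists in the repository.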

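Bias, risks, and limitations

The OLMo-2 models have limited safety training and, unlike ChatGPT, are not deployed with automatic in-the-loop filtering of responses, so the model can produce problematic outputs (especially when prompted to do so).

See the Falcon 180B model card for an example of this.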

Performance

Model	Average	AlpacaEval 2 LC	BBH	DROP	GSM8k	IFEval	MATH	MMLU	Safety	PopQA	TruthQA
Closed API models
GPT-3.5 Turbo 0125	59.6	38.7	66.6	70.2	74.3	66.9	41.2	70.2	69.1	45.0	62.9
GPT 4o Mini 2024-07-18	65.7	49.7	65.9	36.3	83.0	83.5	67.9	82.2	84.9	39.0	64.8
Open weights models
Mistral-Nemo-Instruct-2407	50.9	45.8	54.6	23.6	81.4	64.5	31.9	70.0	52.7	26.9	57.7
Ministral-8B-Instruct	52.1	31.4	56.2	56.2	80.0	56.4	40.0	68.5	56.2	20.2	55.5
Gemma-2-27b-it	61.3	49.0	72.7	67.5	80.7	63.2	35.1	70.7	75.9	33.9	64.6
Qwen2.5-32B	66.5	39.1	82.3	48.3	87.5	82.4	77.9	84.7	82.4	26.1	70.6
Mistral-Small-24B	67.6	43.2	80.1	78.5	87.2	77.3	65.9	83.7	66.5	24.4	68.1
Llama-3.1-70B	70.0	32.9	83.0	77.0	94.5	88.0	56.2	85.2	76.4	46.5	66.8
Llama-3.3-70B	73.0	36.5	85.8	78.0	93.6	90.8	71.8	85.9	70.4	48.2	66.1
Gemma-3-27b-it	-	63.4	83.7	69.2	91.1	-	-	81.8	-	30.9	-
Fully open models
OLMo-2-7B-1124-Instruct	55.7	31.0	48.5	58.9	85.2	75.6	31.3	63.9	81.2	24.6	56.3
OLMo-2-13B-1124-Instruct	61.4	37.5	58.4	72.1	87.4	80.4	39.7	68.6	77.5	28.8	63.9
OLMo-2-32B-0325-SFT	61.7	16.9	69.7	77.2	78.4	72.4	35.9	76.1	93.8	35.4	61.3
OLMo-2-32B-0325-DPO	68.8	44.1	70.2	77.5	85.7	83.8	46.8	78.0	91.9	36.4	73.5
OLMo-2-32B-0325-Instruct	68.8	42.8	70.6	78.0	87.6	85.6	49.7	77.3	85.9	37.5	73.2
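
The average column appears to be the unweighted mean of the ten benchmark scores in each row. The card does not state this, so treat it as an assumption; the sketch below only checks it against the three OLMo-2 32B rows, with values copied from the table.

```python
from statistics import mean

# (reported average, ten benchmark scores in table column order:
#  AlpacaEval 2 LC, BBH, DROP, GSM8k, IFEval, MATH, MMLU, Safety, PopQA, TruthQA)
rows = {
    "OLMo-2-32B-0325-SFT": (61.7, [16.9, 69.7, 77.2, 78.4, 72.4, 35.9, 76.1, 93.8, 35.4, 61.3]),
    "OLMo-2-32B-0325-DPO": (68.8, [44.1, 70.2, 77.5, 85.7, 83.8, 46.8, 78.0, 91.9, 36.4, 73.5]),
    "OLMo-2-32B-0325-Instruct": (68.8, [42.8, 70.6, 78.0, 87.6, 85.6, 49.7, 77.3, 85.9, 37.5, 73.2]),
}

# Recompute each average as a plain mean, rounded to one decimal like the table.
recomputed = {name: round(mean(scores), 1) for name, (_, scores) in rows.items()}

for name, (reported, _) in rows.items():
    print(f"{name}: reported {reported}, recomputed {recomputed[name]}")
```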

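Learning curves

Below are the training curves for allenai/OLMo-2-0325-32B-Instruct. The model was trained using 5 8xH100 nodes.

Below are the core evaluation scores for allenai/OLMo-2-0325-32B-Instruct over training steps (note that we took step 320 as the final checkpoint, which corresponds to episode 573,440):

Below are the other evaluation scores for allenai/OLMo-2-0325-32B-Instruct over training steps: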
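
The correspondence between step 320 and episode 573,440 noted above can be reproduced from flags in the reproduction command. The sketch below assumes that one training step consumes `local_rollout_batch_size` prompts per actor GPU, each sampled `number_samples_per_prompt` times. That reading is an inference from the numbers, not something the card states.

```python
# Values copied from the reproduction command in this card.
local_rollout_batch_size = 4
number_samples_per_prompt = 16
actor_num_gpus_per_node = [8, 8, 8, 4]

# Assumption: episodes per step = prompts per GPU x samples per prompt x actor GPUs.
actor_gpus = sum(actor_num_gpus_per_node)
episodes_per_step = local_rollout_batch_size * number_samples_per_prompt * actor_gpus

final_step = 320
final_episode = final_step * episodes_per_step
print(episodes_per_step, final_episode)
```

Under this assumption each step covers 1,792 episodes, and 320 steps give exactly the 573,440 episodes quoted for the final checkpoint.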

Reproduction command

The command below is copied directly from the tracked training job:

# clone and check out commit
git clone https://github.com/allenai/open-instruct.git
# this should be the correct commit, the main thing is to have the vllm monkey patch for
# 32b olmo https://github.com/allenai/open-instruct/blob/894ffa236319bc6c26c346240a7e4ee04ba0bd31/open_instruct/vllm_utils2.py#L37-L59
git checkout a51dc98525eec01de6e8a24c071f42dce407d738
uv sync
uv sync --extra compile

# note that you may need 5 8xH100 nodes for the training.
# so please setup ray properly, e.g., https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md#llama-31-tulu-3-70b-reproduction
python open_instruct/grpo_vllm_thread_ray_gtrl.py \
    --exp_name 0310_olmo2_32b_grpo_12818 \
    --beta 0.01 \
    --local_mini_batch_size 32 \
    --number_samples_per_prompt 16 \
    --output_dir output \
    --local_rollout_batch_size 4 \
    --kl_estimator kl3 \
    --learning_rate 5e-7 \
    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
    --dataset_mixer_list_splits train \
    --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \
    --dataset_mixer_eval_list_splits train \
    --max_token_length 2048 \
    --max_prompt_token_length 2048 \
    --response_length 2048 \
    --model_name_or_path allenai/OLMo-2-0325-32B-DPO \
    --non_stop_penalty \
    --stop_token eos \
    --temperature 1.0 \
    --ground_truths_key ground_truth \
    --chat_template_name tulu \
    --sft_messages_key messages \
    --eval_max_length 4096 \
    --total_episodes 10000000 \
    --penalty_reward_value 0.0 \
    --deepspeed_stage 3 \
    --no_gather_whole_model \
    --per_device_train_batch_size 2 \
    --local_rollout_forward_batch_size 2 \
    --actor_num_gpus_per_node 8 8 8 4 \
    --num_epochs 1 \
    --vllm_tensor_parallel_size 1 \
    --vllm_num_engines 12 \
    --lr_scheduler_type constant \
    --apply_verifiable_reward true \
    --seed 1 \
    --num_evals 30 \
    --save_freq 20 \
    --reward_model_multiplier 0.0 \
    --no_try_launch_beaker_eval_jobs \
    --try_launch_beaker_eval_jobs_on_weka \
    --gradient_checkpointing \
    --with_tracking
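
As a sanity check on the hardware note in the command's comments, the sketch below adds up the GPUs requested by the flags. It assumes the training actors and the vLLM engines occupy disjoint GPUs, which the command itself does not spell out.

```python
# Values copied from the command above.
actor_num_gpus_per_node = [8, 8, 8, 4]  # --actor_num_gpus_per_node
vllm_num_engines = 12                   # --vllm_num_engines
vllm_tensor_parallel_size = 1           # --vllm_tensor_parallel_size

# Hardware stated in the comments: 5 nodes with 8 H100s each.
nodes, gpus_per_node = 5, 8

training_gpus = sum(actor_num_gpus_per_node)
# Assumption: each vLLM engine uses tensor_parallel_size GPUs of its own.
inference_gpus = vllm_num_engines * vllm_tensor_parallel_size
available_gpus = nodes * gpus_per_node

print(training_gpus, inference_gpus, available_gpus)
```

Under that assumption, 28 training GPUs plus 12 inference GPUs account for all 40 GPUs of the five nodes.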

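License and use

OLMo 2 is licensed under the Apache 2.0 license. OLMo 2 is intended for research and educational use. For more information, please see our Responsible Use Guidelines. This model has been fine-tuned using a dataset mix that includes outputs generated by third-party models, and is therefore subject to additional terms: Gemma Terms of Use.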

Citation

@article{olmo20242olmo2furious,
      title={2 OLMo 2 Furious}, 
      author={Team OLMo and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
      year={2024},
      eprint={2501.00656},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00656}, 
}