Model Details

Model Card for OLMo 2 13B

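We introduce OLMo 2, a new family of 7B and 13B models trained on up to 5T tokens. These models are on par with or better than fully open models of equivalent size, and competitive with open-weight models from Meta and Mistral on English academic benchmarks.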

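OLMo is a series of Open Language Models designed to enable the science of language models. These models are trained on the Dolma dataset. We are releasing all code, checkpoints, logs (coming soon), and associated training details. The two model sizes are configured as follows: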

Size	Training Tokens	Layers	Hidden Size	Attention Heads	Context Length
OLMo 2 7B	4 Trillion	32	4096	32	4096
OLMo 2 13B	5 Trillion	40	5120	40	4096
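
As a quick sanity check on the table above, the per-head dimension can be derived from the hidden size and the number of attention heads. This is a minimal sketch that assumes the heads evenly partition the hidden size, as in standard multi-head attention. The `CONFIGS` dictionary and the `head_dim` helper are illustrative names, not part of any OLMo API.

```python
# Values copied from the size table; the names here are illustrative only.
CONFIGS = {
    "OLMo 2 7B": {"layers": 32, "hidden_size": 4096, "attention_heads": 32},
    "OLMo 2 13B": {"layers": 40, "hidden_size": 5120, "attention_heads": 40},
}


def head_dim(config):
    """Per-head width, assuming the heads evenly split the hidden size."""
    hidden, heads = config["hidden_size"], config["attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden size is not divisible by the number of heads")
    return hidden // heads


for name, config in CONFIGS.items():
    print(name, head_dim(config))
```

Under that assumption both sizes come out at 128 dimensions per head, so the 13B model scales by adding heads and layers rather than widening each head.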

The core models released in this batch include the following:

Stage	OLMo 2 7B	OLMo 2 13B
Base Model	allenai/OLMo-2-1124-7B	allenai/OLMo-2-1124-13B
Supervised Fine-Tuning (SFT)	allenai/OLMo-2-1124-7B-SFT	allenai/OLMo-2-1124-13B-SFT
Direct Preference Optimization (DPO)	allenai/OLMo-2-1124-7B-DPO	allenai/OLMo-2-1124-13B-DPO
Final Model (RLVR)	allenai/OLMo-2-1124-7B-Instruct	allenai/OLMo-2-1124-13B-Instruct
Reward Model (RM)	allenai/OLMo-2-1124-7B-RM	(Same as 7B)
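
The repository names in the table follow a regular pattern, which can be captured in a small lookup. The sketch below is a hypothetical helper (`repo_id` and `STAGE_SUFFIX` are names invented here, not an official API) that only reproduces the IDs listed above, including the shared reward model.

```python
# Hypothetical helper that rebuilds the repo IDs listed in the table above.
STAGE_SUFFIX = {
    "base": "",
    "sft": "-SFT",
    "dpo": "-DPO",
    "instruct": "-Instruct",
    "rm": "-RM",
}


def repo_id(size, stage):
    """Return the Hugging Face repo ID for a size ("7B" or "13B") and a stage."""
    if size not in ("7B", "13B"):
        raise ValueError(f"unknown size: {size}")
    stage = stage.lower()
    if stage not in STAGE_SUFFIX:
        raise ValueError(f"unknown stage: {stage}")
    # The table lists the 13B reward model as the same as the 7B one.
    if stage == "rm":
        size = "7B"
    return f"allenai/OLMo-2-1124-{size}{STAGE_SUFFIX[stage]}"


print(repo_id("13B", "instruct"))
print(repo_id("13B", "rm"))
```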

Installation

OLMo 2 will be supported in the next release of Transformers. Until then, install it from the main branch:

pip install --upgrade git+https://github.com/huggingface/transformers.git

Inference

You can run OLMo with the standard HuggingFace transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B")
message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
# Optional: move the inputs and the model to CUDA
# inputs = {k: v.to('cuda') for k,v in inputs.items()}
# olmo = olmo.to('cuda')
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
# >> 'Language modeling is  a key component of any text-based application, but its effectiveness...'

For faster performance, you can quantize the model as follows:

import torch
from transformers import AutoModelForCausalLM
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B",
    torch_dtype=torch.float16,
    load_in_8bit=True)  # Requires bitsandbytes

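The quantized model is more sensitive to data types and CUDA operations. To avoid potential issues, it is recommended to pass the inputs directly to CUDA: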

input_ids = inputs.input_ids.to('cuda')  # .to() returns a new tensor, so keep the result

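We have released checkpoints for these models. For pretraining, the naming convention is stepXXX-tokensYYYB. For checkpoints that are ingredients of the model soup, the naming convention is stage2-ingredientN-stepXXX-tokensYYYB.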
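
The two naming conventions can be parsed with a single regular expression. The following is a minimal sketch, assuming that XXX, YYY, and N are plain decimal integers. The `parse_revision` function and its returned field names are hypothetical, introduced only for illustration.

```python
import re

# Pattern inferred from the two naming conventions described above.
_REVISION = re.compile(
    r"^(?:stage2-ingredient(?P<ingredient>\d+)-)?"
    r"step(?P<step>\d+)-tokens(?P<tokens>\d+)B$"
)


def parse_revision(name):
    """Split a checkpoint revision name into its numeric fields.

    Returns None when the name matches neither convention (for example "main").
    """
    match = _REVISION.match(name)
    if match is None:
        return None
    ingredient = match.group("ingredient")
    return {
        "kind": "pretraining" if ingredient is None else "stage2",
        "ingredient": None if ingredient is None else int(ingredient),
        "step": int(match.group("step")),
        "tokens_billions": int(match.group("tokens")),
    }


print(parse_revision("step102500-tokens860B"))
```

Names that do not fit either pattern return `None` instead of raising, so the function can be used as a filter over a mixed list of branch names.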

To load a specific model revision with HuggingFace, add the revision argument:

olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", revision="step102500-tokens860B")

Alternatively, you can list all revisions of the model with the following snippet:

from huggingface_hub import list_repo_refs
out = list_repo_refs("allenai/OLMo-2-1124-13B")
branches = [b.name for b in out.branches]
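
Once the branch names are available, a common next step is picking the most recent pretraining checkpoint. The sketch below assumes branch names follow the stepXXX-tokensYYYB convention. `latest_pretraining_branch` is a hypothetical helper, and the example list uses made-up names rather than the real contents of the repository.

```python
import re


def latest_pretraining_branch(branch_names):
    """Return the stepXXX-tokensYYYB branch with the highest step, or None."""
    best_name, best_step = None, -1
    for name in branch_names:
        match = re.match(r"^step(\d+)-tokens\d+B$", name)
        if match and int(match.group(1)) > best_step:
            best_name, best_step = name, int(match.group(1))
    return best_name


# Hypothetical branch names for illustration; real values would come from
# the `branches` list built in the snippet above.
example = [
    "main",
    "step1000-tokens9B",
    "step102500-tokens860B",
    "stage2-ingredient1-step1000-tokens9B",
]
print(latest_pretraining_branch(example))
```

The returned name can be passed as the `revision` argument when loading the model.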

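Fine-tuning

Fine-tuning can start from the final checkpoint (the main revision of this model) or from any of the intermediate checkpoints. Two recipes are available.

Fine-tune with the OLMo repository: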

torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
    --data.paths=[{path_to_data}/input_ids.npy] \
    --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
    --load_path={path_to_checkpoint} \
    --reset_trainer_state

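For more documentation, see the GitHub README.

Further fine-tuning support is being developed in AI2's Open Instruct repository (https://github.com/allenai/open-instruct).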

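Model Description

Developed by: Allen Institute for AI (Ai2)
Model type: a Transformer-style autoregressive language model.
Language(s) (NLP): English
License: The code and model are released under Apache 2.0.
Contact: Technical inquiries: olmo@allenai.org. Press: press@allenai.org
Date cutoff: December 2023.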

Model Sources

Project Page: https://allenai.org/olmo
Repositories:
- Core repo (training, inference, fine-tuning, etc.): https://github.com/allenai/OLMo
- Evaluation code: https://github.com/allenai/OLMo-Eval
- Further fine-tuning code: https://github.com/allenai/open-instruct
Paper: https://arxiv.org/abs/2501.00656

Evaluation

Core results for the OLMo 2 7B and 13B models are shown below.

Model	Train FLOPs	Average	ARC/C	HSwag	WinoG	MMLU	DROP	NQ	AGIEval	GSM8k	MMLUPro	TriviaQA
Open-weight models:
Llama-2-13B	1.6·10²³	54.1	67.3	83.9	74.9	55.7	45.6	38.4	41.5	28.1	23.9	81.3
Mistral-7B-v0.3	n/a	58.8	78.3	83.1	77.7	63.5	51.8	37.2	47.3	40.1	30	79.3
Llama-3.1-8B	7.2·10²³	61.8	79.5	81.6	76.6	66.9	56.4	33.9	51.3	56.5	34.7	80.3
Mistral-Nemo-12B	n/a	66.9	85.2	85.6	81.5	69.5	69.2	39.7	54.7	62.1	36.7	84.6
Qwen-2.5-7B	8.2·10²³	67.4	89.5	89.7	74.2	74.4	55.8	29.9	63.7	81.5	45.8	69.4
Gemma-2-9B	4.4·10²³	67.8	89.5	87.3	78.8	70.6	63	38	57.3	70.1	42	81.8
Qwen-2.5-14B	16.0·10²³	72.2	94	94	80	79.3	51.5	37.3	71	83.4	52.8	79.1
Partially open models:
StableLM-2-12B	2.9·10²³	62.2	81.9	84.5	77.7	62.4	55.5	37.6	50.9	62	29.3	79.9
Zamba-2-7B	n/c	65.2	92.2	89.4	79.6	68.5	51.7	36.5	55.5	67.2	32.8	78.8
Fully open models:
Amber-7B	0.5·10²³	35.2	44.9	74.5	65.5	24.7	26.1	18.7	21.8	4.8	11.7	59.3
OLMo-7B	1.0·10²³	38.3	46.4	78.1	68.5	28.3	27.3	24.8	23.7	9.2	12.1	64.1
MAP-Neo-7B	2.1·10²³	49.6	78.4	72.8	69.2	58	39.4	28.9	45.8	12.5	25.9	65.1
OLMo-0424-7B	0.9·10²³	50.7	66.9	80.1	73.6	54.3	50	29.6	43.9	27.7	22.1	58.8
DCLM-7B	1.0·10²³	56.9	79.8	82.3	77.3	64.4	39.3	28.8	47.5	46.1	31.3	72.1
OLMo-2-1124-7B	1.8·10²³	62.9	79.8	83.8	77.2	63.7	60.8	36.9	50.4	67.5	31	78
OLMo-2-1124-13B	4.6·10²³	68.3	83.5	86.4	81.5	67.5	70.7	46.7	54.2	75.1	35.1	81.9
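
The Average column appears to be the unweighted mean of the ten benchmark columns. This is an inference from the numbers rather than something the card states. The sketch below checks it against the two OLMo 2 rows; `ROWS` and `mean_score` are illustrative names.

```python
# Scores copied from the table, in column order:
# ARC/C, HSwag, WinoG, MMLU, DROP, NQ, AGIEval, GSM8k, MMLUPro, TriviaQA
ROWS = {
    "OLMo-2-1124-7B": (62.9, [79.8, 83.8, 77.2, 63.7, 60.8, 36.9, 50.4, 67.5, 31, 78]),
    "OLMo-2-1124-13B": (68.3, [83.5, 86.4, 81.5, 67.5, 70.7, 46.7, 54.2, 75.1, 35.1, 81.9]),
}


def mean_score(scores):
    """Unweighted mean, rounded to one decimal place like the table."""
    return round(sum(scores) / len(scores), 1)


for model, (reported, scores) in ROWS.items():
    print(model, "reported:", reported, "recomputed:", mean_score(scores))
```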

Model Details

Pretraining

	OLMo 2 7B	OLMo 2 13B
Pretraining Stage 1 (OLMo-Mix-1124)	4 trillion tokens (1 epoch)	5 trillion tokens (1.2 epochs)
Pretraining Stage 2 (Dolmino-Mix-1124)	50B tokens (3 runs), merged	100B tokens (3 runs) + 300B tokens (1 run), merged
Post-training (Tulu 3 SFT OLMo mix)	SFT + DPO + PPO (preference mix)	SFT + DPO + PPO (preference mix)

Stage 1: Initial Pretraining

Dataset: OLMo-Mix-1124 (3.9T tokens)
Coverage: 90%+ of the total pretraining budget
7B Model: ~1 epoch
13B Model: 1.2 epochs (5T tokens)

Stage 2: Fine-tuning

Dataset: Dolmino-Mix-1124 (843B tokens)
Three training mixes:
- 50B tokens
- 100B tokens
- 300B tokens
Mix composition: 50% high-quality data + academic/Q&A/instruction/math content

Model Merging

7B Model: 3 versions trained on the 50B mix, merged via model souping
13B Model: 3 versions trained on the 100B mix + 1 version trained on the 300B mix, merged into the final checkpoint
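
Model souping generally means averaging the parameters of several checkpoints that share an architecture. The card does not say how the ingredients are weighted, so the sketch below assumes a uniform average and uses plain Python lists in place of real tensors. `soup` and the toy checkpoints are illustrative and not taken from the OLMo codebase.

```python
def soup(state_dicts):
    """Average parameters element-wise across checkpoints.

    Each state dict maps a parameter name to a flat list of floats.
    Uniform weighting is an assumption of this sketch.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = state_dicts[0].keys()
    for state_dict in state_dicts[1:]:
        if state_dict.keys() != names:
            raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name in names:
        # Group the i-th element of this parameter from every checkpoint.
        columns = zip(*(state_dict[name] for state_dict in state_dicts))
        merged[name] = [sum(values) / len(state_dicts) for values in columns]
    return merged


# Toy checkpoints standing in for three runs on the same mix.
runs = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [2.0, 4.0], "b": [3.0]},
    {"w": [3.0, 6.0], "b": [6.0]},
]
print(soup(runs))
```

With real checkpoints the same loop would run over tensors from each model's state dict. The key requirement is that every ingredient has identical parameter names and shapes.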

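Bias, Risks, and Limitations

Like any base language model, or any fine-tuned model without safety filtering, these models can easily be prompted by users to generate harmful and sensitive content. Such content may also be produced unintentionally, especially in cases involving bias, so we recommend that users consider the risks when applying this technology. In addition, many statements from OLMo, or from any LLM, are inaccurate, so facts should be verified.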

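License and Use

OLMo 2 is licensed under the Apache 2.0 license. OLMo 2 is intended for research and educational use. For more information, please see our Responsible Use Guidelines.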

Citation

@misc{olmo20242olmo2furious,
      title={2 OLMo 2 Furious}, 
      author={Team OLMo and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
      year={2024},
      eprint={2501.00656},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00656}, 
}

Model Card Contact

For errors in this model card, contact olmo@allenai.org.
