Llama-3.1-Tulu-3-8B

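Tülu 3 is a leading family of instruction-following models. It is released as a post-training package with fully open-source data, code, and recipes, and is intended to serve as a comprehensive guide to modern post-training techniques. It is one step in a larger process for training fully open-source models, such as our OLMo models. Tülu 3 is designed for state-of-the-art performance on a diverse set of tasks beyond chat, such as MATH, GSM8K, and IFEval.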

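Model description

Model type: A model trained on a mix of publicly available, synthetic, and human-created datasets.
Language(s) (NLP): Primarily English
License: Llama 3.1 Community License Agreement
Finetuned from model: allenai/Llama-3.1-Tulu-3-8B-DPO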

Model sources

Training repository: https://github.com/allenai/open-instruct
Eval repository: https://github.com/allenai/olmes
Paper: https://arxiv.org/abs/2411.15124
Demo: https://playground.allenai.org/

Model family

Stage	Llama 3.1 8B	Llama 3.1 70B
Base model	meta-llama/Llama-3.1-8B	meta-llama/Llama-3.1-70B
Supervised fine-tuning (SFT)	allenai/Llama-3.1-Tulu-3-8B-SFT	allenai/Llama-3.1-Tulu-3-70B-SFT
Direct preference optimization (DPO)	allenai/Llama-3.1-Tulu-3-8B-DPO	allenai/Llama-3.1-Tulu-3-70B-DPO
Final model (RLVR)	allenai/Llama-3.1-Tulu-3-8B	allenai/Llama-3.1-Tulu-3-70B
Reward model (RM)	allenai/Llama-3.1-Tulu-3-8B-RM	(Same as 8B)

Stage	Llama 3.1 405B
Base model	meta-llama/llama-3.1-405B
Supervised fine-tuning (SFT)	allenai/llama-3.1-Tulu-3-405B-SFT
Direct preference optimization (DPO)	allenai/llama-3.1-Tulu-3-405B-DPO
Final model (RLVR)	allenai/llama-3.1-Tulu-3-405B
Reward model (RM)	(Same as 8B)

Using the model

Loading with HuggingFace

To load the model with HuggingFace, use the following snippet:

from transformers import AutoModelForCausalLM

tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")

vLLM

As a Llama base model, the model can be easily served with:

vllm serve allenai/Llama-3.1-Tulu-3-8B

Note that because of Llama's long chat template, you may want to use --max_model_len=8192.
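Once the server is running, it can be queried over HTTP. The following is a minimal sketch, not an official client. It assumes vLLM's OpenAI-compatible chat completions route on the default local port 8000; `BASE_URL`, `build_chat_payload`, and `chat` are hypothetical names introduced here for illustration.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running locally on its default port (8000)
# and exposes the OpenAI-compatible /v1/chat/completions route.
BASE_URL = "http://localhost:8000/v1"
MODEL = "allenai/Llama-3.1-Tulu-3-8B"


def build_chat_payload(user_message, system_prompt=None, max_tokens=256):
    """Build the JSON body for a single-turn chat completion request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {"model": MODEL, "messages": messages, "max_tokens": max_tokens}


def chat(user_message, system_prompt=None):
    """Send the request to the local server and return the reply text."""
    payload = build_chat_payload(user_message, system_prompt)
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example call (requires the server to be running):
# print(chat("How are you doing?"))
```

The server applies the tokenizer's chat template itself, so the client only sends role/content messages.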

Chat template

The chat template for our models is formatted as:

<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>

Or with the newlines expanded:

<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>

The template is also embedded in the tokenizer, for use with tokenizer.apply_chat_template.
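To make the format concrete, here is a minimal pure-Python sketch that reproduces the example above by hand. It is based only on that example: `format_tulu_chat` is a hypothetical helper, the newline placed between a finished assistant turn and the next turn is an assumption, and the template embedded in the tokenizer remains the authoritative version.

```python
# End-of-sequence marker as written in the example above.
EOS_TOKEN = "<|endoftext|>"


def format_tulu_chat(messages, add_generation_prompt=False):
    """Render user/assistant messages in the format shown in the example."""
    parts = []
    for index, message in enumerate(messages):
        is_last = index == len(messages) - 1
        if message["role"] == "user":
            parts.append("<|user|>\n" + message["content"] + "\n")
        elif message["role"] == "assistant":
            turn = "<|assistant|>\n" + message["content"] + EOS_TOKEN
            # Assumption: earlier assistant turns are followed by a newline.
            parts.append(turn if is_last else turn + "\n")
        else:
            raise ValueError("unsupported role: " + message["role"])
    if add_generation_prompt:
        # Open an assistant turn so the model writes the next reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_tulu_chat(
    [{"role": "user", "content": "How are you doing?"}],
    add_generation_prompt=True,
)
print(repr(prompt))
```

In practice, calling tokenizer.apply_chat_template on the same message list is preferable, since it stays in sync with the released tokenizer.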

System prompt

In Ai2 demos, we use this system prompt by default:

You are Tulu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI.

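The model has not been trained with a specific system prompt in mind.

Bias, risks, and limitations

The Tülu 3 models have limited safety training and, unlike ChatGPT, are not deployed with in-the-loop filtering of responses, so they can produce problematic outputs (especially when prompted to do so).

The size and composition of the corpus used to train the base Llama 3.1 models are also unknown, but it likely included a mix of web data and technical sources such as books and code.

See the Falcon 180B model card for an example of this.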

Performance

Benchmark (eval)	Tülu 3 SFT 8B	Tülu 3 DPO 8B	Tülu 3 8B	Llama 3.1 8B Instruct	Qwen 2.5 7B Instruct	Magpie 8B	Gemma 2 9B Instruct	Ministral 8B Instruct
Avg.	60.4	64.4	64.8	62.2	57.8	44.7	55.2	58.3
MMLU (0 shot, CoT)	65.9	68.7	68.2	71.2	76.6	62.0	74.6	68.5
PopQA (15 shot)	29.3	29.3	29.1	20.2	18.1	22.5	28.3	20.2
TruthfulQA (6 shot)	46.8	56.1	55.0	55.1	63.1	57.0	61.4	55.5
BigBenchHard (3 shot, CoT)	67.9	65.8	66.0	62.8	21.7	0.9	2.5	56.2
DROP (3 shot)	61.3	62.5	62.6	61.5	54.4	49.4	58.8	56.2
MATH (4 shot CoT, Flex)	31.5	42.0	43.7	42.5	14.8	5.1	29.8	40.0
GSM8K (8 shot, CoT)	76.2	84.3	87.6	83.4	83.8	61.2	79.7	80.0
HumanEval (pass@10)	86.2	83.9	83.9	86.3	93.1	75.4	71.7	91.0
HumanEval+ (pass@10)	81.4	78.6	79.2	82.9	89.7	69.1	67.0	88.5
IFEval (prompt loose)	72.8	81.1	82.4	80.6	74.7	38.8	69.9	56.4
AlpacaEval 2 (LC % win)	12.4	33.5	34.5	24.2	29.0	49.0	43.7	31.4
Safety (6 task avg.)	93.1	87.2	85.5	75.2	75.0	46.4	75.5	56.2
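The average row appears to be the unweighted mean of the twelve benchmark rows beneath it. This is an inference from the numbers rather than something the card states. A minimal check for the Tülu 3 8B column, with the scores copied from the table above:

```python
# Scores for the Tülu 3 8B column, copied from the table above.
tulu3_8b_scores = {
    "MMLU": 68.2,
    "PopQA": 29.1,
    "TruthfulQA": 55.0,
    "BigBenchHard": 66.0,
    "DROP": 62.6,
    "MATH": 43.7,
    "GSM8K": 87.6,
    "HumanEval": 83.9,
    "HumanEval+": 79.2,
    "IFEval": 82.4,
    "AlpacaEval 2": 34.5,
    "Safety": 85.5,
}

# Unweighted mean over all twelve benchmarks.
average = sum(tulu3_8b_scores.values()) / len(tulu3_8b_scores)
print(round(average, 1))  # 64.8, matching the average reported in the table
```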

Benchmark (eval)	Tülu 3 70B SFT	Tülu 3 DPO 70B	Tülu 3 70B	Llama 3.1 70B Instruct	Qwen 2.5 72B Instruct	Hermes 3 Llama 3.1 70B	Nemotron Llama 3.1 70B
Avg.	72.6	75.9	76.0	73.4	71.5	68.3	65.5
MMLU (0 shot, CoT)	78.9	83.3	83.1	85.3	85.5	80.4	83.8
PopQA (15 shot)	48.6	46.3	46.5	46.4	30.6	48.1	36.4
TruthfulQA (6 shot)	55.7	67.9	67.6	66.8	69.9	66.5	62.6
BigBenchHard (3 shot, CoT)	82.7	81.8	82.0	73.8	67.2	82.1	0.7
DROP (3 shot)	77.2	74.1	74.3	77.0	34.2	73.2	68.8
MATH (4 shot CoT, Flex)	53.7	62.3	63.0	56.4	74.3	41.9	55.0
GSM8K (8 shot, CoT)	91.1	93.5	93.5	93.7	89.5	90.0	84.7
HumanEval (pass@10)	92.9	92.4	92.4	93.6	94.0	89.6	94.1
HumanEval+ (pass@10)	87.3	88.4	88.0	89.5	90.8	85.9	85.5
IFEval (prompt loose)	82.1	82.6	83.2	88.0	87.6	76.0	79.9
AlpacaEval 2 (LC % win)	26.3	49.6	49.8	33.4	47.7	28.4	66.1
Safety (6 task avg.)	94.4	89.0	88.3	76.5	87.0	57.9	69.0

Benchmark (eval)	Tülu 3 405B SFT	Tülu 3 405B DPO	Tülu 3 405B	Llama 3.1 405B Instruct	Nous Hermes 3 405B	Deepseek V3	GPT 4o (11-24)
Avg. w/o Safety	76.3	79.0	80.0	78.1	74.4	79.0	80.5
Avg. w/ Safety	77.5	79.6	80.7	79.0	73.5	75.9	81.6
MMLU (5 shot, CoT)	84.4	86.6	87.0	88.0	84.9	82.1	87.9
PopQA (3 shot)	55.7	55.4	55.5	52.9	54.2	44.9	53.6
BigBenchHard (0 shot, CoT)	88.0	88.8	88.6	87.1	87.7	89.5	83.3
MATH (4 shot, Flex)	63.4	59.9	67.3	66.6	58.4	72.5	68.8
GSM8K (8 shot, CoT)	93.6	94.2	95.5	95.4	92.7	94.1	91.7
HumanEval (pass@10)	95.7	97.2	95.9	95.9	92.3	94.6	97.0
HumanEval+ (pass@10)	93.3	93.9	92.9	90.3	86.9	91.6	92.7
IFEval (prompt loose)	82.4	85.0	86.0	88.4	81.9	88.0	84.8
AlpacaEval 2 (LC % win)	30.4	49.8	51.4	38.5	30.2	53.5	65.0
Safety (6 task avg.)	87.7	85.5	86.7	86.8	65.8	72.2	90.9

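Hyperparameters

PPO settings for RLVR:

Learning rate: 3 × 10⁻⁷
Discount factor (gamma): 1.0
Generalized advantage estimation (lambda): 0.95
Mini-batches (N_mb): 1
PPO update iterations (K): 4
PPO clipping coefficient (epsilon): 0.2
Value function coefficient (c1): 0.1
Gradient norm threshold: 1.0
Learning rate schedule: Linear
Generation temperature: 1.0
Batch size (effective): 224
Max token length: 2,048
Max prompt token length: 2,048
Penalty reward value for responses without an EOS token: -10.0
Response length: 2,048
Total episodes: 100,000
KL penalty coefficient (beta): 0.05
Warm-up ratio (omega): 0.0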
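To show where several of these settings enter a PPO update, here is a minimal pure-Python sketch of the standard textbook formulas. It is not the open-instruct implementation. All function names are hypothetical, and two details are assumptions: the KL penalty is applied per token against a reference policy, and a response without an EOS token has its score replaced by the penalty value.

```python
GAMMA = 1.0             # discount factor (gamma)
LAMBDA = 0.95           # generalized advantage estimation (lambda)
EPSILON = 0.2           # PPO clipping coefficient (epsilon)
BETA = 0.05             # KL penalty coefficient (beta)
NO_EOS_PENALTY = -10.0  # penalty reward for responses without an EOS token


def shaped_rewards(score, has_eos, logprobs, ref_logprobs):
    """Per-token rewards: KL penalty on every token, sequence score on the last."""
    if not has_eos:
        # Assumption: the penalty replaces the score rather than adding to it.
        score = NO_EOS_PENALTY
    rewards = [-BETA * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


def gae_advantages(rewards, values, gamma=GAMMA, lam=LAMBDA):
    """Generalized advantage estimation over one finished response."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # the value after the final token is treated as zero
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, epsilon=EPSILON):
    """PPO clipped objective for one token (to be maximized)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With gamma set to 1.0 the rewards are not discounted, so lambda alone controls how quickly credit for the final-token score decays toward earlier tokens.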

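License and use

These models have been fine-tuned using a dataset mix that includes outputs generated by third-party models, and are subject to additional terms: the Gemma Terms of Use and the Qwen License Agreement (the models were improved using Qwen 2.5).

Citation

If Tülu 3 or any of the related materials were helpful to your work, please cite: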

@article{lambert2024tulu3,
  title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and 
    Jacob Morrison and 
    Valentina Pyatkin and 
    Shengyi Huang and 
    Hamish Ivison and 
    Faeze Brahman and 
    Lester James V. Miranda and 
    Alisa Liu and 
    Nouha Dziri and 
    Shane Lyu and 
    Yuling Gu and 
    Saumya Malik and 
    Victoria Graf and 
    Jena D. Hwang and 
    Jiangjiang Yang and
    Ronan Le Bras and
    Oyvind Tafjord and
    Chris Wilhelm and
    Luca Soldaini and 
    Noah A. Smith and 
    Yizhong Wang and 
    Pradeep Dasigi and 
    Hannaneh Hajishirzi
  },
  year = {2024},
  email = {tulu@allenai.org}
}

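Open LLM Leaderboard evaluation results

Detailed and summarized results are available on the Open LLM Leaderboard.

Metric	Value (%)
Avg.	25.88
IFEval (0-shot)	82.55
BBH (3-shot)	16.86
MATH Lvl 5 (4-shot)	18.88
GPQA (0-shot)	6.26
MuSR (0-shot)	10.52
MMLU-PRO (5-shot)	20.23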