DCLM-Baseline-7B Model Card

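DCLM-Baseline-7B is a 7-billion-parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. The model is intended to demonstrate how effective systematic data curation techniques are at improving language model performance.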

Model Details

Size	Training Tokens	Layers	Hidden Size	Attention Heads	Context Length
7B	2.5T	32	4096	32	2048
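
As a rough sanity check on the table above, the per-head dimension and an approximate parameter count can be derived from the layer count and hidden size. The sketch below assumes a generic decoder-only Transformer in which each layer holds about 12·d² weights (roughly 4·d² for attention and 8·d² for the feed-forward block). That figure is a common rule of thumb, not the actual OpenLM layer definition, and embeddings are left out because the vocabulary size is not stated in this card.

```python
# Figures taken from the model details table.
n_layers = 32
hidden_size = 4096
n_heads = 32

# Each attention head works on an equal slice of the hidden dimension.
head_dim = hidden_size // n_heads

# ASSUMPTION: about 12 * d^2 weights per layer (rule of thumb, not the OpenLM definition).
params_per_layer = 12 * hidden_size ** 2
approx_non_embedding_params = n_layers * params_per_layer

print(f"head_dim = {head_dim}")
print(f"approx. non-embedding parameters = {approx_non_embedding_params / 1e9:.2f}B")
```

The estimate is of the same order as the nominal 7B size; any gap would come from embeddings and architectural details that are not listed here.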

Model Description

Developed by: DataComp for Language Models (DCLM) Team
Model type: Decoder-only Transformer language model
Language(s): English (primarily)
License: Apple Sample Code License
Contact: contact@datacomp.ai
Date: June 2024

Model Sources

Repository: https://github.com/mlfoundations/dclm
Paper: DataComp-LM: In search of the next generation of training sets for language models

Using the Model

First install open_lm:

pip install git+https://github.com/mlfoundations/open_lm.git

Then:

# Importing open_lm.hf makes the OpenLM model classes available to the Auto* loaders.
from open_lm.hf import *
from openmind import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AI-Research/DCLM-7B")
# device_map="npu:0" targets the first NPU device; adjust it for other hardware.
model = AutoModelForCausalLM.from_pretrained("AI-Research/DCLM-7B", device_map="npu:0")

inputs = tokenizer(["Machine learning is"], return_tensors="pt").to(model.device)
# Sampling settings: temperature plus top-p sampling with a mild repetition penalty.
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}
output = model.generate(inputs["input_ids"], **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
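
To make the `temperature` and `top_p` settings in the example above more concrete, here is a minimal, dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration only, not the implementation used by the generation library; the function name `sample_top_p` and the toy logits are hypothetical, and the repetition penalty is not modelled.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.8, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens; random.choices normalises the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Hypothetical logits for a 5-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
random.seed(0)
picks = [sample_top_p(toy_logits) for _ in range(1000)]
print(sorted(set(picks)))
```

With these toy logits, only the two most likely tokens survive the top-p cut, so the unlikely tail is never sampled. Lowering `top_p` or `temperature` makes generation more conservative; raising them makes it more varied.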

Training Details

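The model was trained with the following configuration:

Architecture: Decoder-only Transformer
Framework: PyTorch with OpenLM
Optimizer: AdamW
Learning rate: 2e-3 (peak)
Weight decay: 0.05
Batch size: 2048 sequences
Sequence length: 2048 tokens
Total training tokens: 2.5T
Hardware: Trained on H100 GPUs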
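
The batch size, sequence length, and total token count listed above imply an approximate number of optimizer steps. The short calculation below is a back-of-the-envelope sketch; it assumes every sequence is filled to the full 2048 tokens, that the batch size stays constant throughout training, and that "2.5T" means exactly 2.5e12 tokens.

```python
# Values taken from the training configuration.
batch_size_sequences = 2048
sequence_length = 2048
total_tokens = 2.5e12  # ASSUMPTION: "2.5T" read as exactly 2.5e12

# Tokens consumed by one optimizer step.
tokens_per_step = batch_size_sequences * sequence_length

# Approximate number of steps needed to consume the full token budget.
approx_steps = total_tokens / tokens_per_step

print(f"tokens per step: {tokens_per_step:,}")
print(f"approx. optimizer steps: {approx_steps:,.0f}")
```

Under these assumptions each step processes about 4.2 million tokens, and the run amounts to roughly 600 thousand steps.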

Evaluation

The following are evaluation results for DCLM-Baseline-7B on various tasks (using the llm-foundry evaluation suite):

Task	Score
MMLU (zero-shot)	0.5766
MMLU (few-shot)	0.6372
HellaSwag (zero-shot)	0.7987
HellaSwag	0.8043
Jeopardy	0.4745
TriviaQA	0.5270
GSM8K (CoT)	0.0250
AGI Eval SAT Math (CoT)	0.0136
AQuA (CoT)	0.0490
SVAMP (CoT)	0.4900
BigBench QA Wikidata	0.7120
ARC Easy	0.8220
ARC Challenge	0.5990
BigBench Misconceptions	0.6986
COPA	0.8500
SIQA	0.8291
CommonsenseQA	0.8018
PIQA	0.8128
OpenBookQA	0.4540
BigBench Novel Concepts	0.7188
BigBench Strange Stories	0.7586
BigBench Strategy QA	0.6173
LAMBADA	0.8220
Winograd	0.8828
Winogrande	0.7269
BigBench Conlang Translation	0.0244
BigBench Language Identification	0.5219
BigBench Conceptual Combinations	0.6990
BigBench Elementary Math QA	0.3431
BigBench Dyck Languages	0.4930
AGI Eval LSAT AR	0.2435
BigBench CS Algorithms	0.6121
BigBench Logical Deduction	0.3620
BigBench Operators	0.4857
BigBench Repeat Copy Logic	0.4063
Simple Arithmetic (no spaces)	0.2940
Simple Arithmetic (with spaces)	0.3110
MathQA	0.3098
LogiQA	0.4132
PubMedQA	0.7060
SQuAD	0.5856
AGI Eval LSAT RC	0.6716
AGI Eval LSAT LR	0.5392
CoQA	0.4074
BigBench Understanding Fables	0.6825
BoolQ	0.8343
AGI Eval SAT EN	0.7670
Winogender MC (Female)	0.6000
Winogender MC (Male)	0.5500
Enterprise PII Classification	0.7676
BBQ	0.6912
GPQA Main	0.2612
GPQA Diamond	0.2475

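Note: All scores are presented as decimal values between 0 and 1, representing the proportion of correct answers or the model's performance on each task.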
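
Because the scores are fractions, they need to be multiplied by 100 before they can be compared with percentage-style numbers such as those in the comparison table. The sketch below does this for a few rows copied from the evaluation table; the helper name `as_percent` is hypothetical.

```python
# Scores copied from the evaluation table (fractions between 0 and 1).
scores = {
    "MMLU (zero-shot)": 0.5766,
    "MMLU (few-shot)": 0.6372,
    "HellaSwag (zero-shot)": 0.7987,
    "HellaSwag": 0.8043,
}

def as_percent(score):
    """Convert a 0-1 fraction to a percentage rounded to one decimal place."""
    return round(score * 100, 1)

for task, score in scores.items():
    print(f"{task}: {as_percent(score)}%")

# Gain from few-shot prompting on MMLU, in percentage points.
mmlu_gain_points = (scores["MMLU (few-shot)"] - scores["MMLU (zero-shot)"]) * 100
print(f"MMLU few-shot gain: {mmlu_gain_points:.2f} points")
```

The few-shot MMLU score rounds to 63.7, which matches the MMLU value listed for DCLM-7B in the comparison table. That suggests, though the card does not say so explicitly, that the comparison column reports the few-shot figure.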

Comparison

Below is a comparison of this model with other models in the 7B regime.

Model	Params	Training Tokens	Open dataset?	CORE	MMLU	EXTENDED
Open weights, closed datasets
Llama2	7B	2T	❌	49.2	45.8	34.1
DeepSeek	7B	2T	❌	50.7	48.5	35.3
Mistral-0.3	7B	?	❌	57.0	62.7	45.1
QWEN-2	7B	?	❌	57.5	71.9	50.5
Llama3	8B	15T	❌	57.6	66.2	46.3
Gemma	8B	6T	❌	57.8	64.3	44.6
Phi-3	7B	?	❌	61.0	69.9	57.9
Open weights, open datasets
Falcon	7B	1T	✅	44.1	27.4	25.1
OLMo-1.7	7B	2.1T	✅	47.0	54.0	34.2
MAP-Neo	7B	4.5T	✅	50.2	57.1	40.4
DCLM-7B	7B	2.5T	✅	56.1	63.7	43.6
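
One way to read the comparison table is to look at DCLM-7B's margin over the other open-data models and at its token budget relative to a closed-data model such as Llama3. The sketch below uses rows copied from the table; the tuple layout and variable names are illustrative choices, not part of any DCLM tooling.

```python
# Rows copied from the open-data part of the comparison table:
# (model, training tokens in trillions, CORE, MMLU, EXTENDED)
open_data_models = [
    ("Falcon", 1.0, 44.1, 27.4, 25.1),
    ("OLMo-1.7", 2.1, 47.0, 54.0, 34.2),
    ("MAP-Neo", 4.5, 50.2, 57.1, 40.4),
    ("DCLM-7B", 2.5, 56.1, 63.7, 43.6),
]
# Llama3 row from the closed-data part of the table.
llama3 = ("Llama3", 15.0, 57.6, 66.2, 46.3)

dclm = next(row for row in open_data_models if row[0] == "DCLM-7B")
best_by_mmlu = max(open_data_models, key=lambda row: row[3])
runner_up = max((row for row in open_data_models if row[0] != "DCLM-7B"),
                key=lambda row: row[3])

# MMLU margin over the next-best open-data model, in points.
mmlu_margin = round(dclm[3] - runner_up[3], 1)
# How many times more training tokens Llama3 used than DCLM-7B.
token_ratio = llama3[1] / dclm[1]

print(f"best open-data model by MMLU: {best_by_mmlu[0]}")
print(f"MMLU margin over {runner_up[0]}: {mmlu_margin} points")
print(f"Llama3 / DCLM-7B training tokens: {token_ratio:.1f}x")
```

On these numbers DCLM-7B leads the open-data group on MMLU while using fewer training tokens than MAP-Neo, and it comes within a few points of Llama3 with a fraction of the token budget.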

Limitations and Biases

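While DCLM-Baseline-7B demonstrates strong performance across a range of tasks, it is important to note the following:

The model may exhibit biases present in its training data, which is derived from web crawl data.
It has not undergone specific alignment or safety fine-tuning, so its outputs should be used with caution.
Performance on tasks not included in the evaluation suite may vary.
The model's knowledge is limited to its training data cutoff date.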

Ethical Considerations

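Users should be aware that this model, like all large language models, can generate harmful or biased content. It should not be used to make decisions about individuals or in sensitive applications without appropriate safeguards and human oversight.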

Citation

If you use this model in your research, please cite:

@article{Li2024DataCompLM,
  title={DataComp-LM: In search of the next generation of training sets for language models},
  author={Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and [... full author list]},
  journal={arXiv preprint arXiv:2406.11794},
  year={2024}
}