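DCLM-Baseline-7B is a 7-billion-parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. The model is intended to demonstrate how systematic data curation can improve language model performance.

| Size | Training Tokens | Layers | Hidden Size | Attention Heads | Context Length |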
|---|---|---|---|---|---|
| 7B | 2.5T | 32 | 4096 | 32 | 2048 |
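
As a quick sanity check on the table above, two derived quantities follow from the listed values: the per-head dimension and the ratio of training tokens to parameters. This is a minimal sketch in plain Python. The variable names are illustrative, the head dimension assumes the common convention that the hidden size is split evenly across heads, and `7e9` is the nominal "7B" figure rather than an exact parameter count.

```python
# Values copied from the model table above.
hidden_size = 4096
num_heads = 32
training_tokens = 2.5e12   # 2.5T
nominal_params = 7e9       # ASSUMPTION: "7B" taken as a nominal, not exact, count

# ASSUMPTION: the hidden dimension is divided evenly among attention heads.
head_dim = hidden_size // num_heads

# Rough ratio of training tokens to parameters.
tokens_per_param = training_tokens / nominal_params

print(f"head_dim = {head_dim}")
print(f"tokens per parameter ≈ {tokens_per_param:.0f}")
```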
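First, install open_lm:

```bash
pip install git+https://github.com/mlfoundations/open_lm.git
```

Then:

```python
# Imported for its side effect: lets the Auto* classes resolve the open_lm architecture.
from open_lm.hf import *
from openmind import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AI-Research/DCLM-7B")
# Load the model onto the first NPU device.
model = AutoModelForCausalLM.from_pretrained("AI-Research/DCLM-7B", device_map="npu:0")

# Tokenize the prompt and move the tensors to the same device as the model.
inputs = tokenizer(["Machine learning is"], return_tensors="pt").to(model.device)

# Sampling settings: nucleus sampling with a mild repetition penalty.
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}
output = model.generate(inputs["input_ids"], **gen_kwargs)

# Decode the first (and only) generated sequence back to text.
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```

The model was trained with the following configuration: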
The table below lists evaluation results for DCLM-Baseline-7B on a range of tasks, obtained with the llm-foundry evaluation suite:

| Task | Score |
|---|---|
| MMLU (zero-shot) | 0.5766 |
| MMLU (few-shot) | 0.6372 |
| HellaSwag (zero-shot) | 0.7987 |
| HellaSwag | 0.8043 |
| Jeopardy | 0.4745 |
| TriviaQA | 0.5270 |
| GSM8K (CoT) | 0.0250 |
| AGI Eval SAT Math (CoT) | 0.0136 |
| AQuA (CoT) | 0.0490 |
| SVAMP (CoT) | 0.4900 |
| BigBench QA Wikidata | 0.7120 |
| ARC Easy | 0.8220 |
| ARC Challenge | 0.5990 |
| BigBench Misconceptions | 0.6986 |
| COPA | 0.8500 |
| SIQA | 0.8291 |
| CommonsenseQA | 0.8018 |
| PIQA | 0.8128 |
| OpenBookQA | 0.4540 |
| BigBench Novel Concepts | 0.7188 |
| BigBench Strange Stories | 0.7586 |
| BigBench Strategy QA | 0.6173 |
| LAMBADA | 0.8220 |
| Winograd | 0.8828 |
| Winogrande | 0.7269 |
| BigBench Conlang Translation | 0.0244 |
| BigBench Language Identification | 0.5219 |
| BigBench Conceptual Combinations | 0.6990 |
| BigBench Elementary Math QA | 0.3431 |
| BigBench Dyck Languages | 0.4930 |
| AGI Eval LSAT AR | 0.2435 |
| BigBench CS Algorithms | 0.6121 |
| BigBench Logical Deduction | 0.3620 |
| BigBench Operators | 0.4857 |
| BigBench Repeat Copy Logic | 0.4063 |
| Simple Arithmetic (no spaces) | 0.2940 |
| Simple Arithmetic (with spaces) | 0.3110 |
| MathQA | 0.3098 |
| LogiQA | 0.4132 |
| PubMedQA | 0.7060 |
| SQuAD | 0.5856 |
| AGI Eval LSAT RC | 0.6716 |
| AGI Eval LSAT LR | 0.5392 |
| CoQA | 0.4074 |
| BigBench Understanding Fables | 0.6825 |
| BoolQ | 0.8343 |
| AGI Eval SAT EN | 0.7670 |
| Winogender MC (Female) | 0.6000 |
| Winogender MC (Male) | 0.5500 |
| Enterprise PII Classification | 0.7676 |
| BBQ | 0.6912 |
| GPQA Main | 0.2612 |
| GPQA Diamond | 0.2475 |

Note: All scores are reported as decimals between 0 and 1, representing the fraction of correct answers or the model's performance on each task.
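
Because the scores above are fractions while the model comparison table later in this card uses a 0 to 100 scale, it can help to convert between the two when cross-checking numbers. The sketch below uses a hypothetical helper, `to_percent`, on a few rows copied from the table; for example, the few-shot MMLU score of 0.6372 appears to correspond to the 63.7 listed for DCLM-7B in the comparison.

```python
# A few scores copied from the evaluation table above (fractions in [0, 1]).
scores = {
    "MMLU (few-shot)": 0.6372,
    "HellaSwag (zero-shot)": 0.7987,
    "ARC Challenge": 0.5990,
}

def to_percent(score: float, ndigits: int = 1) -> float:
    """Convert a fraction in [0, 1] to a percentage rounded to `ndigits` places.

    Hypothetical helper for illustration; not part of any evaluation suite.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"expected a fraction in [0, 1], got {score}")
    return round(score * 100, ndigits)

for task, score in scores.items():
    print(f"{task}: {to_percent(score)}")
```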

The table below compares this model with other models at the 7B scale.

| Model | Params | Training Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---|---|---|---|---|---|---|
| Open weights, closed datasets | | | | | | |
| Llama2 | 7B | 2T | ❌ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ❌ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ❌ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ❌ | 57.5 | 71.9 | 50.5 |
| Llama3 | 8B | 15T | ❌ | 57.6 | 66.2 | 46.3 |
| Gemma | 8B | 6T | ❌ | 57.8 | 64.3 | 44.6 |
| Phi-3 | 7B | ? | ❌ | 61.0 | 69.9 | 57.9 |
| Open weights, open datasets | | | | | | |
| Falcon | 7B | 1T | ✅ | 44.1 | 27.4 | 25.1 |
| OLMo-1.7 | 7B | 2.1T | ✅ | 47.0 | 54.0 | 34.2 |
| MAP-Neo | 7B | 4.5T | ✅ | 50.2 | 57.1 | 40.4 |
| DCLM-7B | 7B | 2.5T | ✅ | 56.1 | 63.7 | 43.6 |
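
To make the open-dataset comparison easier to read, the margins between DCLM-7B and the other open-dataset rows can be computed directly from the table. This is only an illustrative sketch: the numbers are copied from the rows above, and `margins_over` is a hypothetical helper rather than anything provided by the DCLM tooling.

```python
# (CORE, MMLU, EXTENDED) for the open-weights, open-dataset rows above.
open_data_models = {
    "Falcon": (44.1, 27.4, 25.1),
    "OLMo-1.7": (47.0, 54.0, 34.2),
    "MAP-Neo": (50.2, 57.1, 40.4),
    "DCLM-7B": (56.1, 63.7, 43.6),
}
metrics = ("CORE", "MMLU", "EXTENDED")

def margins_over(baseline: str, target: str = "DCLM-7B") -> dict:
    """Return target-minus-baseline differences per metric, in points."""
    base = open_data_models[baseline]
    tgt = open_data_models[target]
    # Round to one decimal to match the table's precision.
    return {m: round(t - b, 1) for m, t, b in zip(metrics, tgt, base)}

for name in open_data_models:
    if name != "DCLM-7B":
        print(name, margins_over(name))
```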
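
Although DCLM-Baseline-7B performs well across a range of tasks, the following should be noted:

Like all large language models, this model can generate harmful or biased content. It should not be used to make decisions about individuals or in sensitive applications without appropriate safeguards and human oversight.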

If you use this model in your research, please cite:

@article{Li2024DataCompLM,
title={DataComp-LM: In search of the next generation of training sets for language models},
author={Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and [... full author list]},
journal={arXiv preprint arXiv:2406.11794},
year={2024}
}