Dataset Card for MMLU

Dataset Description

Repository: https://github.com/hendrycks/test
Paper: https://arxiv.org/abs/2009.03300

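Dataset Summary

Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).

This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. It covers 57 tasks, including elementary mathematics, US history, computer science, and law. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.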

A complete list of tasks: ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']

Supported Tasks and Leaderboards

Model	Authors	Humanities	Social Science	STEM	Other	Average
UnifiedQA	Khashabi et al., 2020	45.6	56.6	40.2	54.6	48.9
GPT-3 (few-shot)	Brown et al., 2020	40.8	50.4	36.7	48.8	43.9
GPT-2	Radford et al., 2019	32.8	33.3	30.2	33.1	32.4
Random Baseline	N/A	25.0	25.0	25.0	25.0	25.0
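
Every question has four answer choices, which is why the random baseline sits at 25.0 in each column. The sketch below shows one way such an accuracy percentage could be computed. The function name `accuracy` and the toy predictions are hypothetical illustrations and are not taken from the official evaluation code.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of answers")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Expected score of uniform random guessing over four choices.
NUM_CHOICES = 4
random_baseline = 100.0 / NUM_CHOICES

# Toy data, made up for illustration: 3 of the 4 predictions are correct.
toy_predictions = ["A", "D", "C", "B"]
toy_answers = ["A", "D", "C", "C"]
toy_score = accuracy(toy_predictions, toy_answers)
```

Scores in the table are reported on the same 0 to 100 scale, so a model is only informative to the extent that it clears the 25.0 floor.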

Languages

English

Dataset Structure

Data Instances

An example from the anatomy subtask looks as follows:

{
  "question": "What is the embryological origin of the hyoid bone?",
  "choices": ["The first pharyngeal arch", "The first and second pharyngeal arches", "The second pharyngeal arch", "The second and third pharyngeal arches"],
  "answer": "D"
}

Data Fields

question: a string feature
choices: a list of 4 string features
answer: a ClassLabel feature
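
To make the relationship between the three fields concrete, here is a minimal sketch that renders a record as a multiple-choice prompt and resolves the gold answer to its choice text. The prompt layout and the helper names `format_question` and `answer_index` are assumptions made for illustration. The sketch also assumes the answer is stored as a letter, as in the anatomy example. If a loader exposes it as an integer class index instead, the letter lookup would not be needed.

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]

# The anatomy record shown in the Data Instances section.
example = {
    "question": "What is the embryological origin of the hyoid bone?",
    "choices": [
        "The first pharyngeal arch",
        "The first and second pharyngeal arches",
        "The second pharyngeal arch",
        "The second and third pharyngeal arches",
    ],
    "answer": "D",
}


def format_question(record):
    """Render a record as a lettered multiple-choice prompt (hypothetical layout)."""
    lines = [record["question"]]
    for letter, choice in zip(CHOICE_LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer_index(record):
    """Map the answer letter to a zero-based position in the choices list."""
    return CHOICE_LETTERS.index(record["answer"])


prompt = format_question(example)
gold_choice = example["choices"][answer_index(example)]
```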

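Data Splits

auxiliary_train: auxiliary multiple-choice training questions from ARC, MC_TEST, OBQA, RACE, etc.
dev: 5 examples per subtask, intended for the few-shot setting
test: at least 100 examples per subtask

	auxiliary_train	dev	val	test
Total	99842	285	1531	14042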
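
The split sizes can be cross-checked against the per-subtask descriptions with simple arithmetic. The sketch below uses only the numbers given in this card; the variable names are illustrative.

```python
NUM_SUBTASKS = 57
DEV_EXAMPLES_PER_SUBTASK = 5
MIN_TEST_EXAMPLES_PER_SUBTASK = 100

# Split sizes copied from the table above.
split_sizes = {
    "auxiliary_train": 99842,
    "dev": 285,
    "val": 1531,
    "test": 14042,
}

# dev should hold exactly 5 examples for each of the 57 subtasks.
expected_dev = NUM_SUBTASKS * DEV_EXAMPLES_PER_SUBTASK
dev_matches = split_sizes["dev"] == expected_dev

# test should hold at least 100 examples for each subtask.
min_test = NUM_SUBTASKS * MIN_TEST_EXAMPLES_PER_SUBTASK
test_meets_minimum = split_sizes["test"] >= min_test

total_examples = sum(split_sizes.values())
```

Both checks hold for the reported totals: the dev split is exactly 57 × 5 = 285 examples, and the test split comfortably exceeds the 5,700-example minimum.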

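Dataset Creation

Curation Rationale

Transformer models have driven recent progress in this field by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn.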

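Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]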

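Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]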

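Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

MIT License

Citation Information

If you find this dataset useful in your research, please consider citing the test and also the ETHICS dataset it draws from: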

    @article{hendryckstest2021,
      title={Measuring Massive Multitask Language Understanding},
      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
      journal={Proceedings of the International Conference on Learning Representations (ICLR)},
      year={2021}
    }

    @article{hendrycks2021ethics,
      title={Aligning AI With Shared Human Values},
      author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
      journal={Proceedings of the International Conference on Learning Representations (ICLR)},
      year={2021}
    }

Contributions

Thanks to @andyzoujm for adding this dataset.

