BLOOM LM

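BigScience Large Open-science Open-access Multilingual Language Model

Model Card

Version 1.0 / May 26, 2022

Change Log

Added example code and updated link paths.

Model Details

Quick Start

The following code shows an example of interacting with bloom_3b: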

import torch
from openmind import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model weights; device_map="auto" lets the loader choose device placement.
tokenizer = AutoTokenizer.from_pretrained("PyTorch-NPU/bloom_3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("PyTorch-NPU/bloom_3b", trust_remote_code=True, device_map="auto")

# Wrap the instruction in an instruction/response prompt template.
instruction = "Give three tips for staying healthy."
prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          f"### Instruction:\n{instruction}\n\n### Response:")
# Tokenize the prompt and move the tensors to the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate up to 512 new tokens with a mild repetition penalty, then decode the first sequence.
pred = model.generate(**inputs, max_new_tokens=512, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

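Basics

This section provides information for anyone who wants to know about the model.

Developed by: BigScience

All collaborators are either volunteers or have an agreement with their employer. (Further breakdown of participants forthcoming.)

Model Type: Transformer-based Language Model

Version: 1.0.0

Languages: Multiple; see Training Data

License: RAIL License v1.0 (link)

Release Date Estimate: Monday, July 11, 2022

Send Questions to: bigscience-contact@googlegroups.com

Cite as: BigScience, BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model. International, May 2021-May 2022

Funded by:

The French government.
Hugging Face.
Organizations of contributors. (Further breakdown of organizations forthcoming.)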

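Technical Specifications

This section provides information for people who work on model development.

Please see the BLOOM training README for full details on replicating training.

Model Architecture: Modified from Megatron-LM GPT2 (see paper, BLOOM Megatron code):

Decoder-only architecture
Layer normalization applied to the word embeddings layer (StableEmbedding; see code, paper)
ALiBi positional encodings (see paper), with GeLU activation functions
3,002,557,440 parameters:
- 642,252,800 embedding parameters
- 30 layers, 32 attention heads
- Hidden layers are 2560-dimensional
- Sequence length of 2048 tokens used (see BLOOM tokenizer, tokenizer description)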
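
The parameter total above can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope check, not the official accounting. It assumes a standard GPT-style block (fused QKV projection, attention output projection, a 4x-wide MLP, two layer norms, all with biases), no learned position embeddings (ALiBi adds none), an output head tied to the embedding matrix, and an embedding matrix of 250,880 rows. That row count is inferred here from 642,252,800 / 2560 and is not stated in this card. The function name `count_parameters` is made up for illustration.

```python
def count_parameters(hidden: int, layers: int, embedding_rows: int) -> dict:
    """Count weights for a GPT-style decoder under the assumptions stated above."""
    layer_norm = 2 * hidden              # scale + bias
    embedding = embedding_rows * hidden  # token embeddings (assumed tied with the output head)
    # Fused QKV projection plus attention output projection, both with biases.
    attention = (hidden * 3 * hidden + 3 * hidden) + (hidden * hidden + hidden)
    # MLP up-projection to 4x hidden and down-projection back, both with biases.
    mlp = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    block = attention + mlp + 2 * layer_norm  # two layer norms per block
    # Embedding layer norm + all blocks + final layer norm.
    total = embedding + layer_norm + layers * block + layer_norm
    return {"embedding": embedding, "per_block": block, "total": total}


counts = count_parameters(hidden=2560, layers=30, embedding_rows=250_880)
print(counts)
```

Under these assumptions the sketch yields 642,252,800 embedding parameters and 3,002,557,440 parameters in total, matching the figures listed above.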

Objective Function: Cross Entropy with mean reduction (see API documentation).
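
As a rough illustration of this objective, the sketch below computes cross entropy with mean reduction in plain Python: the negative log-probability of each target token is averaged over all positions. The helper names and the toy logits are invented for the example; actual training relies on the framework implementation referenced above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over the logits of one position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy_mean(logits_per_position, targets):
    """Average negative log-likelihood of the target token across positions."""
    losses = [-log_softmax(logits)[target]
              for logits, target in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)


# Toy example: 2 positions, vocabulary of 3 tokens.
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = cross_entropy_mean(logits, targets)
print(round(loss, 4))
```

With uniform logits over a vocabulary of size V, the loss at that position equals log(V), which is a handy sanity check for any implementation.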

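Compute Infrastructure: Jean Zay Public Supercomputer, provided by the French government (see announcement).

Hardware: 384 A100 80GB GPUs (48 nodes):
- Additional 32 A100 80GB GPUs (4 nodes) in reserve
- 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
- CPU: AMD
- CPU memory: 512GB per node
- GPU memory: 640GB per node
- Inter-node connect: Omni-Path Architecture (OPA)
- NCCL-communications network: a fully dedicated subnet
- Disk IO network: shared network with other types of nodes
Software:
- Megatron-DeepSpeed (GitHub link)
- DeepSpeed (GitHub link)
- PyTorch (pytorch-1.11 with CUDA-11.5; see GitHub link)
- apex (GitHub link)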

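Training

Number of epochs: 1 (current target)
Dates:
- Started: March 11, 2022, 11:42 a.m. PST
- Ended: July 5, 2022
Estimated cost of training: equivalent of $2-5M in cloud computing (including preliminary experiments)
Server training location: Île-de-France, France

Tokenization

The BLOOM tokenizer (link) is a learned subword tokenizer trained using:

A byte-level Byte Pair Encoding (BPE) algorithm
A simple pre-tokenization rule, no normalization
A vocabulary size of 250,680

It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.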
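
To make the byte-level BPE idea concrete, here is a toy sketch of the training loop: text is split into UTF-8 bytes, and the most frequent adjacent pair is repeatedly merged into a new symbol. It is a simplification for illustration only (no pre-tokenization, a tiny corpus, hypothetical function names) and is not the BLOOM tokenizer's actual training code.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    seq = list(text.encode("utf-8"))  # base symbols are the 256 byte values
    merges = {}                       # (left, right) -> new symbol id
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break                     # nothing left to merge
        best = pairs.most_common(1)[0][0]
        new_id = 256 + step           # ids above 255 denote merged symbols
        merges[best] = new_id
        seq = merge_pair(seq, best, new_id)
    return merges, seq


merges, encoded = train_byte_bpe("low lower lowest", num_merges=3)
print(merges)
print(encoded)
```

Because the base alphabet is the full set of byte values, any input string can be encoded without an unknown-token fallback; the learned merges only shorten the sequence.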

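Environmental Impact

The training supercomputer, Jean Zay (website), uses mostly nuclear energy. The heat it generates is reused for heating campus housing.

Estimated carbon emissions: (Forthcoming upon completion of training.)

Estimated electricity usage: (Forthcoming upon completion of training.)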

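Uses

This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model. It provides information for anyone considering using the model or who is affected by the model.

Intended Use

This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. The use cases below are not exhaustive.

Direct Use

Text generation
Exploring characteristics of language generated by a language model
- Examples: cloze tests, counterfactuals, generations with reframings

Downstream Use

Tasks that leverage language models include: information extraction, question answering, summarization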

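Misuse and Out-of-scope Use

This section addresses what users ought not do with the model.

See the BLOOM License, Attachment A, for detailed usage restrictions. The list below is non-exhaustive, but covers some easily foreseeable problematic use cases.

Out-of-scope Uses

Using the model in high-stakes settings is out of scope for this model. The model is not designed for critical decisions nor for uses with any material consequences on an individual's livelihood or wellbeing. The model's output may appear factual and yet be incorrect.

Out-of-scope uses include:

Usage in biomedical domains, political and legal domains, or finance domains
Usage for evaluating or scoring individuals, such as for employment, education, or credit
Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct

Misuse

Intentionally using the model for harm, violating human rights, or other kinds of malicious activities is a misuse of this model. This includes:

Spam generation
Disinformation and influence operations
Disparagement and defamation
Harassment and abuse
Deception
Unconsented impersonation and imitation
Unconsented surveillance
Generating content without attribution to the model, as specified in the RAIL License, Use Restrictions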

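Intended Users

Direct Users

General public
Researchers
Students
Educators
Engineers/developers
Non-commercial entities
Community advocates, including human and civil rights groups

Indirect Users

Users of derivatives created by Direct Users, such as those using software with an intended use
Users of derivatives of the model, as described in the License

Others Affected (Stakeholders)

People and groups referred to by the LLM
People and groups exposed to outputs of, or decisions based on, the LLM
People and groups whose original work is included in the LLM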

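Training Data

This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.

Details for each dataset are provided in individual Data Cards.

Training data includes:

45 natural languages
12 programming languages
1.5TB of pre-processed text, converted into 350B unique tokens (see the Tokenization section for more).

Languages

The pie chart shows the distribution of languages in the training data.

[Figure: pie chart showing the distribution of languages in the training data]

The following table shows the further distribution of Niger-Congo and Indic languages in the training data.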

Niger-Congo	Percentage	Indic	Percentage
Chitumbuka	0.00002	Assamese	0.01
Kikuyu	0.00004	Odia	0.04
Bambara	0.00004	Gujarati	0.04
Akan	0.00007	Marathi	0.05
Xitsonga	0.00007	Punjabi	0.05
Sesotho	0.00007	Kannada	0.06
Chichewa	0.0001	Nepali	0.07
Setswana	0.0002	Telugu	0.09
Northern Sotho	0.0002	Malayalam	0.10
Fon	0.0002	Urdu	0.10
Kirundi	0.0003	Tamil	0.20
Wolof	0.0004	Bengali	0.50
Luganda	0.0004	Hindi	0.70
Chishona	0.001
isiZulu	0.001
Igbo	0.001
Xhosa	0.001
Kinyarwanda	0.003
Yoruba	0.006
Swahili	0.02

The following table shows the distribution of programming languages.

Extension	Language	Number of files
java	Java	5,407,724
php	PHP	4,942,186
cpp	C++	2,503,930
py	Python	2,435,072
js	JavaScript	1,905,518
cs	C#	1,577,347
rb	Ruby	678,413
cc	C++	443,054
hpp	C++	391,048
lua	Lua	352,317
go	GO	227,763
ts	TypeScript	195,254
C	C	134,537
scala	Scala	92,052
hh	C++	67,161
H	C++	55,899
tsx	TypeScript	33,107
rs	Rust	29,693
phpt	PHP	9,702
c++	C++	1,342
h++	C++	791
php3	PHP	540
phps	PHP	270
php5	PHP	166
php4	PHP	29

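Risks and Limitations

This section identifies foreseeable harms and misunderstandings.

The model may:

Overrepresent some viewpoints and underrepresent others
Contain stereotypes
Contain personal information
Generate:
- Hateful, abusive, or violent language
- Discriminatory or prejudicial language
- Content that may not be appropriate for all settings, including sexual content
Make errors, including producing incorrect information as if it were factual
Generate irrelevant or repetitive outputs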

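Evaluation

This section describes the evaluation protocols and provides the results.

Metrics

This section describes the different ways performance is calculated and why.

Includes:

Metric	Why chosen
Perplexity	Standard metric for quantifying model improvements during training
Cross Entropy Loss	Standard objective for language models

And multiple different metrics for specific tasks. (More evaluation metrics forthcoming upon completion of the evaluation protocol.)

Factors

This section lists some different aspects of BLOOM models. Its focus is on aspects that are likely to give rise to high variance in model behavior.

Language, such as English or Yoruba
Domain, such as newswire or stories
Demographic characteristics, such as gender or nationality

Results

Results are based on the Factors and Metrics.

Zero-shot evaluations:

See this repository for JSON files: https://github.com/bigscience-workshop/evaluation-results

Task	Language	Metric	BLOOM-2B5
arc_challenge	eng	acc ↑	0.28
arc_easy	eng	acc ↑	0.595
axb (Median of 10 prompts)	eng	acc ↑	0.443
axg (Median of 10 prompts)	eng	acc ↑	0.5
boolq (Median of 11 prompts)	eng	acc ↑	0.617
cb (Median of 15 prompts)	eng	acc ↑	0.304
cola (Median of 5 prompts)	eng	acc ↑	0.611
copa (Median of 9 prompts)	eng	acc ↑	0.63
crows_pairs_english (Median of 6 prompts)	eng	acc ↑	0.497
crows_pairs_french (Median of 7 prompts)	fra	acc ↑	0.503
diabla (Median of 2 prompts)	eng	acc ↑	0.289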
gsarti/flores_101_afr	afr	byte_perplexity ↓	6.501
gsarti/flores_101_amh	amh	byte_perplexity ↓	3.973
gsarti/flores_101_ara	ara	byte_perplexity ↓	1.808
gsarti/flores_101_asm	asm	byte_perplexity ↓	5.699
gsarti/flores_101_ast	ast	byte_perplexity ↓	3.925
gsarti/flores_101_azj	azj	byte_perplexity ↓	6.943
gsarti/flores_101_bel	bel	byte_perplexity ↓	3.614
gsarti/flores_101_ben	ben	byte_perplexity ↓	5.121
gsarti/flores_101_bos	bos	byte_perplexity ↓	5.653
gsarti/flores_101_bul	bul	byte_perplexity ↓	2.701
gsarti/flores_101_cat	cat	byte_perplexity ↓	2.305
gsarti/flores_101_ceb	ceb	byte_perplexity ↓	6.291
gsarti/flores_101_ces	ces	byte_perplexity ↓	5.447
gsarti/flores_101_ckb	ckb	byte_perplexity ↓	3.726
gsarti/flores_101_cym	cym	byte_perplexity ↓	12.539
gsarti/flores_101_dan	dan	byte_perplexity ↓	5.183
gsarti/flores_101_deu	deu	byte_perplexity ↓	3.118
gsarti/flores_101_ell	ell	byte_perplexity ↓	2.468
gsarti/flores_101_eng	eng	byte_perplexity ↓	2.019
gsarti/flores_101_est	est	byte_perplexity ↓	9.117
gsarti/flores_101_fas	fas	byte_perplexity ↓	3.058
gsarti/flores_101_fin	fin	byte_perplexity ↓	6.847
gsarti/flores_101_fra	fra	byte_perplexity ↓	1.998
gsarti/flores_101_ful	ful	byte_perplexity ↓	11.466
gsarti/flores_101_gle	gle	byte_perplexity ↓	8.681
gsarti/flores_101_glg	glg	byte_perplexity ↓	3.03
gsarti/flores_101_guj	guj	byte_perplexity ↓	4.955
gsarti/flores_101_hau	hau	byte_perplexity ↓	10.758
gsarti/flores_101_heb	heb	byte_perplexity ↓	3.6
gsarti/flores_101_hin	hin	byte_perplexity ↓	4.713
gsarti/flores_101_hrv	hrv	byte_perplexity ↓	5.822
gsarti/flores_101_hun	hun	byte_perplexity ↓	6.44
gsarti/flores_101_hye	hye	byte_perplexity ↓	3.658
gsarti/flores_101_ibo	ibo	byte_perplexity ↓	5.565
gsarti/flores_101_ind	ind	byte_perplexity ↓	2.16
gsarti/flores_101_isl	isl	byte_perplexity ↓	8.082
gsarti/flores_101_ita	ita	byte_perplexity ↓	2.969
gsarti/flores_101_jav	jav	byte_perplexity ↓	7.057
gsarti/flores_101_jpn	jpn	byte_perplexity ↓	2.776
gsarti/flores_101_kam	kam	byte_perplexity ↓	11.073
gsarti/flores_101_kan	kan	byte_perplexity ↓	5.552
gsarti/flores_101_kat	kat	byte_perplexity ↓	2.523
gsarti/flores_101_kaz	kaz	byte_perplexity ↓	3.39
gsarti/flores_101_kea	kea	byte_perplexity ↓	8.919
gsarti/flores_101_kir	kir	byte_perplexity ↓	3.729
gsarti/flores_101_kor	kor	byte_perplexity ↓	3.933
gsarti/flores_101_lao	lao	byte_perplexity ↓	2.908
gsarti/flores_101_lav	lav	byte_perplexity ↓	7.777
gsarti/flores_101_lin	lin	byte_perplexity ↓	7.525
gsarti/flores_101_lit	lit	byte_perplexity ↓	7.369
gsarti/flores_101_ltz	ltz	byte_perplexity ↓	8.801
gsarti/flores_101_lug	lug	byte_perplexity ↓	8.483
gsarti/flores_101_luo	luo	byte_perplexity ↓	11.976
gsarti/flores_101_mal	mal	byte_perplexity ↓	4.616
gsarti/flores_101_mar	mar	byte_perplexity ↓	5.483
gsarti/flores_101_mkd	mkd	byte_perplexity ↓	2.966
gsarti/flores_101_mlt	mlt	byte_perplexity ↓	15.005
gsarti/flores_101_mon	mon	byte_perplexity ↓	3.411
gsarti/flores_101_mri	mri	byte_perplexity ↓	7.474
gsarti/flores_101_msa	msa	byte_perplexity ↓	2.571
gsarti/flores_101_mya	mya	byte_perplexity ↓	2.414
gsarti/flores_101_nld	nld	byte_perplexity ↓	4.128
gsarti/flores_101_nob	nob	byte_perplexity ↓	5.403
gsarti/flores_101_npi	npi	byte_perplexity ↓	5.199
gsarti/flores_101_nso	nso	byte_perplexity ↓	8.155
gsarti/flores_101_nya	nya	byte_perplexity ↓	8.18
gsarti/flores_101_oci	oci	byte_perplexity ↓	4.862
gsarti/flores_101_orm	orm	byte_perplexity ↓	12.912
gsarti/flores_101_ory	ory	byte_perplexity ↓	5.189
gsarti/flores_101_pan	pan	byte_perplexity ↓	4.698
gsarti/flores_101_pol	pol	byte_perplexity ↓	4.626
gsarti/flores_101_por	por	byte_perplexity ↓	1.975
gsarti/flores_101_pus	pus	byte_perplexity ↓	4.496
gsarti/flores_101_ron	ron	byte_perplexity ↓	4.965
gsarti/flores_101_rus	rus	byte_perplexity ↓	2.05
gsarti/flores_101_slk	slk	byte_perplexity ↓	6.451
gsarti/flores_101_slv	slv	byte_perplexity ↓	6.62
gsarti/flores_101_sna	sna	byte_perplexity ↓	8.462
gsarti/flores_101_snd	snd	byte_perplexity ↓	5.466
gsarti/flores_101_som	som	byte_perplexity ↓	11.959
gsarti/flores_101_spa	spa	byte_perplexity ↓	1.897
gsarti/flores_101_srp	srp	byte_perplexity ↓	2.871
gsarti/flores_101_swe	swe	byte_perplexity ↓	5.055
gsarti/flores_101_swh	swh	byte_perplexity ↓	3.697
gsarti/flores_101_tam	tam	byte_perplexity ↓	4.539
gsarti/flores_101_tel	tel	byte_perplexity ↓	5.807
gsarti/flores_101_tgk	tgk	byte_perplexity ↓	3.599
gsarti/flores_101_tgl	tgl	byte_perplexity ↓	5.667
gsarti/flores_101_tha	tha	byte_perplexity ↓	2.366
gsarti/flores_101_tur	tur	byte_perplexity ↓	4.885
gsarti/flores_101_ukr	ukr	byte_perplexity ↓	2.724
gsarti/flores_101_umb	umb	byte_perplexity ↓	12.767
gsarti/flores_101_urd	urd	byte_perplexity ↓	1.98
gsarti/flores_101_uzb	uzb	byte_perplexity ↓	12.002
gsarti/flores_101_vie	vie	byte_perplexity ↓	1.766
gsarti/flores_101_wol	wol	byte_perplexity ↓	9.144
gsarti/flores_101_xho	xho	byte_perplexity ↓	7.403
gsarti/flores_101_yor	yor	byte_perplexity ↓	5.913
gsarti/flores_101_zho_simpl	zho_simpl	byte_perplexity ↓	2.277
gsarti/flores_101_zho_trad	zho_trad	byte_perplexity ↓	2.518
gsarti/flores_101_zul	zul	byte_perplexity ↓	8.534
headqa	esp	acc ↑	0.264
hellaswag	eng	acc ↑	0.412
logiqa	eng	acc ↑	0.207
mathqa	eng	acc ↑	0.25
mc_taco	eng	em ↑	0.119
mnli (Median of 15 prompts)	eng	acc ↑	0.355
mnli_mismatched (Median of 15 prompts)	eng	acc ↑	0.352
mrpc	eng	acc ↑	0.586
multirc (Median of 11 prompts)	eng	acc ↑	0.538
openbookqa	eng	acc ↑	0.216
piqa	eng	acc ↑	0.708
prost	eng	acc ↑	0.227
pubmedqa	eng	acc ↑	0.616
qnli	eng	acc ↑	0.507
qqp (Median of 7 prompts)	eng	acc ↑	0.384
race	eng	acc ↑	0.352
rte (Median of 6 prompts)	eng	acc ↑	0.477
sciq	eng	acc ↑	0.892
sst (Median of 6 prompts)	eng	acc ↑	0.518
triviaqa	eng	acc ↑	0.042
tydiqa_primary (Median of 24 prompts)	eng	acc ↑	0.301
webqs	eng	acc ↑	0.017
wic (Median of 11 prompts)	eng	acc ↑	0.502
winogrande	eng	acc ↑	0.586
wnli (Median of 6 prompts)	eng	acc ↑	0.472
wsc (Median of 11 prompts)	eng	acc ↑	0.442
humaneval	python	pass@1 ↑	0.155
humaneval	python	pass@10 ↑	0.322
humaneval	python	pass@100 ↑	0.555
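
This card does not state how the pass@k figures in the last three rows were computed. A commonly used approach is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests; per-problem values are then averaged over the benchmark. The sketch below implements that estimator as an assumption, with a hypothetical function name and made-up sample counts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c out of n generated samples passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failing samples exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 20 samples for one problem, 3 of which passed.
for k in (1, 5, 10):
    print(k, pass_at_k(n=20, c=3, k=k))
```

The estimate grows with k because drawing more samples makes it more likely that at least one of them is correct.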

Train-time evaluation:

As of May 25, 2022, 15:00 PST:

Training loss: 2.0
Validation loss: 2.2
Perplexity: 8.9

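Recommendations

This section provides information on warnings and potential mitigations.

Indirect users should be made aware when the content they are working with was created by the LLM.
Users should be aware of the Risks and Limitations, and include an appropriate age disclaimer or blocking interface as necessary.
Models pretrained with the LLM should include an updated Model Card.
Users of the model should provide mechanisms for those affected to give feedback, such as an email address for comments.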

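Glossary and Calculations

This section defines common terms and how metrics are calculated.

Loss: A calculation of the difference between what the model has learned and what the data shows ("ground truth"). The lower the loss, the better. The training process aims to minimize the loss.
Perplexity: This is based on what the model estimates the probability of new data to be. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically, this is calculated using entropy.
High-stakes settings: Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed Artificial Intelligence (AI) Act.
Critical decisions: Such as those defined in the United States' proposed Algorithmic Accountability Act.
Human rights: Includes those rights defined in the Universal Declaration of Human Rights.
Personal Data and Personal Information: Personal data and personal information are defined in multiple data protection regulations, such as "personal data" in the European Union's General Data Protection Regulation, and "personal information" in the Republic of South Africa's Protection of Personal Information Act and the People's Republic of China's Personal Information Protection Law.
Sensitive characteristics: This includes specifically protected categories in human rights (see UDHR, Article 2) and personal information regulation (see GDPR, Article 9; Protection of Personal Information Act, Chapter 1).
Deception: Doing something to intentionally mislead individuals into believing something that is false, such as creating bot accounts or chatbots on social media that pose as real people, or generating text documents without making consumers aware that the text is machine generated.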
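
The relationship between loss and perplexity described in this glossary can be sketched as follows, assuming the loss is the mean negative log-likelihood in nats, so that perplexity is its exponential. The function name and the toy probabilities are illustrative only.

```python
import math


def perplexity(token_probabilities):
    """Exponentiated mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 1 to every observed token has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))

# Guessing uniformly among 4 tokens gives a perplexity of 4 (up to floating-point rounding).
print(perplexity([0.25, 0.25, 0.25]))
```

By the same relation, `math.exp(2.2)` is roughly 9.03, in the same range as the perplexity of 8.9 reported alongside a validation loss of 2.2 in the train-time evaluation; both reported figures are rounded, so a small mismatch is expected.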

Model Card Authors

Ordered roughly chronologically and by amount of time spent.

Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay, Niklas Muennighoff


