Model Card for Switch Transformers Base - 8 experts

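Summary

Switch Transformers is a Mixture of Experts (MoE) model trained on a masked language modeling (MLM) task. Its architecture is similar to the classic T5 model, except that the feed-forward layers are replaced by sparse MLP layers containing "expert" MLPs. According to the original paper, the model trains faster (a scaling benefit) while outperforming T5 on fine-tuned tasks. As stated in the first lines of the paper's abstract:

We advance the current scale of language models by pre-training up to trillion-parameter models on the "Colossal Clean Crawled Corpus", and achieve a 4x speedup over the T5-XXL model.

Disclaimer: The content of this model card was written by the Hugging Face team, and parts of it were copied directly from the original paper.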
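
To make the sparse expert layer described in the summary more concrete, here is a minimal, dependency-free sketch of top-1 ("switch") routing: a router scores every expert for each token, and only the highest-scoring expert MLP processes that token. The names `switch_ffn`, `matvec`, and `random_matrix`, the ReLU activation, and the tiny dimensions are illustrative assumptions. The sketch leaves out expert capacity limits and the auxiliary load-balancing loss, and it is not the implementation used in transformers or t5x.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_ffn(tokens, router_w, experts):
    """Route each token to exactly one expert MLP (top-1 routing).

    tokens:   list of d_model-dimensional vectors
    router_w: d_model x n_experts router weights
    experts:  list of (w_in, w_out) weight pairs, one per expert
    """
    outputs, choices = [], []
    for x in tokens:
        probs = softmax(matvec(x, router_w))
        k = max(range(len(probs)), key=probs.__getitem__)  # index of the chosen expert
        w_in, w_out = experts[k]
        hidden = [max(h, 0.0) for h in matvec(x, w_in)]  # ReLU (assumed activation)
        # Scale the expert output by the router probability of the chosen expert.
        outputs.append([probs[k] * y for y in matvec(hidden, w_out)])
        choices.append(k)
    return outputs, choices


def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff, n_experts = 4, 8, 8
router_w = random_matrix(rng, d_model, n_experts)
experts = [
    (random_matrix(rng, d_model, d_ff), random_matrix(rng, d_ff, d_model))
    for _ in range(n_experts)
]
tokens = random_matrix(rng, 5, d_model)
outputs, choices = switch_ffn(tokens, router_w, experts)
print(choices)  # one expert index per token
```

Because every token activates a single expert, the compute per token stays close to that of one dense MLP even as the number of experts, and therefore the parameter count, grows.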

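Model Details

Model Description

Model type: Language model
Language(s) (NLP): English
License: Apache 2.0
Related Models: All Switch Transformers Checkpoints
Original Checkpoints: All Original Switch Transformers Checkpoints
Resources for more information: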

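Usage

Note that these checkpoints were trained on a masked language modeling (MLM) task, so they are not ready to use for downstream tasks as-is. You may want to look at FLAN-T5 to run fine-tuned weights, or fine-tune your own MoE model by following this notebook.

Below are some example scripts showing how to use the model in transformers:

Using the PyTorch model

Running the model on a CPU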

from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Load the tokenizer and the model weights (on CPU by default).
tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-8")

# Each <extra_id_N> sentinel marks a masked span for the model to fill in.
input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
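
The decoded string lists each sentinel token followed by the text predicted for that gap. As a small illustration of how to read it, the sketch below substitutes those predictions back into the input. The helper name `fill_sentinels` is hypothetical (it is not part of transformers), and it assumes the decoded output has exactly the format shown above.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def fill_sentinels(input_text, decoded_output):
    """Replace each sentinel in input_text with the span predicted for it."""
    # Drop the special tokens that are not sentinels.
    cleaned = decoded_output.replace("<pad>", "").replace("</s>", "")
    # Splitting on a pattern with a capture group yields
    # [prefix, id_0, text_0, id_1, text_1, ...].
    parts = SENTINEL.split(cleaned)
    spans = {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    # Sentinels without a prediction are left untouched.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), input_text)


input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
decoded = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"
print(fill_sentinels(input_text, decoded))
```

The trailing `<extra_id_4>` in the output has no counterpart in the input, so its text is simply ignored.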

Running the model on a GPU

# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
# device_map="auto" lets accelerate place the weights on the available device(s).
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-8", device_map="auto")

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the inputs to GPU 0 so they sit on the same device as the model.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU using different precisions

FP16 (half precision)

# pip install accelerate
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
# torch_dtype=torch.float16 loads the weights in half precision.
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-8", device_map="auto", torch_dtype=torch.float16)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

INT8

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, BitsAndBytesConfig, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
# Request 8-bit weights explicitly; without a quantization config the model
# is loaded in its default precision.
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-base-8",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
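
The precision options above mainly trade numerical fidelity for memory. As a rough back-of-the-envelope sketch, the memory taken by the weights is approximately the parameter count times the bytes per parameter. The parameter count below is a made-up placeholder, not the actual size of google/switch-base-8, and the estimate ignores activations as well as any layers that 8-bit loading keeps in higher precision.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


num_params = 500_000_000  # hypothetical placeholder, NOT the real model size
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

For a loaded PyTorch model, the real count can be obtained with `sum(p.numel() for p in model.parameters())` and passed in place of the placeholder.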

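Uses

Direct Use and Downstream Use

See the research paper for further details.

Out-of-Scope Use

More information needed.

Bias, Risks, and Limitations

More information needed.

Ethical considerations and risks

More information needed.

Known Limitations

More information needed.

Sensitive Use:

More information needed.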

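Training Details

Training Data

The model was trained on a masked language modeling task using the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as T5.

Training Procedure

According to the model card from the original paper, the model was trained on TPU v3 or TPU v4 pods, using the t5x codebase together with jax.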

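Evaluation

Testing Data, Factors & Metrics

The authors evaluated the model on various tasks and compared the results against T5. Some quantitative results are shown in the table below. For full details, see the research paper.

Results

For the full results of Switch Transformers, see Table 5 of the research paper.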

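Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4
Hours used: More information needed
Cloud Provider: GCP
Compute Region: More information needed
Carbon Emitted: More information needed

Citation

BibTeX: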

@misc{https://doi.org/10.48550/arxiv.2101.03961,
  doi = {10.48550/ARXIV.2101.03961},
  url = {https://arxiv.org/abs/2101.03961},
  author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
