Model Card for Switch Transformers Large - 128 experts

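Model architecture diagram

Summary

Switch Transformers is a Mixture of Experts (MoE) model trained on the Masked Language Modeling (MLM) task. Its architecture is similar to the classic T5, but the feed-forward layers are replaced by sparse MLP layers containing "expert" MLPs. According to the original paper, the model trains faster than T5 (better scaling properties) while also outperforming it on fine-tuned tasks. As the paper's abstract puts it:

We advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus", and achieve a 4x speedup over the T5-XXL model.

Disclaimer: The content of this model card was written by the Hugging Face team, and parts of it were copied from the original paper.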
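To make the sparse "expert" layer more concrete, here is a minimal sketch of top-1 ("switch") routing as described in the Switch Transformers paper: a router scores every expert for a token, the token is sent only to the highest-scoring expert, and that expert's output is scaled by the router probability. This is a toy illustration in plain Python, not the transformers implementation; `switch_route`, the router weights, and the toy experts are hypothetical names and values chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token, router_weights, experts):
    """Route one token vector to a single expert (top-1 routing).

    Returns (expert_index, gate_probability, gated_output).
    """
    # One router logit per expert: dot product of that expert's row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Pick the single most probable expert.
    expert_idx = max(range(len(probs)), key=probs.__getitem__)
    gate = probs[expert_idx]
    # Only the chosen expert runs; its output is scaled by the gate value.
    output = [gate * y for y in experts[expert_idx](token)]
    return expert_idx, gate, output


# Hypothetical toy setup: 3 "experts" acting on 2-dimensional tokens.
ROUTER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
EXPERTS = [
    lambda x: [2.0 * v for v in x],  # expert 0 doubles the input
    lambda x: [-v for v in x],       # expert 1 negates the input
    lambda x: [0.0 for _ in x],      # expert 2 outputs zeros
]

idx, gate, out = switch_route([1.0, 2.0], ROUTER_WEIGHTS, EXPERTS)
print(idx, round(gate, 3), out)
```

Because only one expert runs per token, the compute per token stays close to that of a dense feed-forward layer even as the number of experts (and therefore parameters) grows.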

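Model Details

Model Description

Model type: Language model
Language(s) (NLP): English
License: Apache 2.0
Related Models: All Switch Transformers Checkpoints
Original Checkpoints: All Original Switch Transformers Checkpoints
Resources for more information: the Switch Transformers research paper (arXiv:2101.03961; see Citation)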

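Usage

Note that these checkpoints were trained on the Masked Language Modeling (MLM) task, so they are not ready to use for downstream tasks as-is. You may want to check FLAN-T5 for fine-tuned weights, or fine-tune your own MoE model following this notebook.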
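Since the checkpoints expect MLM-style inputs, it may help to see how such an input and its target are laid out. The sketch below builds a T5-style span-corruption pair with `<extra_id_N>` sentinel tokens. It assumes Switch Transformers uses the same format as T5 (this card states that training follows T5). `span_corrupt` is a hypothetical helper that works on whitespace-separated words rather than real tokenizer output.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (input_text, target_text). Spans are half-open index ranges
    into `tokens` and are assumed not to overlap.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # the target names the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the hidden text
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


words = "A man walks into a bar and orders a beer".split()
masked, target = span_corrupt(words, [(1, 2), (9, 10)])
print(masked)  # A <extra_id_0> walks into a bar and orders a <extra_id_1>
print(target)  # <extra_id_0> man <extra_id_1> beer <extra_id_2>
```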

Below are some example scripts showing how to use the model in transformers:

Using the PyTorch model

Running the model on a CPU

from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

# Load the tokenizer and the model weights (on CPU by default)
tokenizer = AutoTokenizer.from_pretrained("google/switch-large-128")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-large-128")

# Sentinel tokens (<extra_id_N>) mark the spans the model should fill in
input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Output reported in the original card:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
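The decoded string in the example above interleaves sentinel tokens with the predicted text. The following sketch shows one way such a string could be split into per-sentinel predictions and substituted back into the masked input. `parse_sentinel_output` and `fill_in` are hypothetical helpers written for this illustration, not part of transformers.

```python
import re


def parse_sentinel_output(decoded):
    """Map each sentinel index to the text that follows it in `decoded`."""
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # With a capture group, re.split yields [prefix, id0, text0, id1, text1, ...].
    parts = re.split(r"<extra_id_(\d+)>", cleaned)
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def fill_in(masked_text, fills):
    """Replace each sentinel in `masked_text` with its predicted text, if any."""
    return re.sub(
        r"<extra_id_(\d+)>",
        lambda m: fills.get(int(m.group(1)), m.group(0)),
        masked_text,
    )


decoded = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"
masked = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."

fills = parse_sentinel_output(decoded)
print(fill_in(masked, fills))
# A man walks into a bar a orders a beer with a pinch of salt.
```

Text after the last sentinel (here the "." after `<extra_id_4>`) has no matching slot in the input, so `fill_in` ignores it.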

Running the model on a GPU

# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-large-128")
# device_map="auto" lets accelerate place the weights on the available devices
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-large-128", device_map="auto")

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the inputs to GPU 0
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Output reported in the original card:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU using different precisions

FP16 (half-precision floating point)

# pip install accelerate
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-large-128")
# torch_dtype=torch.float16 loads the weights in half precision
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-large-128", device_map="auto", torch_dtype=torch.float16)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the inputs to GPU 0
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Output reported in the original card:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

INT8

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, BitsAndBytesConfig, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-large-128")
# Request 8-bit weights via bitsandbytes; without a quantization config the model loads at its default precision
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-large-128", device_map="auto", quantization_config=BitsAndBytesConfig(load_in_8bit=True))

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the inputs to GPU 0
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Output reported in the original card:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
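The precision options above mainly trade numerical precision for memory. As a rough back-of-the-envelope check, the memory needed for the weights is about the parameter count times the bytes per parameter. The sketch below does that arithmetic. `weight_memory_gib` is a hypothetical helper, and the parameter count is a placeholder rather than the actual size of this checkpoint. The estimate ignores activations, buffers, and any layers a quantization backend keeps in higher precision.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory (in GiB) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Placeholder parameter count, for illustration only.
EXAMPLE_NUM_PARAMS = 1_000_000_000

for dtype in ("float32", "float16", "int8"):
    print(dtype, round(weight_memory_gib(EXAMPLE_NUM_PARAMS, dtype), 2), "GiB")
```

For an MoE model the parameter count includes every expert. All experts typically need to be resident in memory even though each token is routed to only one of them.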

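Uses

Direct Use and Downstream Use

See the research paper for further details.

Bias, Risks, and Limitations

More information needed.

Ethical Considerations and Risks

More information needed.

Known Limitations

More information needed.

Sensitive Use

More information needed.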

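Training Details

Training Data

The model was trained on a Masked Language Modeling task using the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as T5.

Training Procedure

According to the model card from the original paper, the model was trained on TPU v3 or TPU v4 pods, using the t5x codebase together with jax.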

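Evaluation

Testing Data, Factors & Metrics

The authors evaluated the model on various tasks and compared the results against T5. Selected quantitative results appeared in a table in the original card; for full details, see the research paper.

Results

For the full results of Switch Transformers, see Table 5 of the research paper.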

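Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4
Hours used: More information needed
Cloud Provider: GCP
Compute Region: More information needed
Carbon Emitted: More information needed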
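Estimates of this kind are, roughly, energy consumed multiplied by the carbon intensity of the electricity used. The sketch below shows that arithmetic under stated assumptions. `estimate_co2_kg` is a hypothetical helper, not the calculator's actual code. Every number passed to it is a placeholder, since this card does not report hours used or compute region.

```python
def estimate_co2_kg(chip_power_watts, num_chips, hours,
                    carbon_intensity_kg_per_kwh, pue=1.0):
    """Rough CO2-equivalent estimate in kilograms.

    energy (kWh) = power per chip (W) * chips * hours / 1000 * PUE
    emissions    = energy * grid carbon intensity (kg CO2eq per kWh)
    """
    energy_kwh = chip_power_watts * num_chips * hours / 1000 * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


# Placeholder inputs for illustration only; none of these come from the card.
print(estimate_co2_kg(chip_power_watts=250, num_chips=4, hours=10,
                      carbon_intensity_kg_per_kwh=0.5))
```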

Citation

BibTeX:

@misc{https://doi.org/10.48550/arxiv.2101.03961,
  doi = {10.48550/ARXIV.2101.03961},
  url = {https://arxiv.org/abs/2101.03961},
  author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
