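Model Card for Switch Transformers Base - 32 experts

Summary

Switch Transformers is a Mixture of Experts (MoE) model trained on a Masked Language Modeling (MLM) task. The architecture is similar to the classic T5, but the feed-forward layers are replaced by sparse MLP layers containing "expert" MLPs. According to the original paper, the model trains faster (with good scaling properties) while outperforming T5 on fine-tuned tasks.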
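
To make the sparse expert layer more concrete, here is a minimal sketch of top-1 ("switch") routing in plain Python. It is an illustration under simplifying assumptions, not the actual implementation: the class names `ToyExpert` and `ToySwitchLayer`, the dimensions, and the random weights are all hypothetical, and details such as expert capacity limits and the auxiliary load-balancing loss are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyExpert:
    """A tiny two-layer MLP with ReLU, standing in for one expert (hypothetical)."""

    def __init__(self, d_model, d_ff, rng):
        self.w_in = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
        self.w_out = [[rng.uniform(-0.1, 0.1) for _ in range(d_ff)] for _ in range(d_model)]

    def __call__(self, token):
        hidden = [max(0.0, h) for h in matvec(self.w_in, token)]
        return matvec(self.w_out, hidden)


class ToySwitchLayer:
    """Routes each token to exactly one expert (top-1 routing)."""

    def __init__(self, d_model, d_ff, num_experts, seed=0):
        rng = random.Random(seed)
        # The router is a single linear map from a token to one logit per expert.
        self.router = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(num_experts)]
        self.experts = [ToyExpert(d_model, d_ff, rng) for _ in range(num_experts)]

    def __call__(self, tokens):
        outputs, chosen = [], []
        for token in tokens:
            probs = softmax(matvec(self.router, token))
            best = max(range(len(probs)), key=probs.__getitem__)
            expert_out = self.experts[best](token)
            # Only the selected expert runs; its output is scaled by the router probability.
            outputs.append([probs[best] * y for y in expert_out])
            chosen.append(best)
        return outputs, chosen


layer = ToySwitchLayer(d_model=4, d_ff=8, num_experts=32)
data_rng = random.Random(1)
tokens = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
outputs, chosen = layer(tokens)
print(chosen)  # index of the expert that processed each token
```

Because each token activates a single expert, the compute per token stays roughly that of one feed-forward layer even as the number of experts (and thus parameters) grows.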

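As stated in the first lines of the abstract:

We advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus", and achieve a 4x speedup over the T5-XXL model.

Disclaimer: The content of this model card was written by the Hugging Face team, and parts of it were copied from the original paper.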

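Model Details

Model Description

Model type: Language model
Language(s) (NLP): English
License: Apache 2.0
Related Models: [All Switch Transformers Checkpoints]
Original Checkpoints: [All Original Switch Transformers Checkpoints]
Resources for more information:
- Research paper
- GitHub repo
- [Hugging Face Switch Transformers docs (similar to T5)]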

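Usage

Note that these checkpoints were trained on a Masked Language Modeling (MLM) task, so they are not "ready-to-use" for downstream tasks. To run fine-tuned weights, you can look at FLAN-T5, or fine-tune your own MoE model by following this notebook.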
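
As a rough illustration of what the MLM objective looks like for a T5-style model, the sketch below builds one input/target pair by replacing chosen word spans with `<extra_id_N>` sentinel tokens. The helper `corrupt_spans` is hypothetical and the spans are picked by hand; the actual preprocessing selects spans randomly and operates on tokens rather than whitespace-separated words.

```python
def corrupt_spans(words, spans):
    """Replace each (start, end) word span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    input_parts, target_parts = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the untouched words before the span, then mask the span itself.
        input_parts.extend(words[cursor:start])
        input_parts.append(sentinel)
        # The target lists each sentinel followed by the words it replaced.
        target_parts.append(sentinel)
        target_parts.extend(words[start:end])
        cursor = end
    input_parts.extend(words[cursor:])
    # A final sentinel closes the target sequence.
    target_parts.append(f"<extra_id_{len(spans)}>")
    return " ".join(input_parts), " ".join(target_parts)


words = "A man walks into a bar and orders a beer with a pinch of salt .".split()
masked_input, target = corrupt_spans(words, [(1, 2), (9, 10), (11, 12), (14, 15)])
print(masked_input)
print(target)
```

The model is trained to produce the target from the masked input, which is why the pretrained checkpoints complete sentinel spans rather than follow task instructions.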

Below are some example scripts showing how to use the model with the transformers library:

Using the PyTorch model

Running the model on a CPU

from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration
from openmind_hub import snapshot_download

# Download the model snapshot and get its local path
model_path = snapshot_download("LF_AICC/switch-base-32")

# Load the tokenizer and the model (the model stays on the CPU by default)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = SwitchTransformersForConditionalGeneration.from_pretrained(model_path)

# Sentinel tokens <extra_id_N> mark the spans the model should fill in
input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
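
The decoded output interleaves sentinel tokens with the text predicted for each masked span. The following sketch shows one way to turn that string back into a filled-in sentence; the helpers `parse_fills` and `fill_in` are hypothetical and not part of transformers, and they assume the output follows the sentinel format shown in the example.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_fills(decoded):
    """Map each sentinel index to the text generated after it."""
    # Drop the special tokens that frame the sequence.
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # re.split with one capture group yields [prefix, id0, text0, id1, text1, ...].
    pieces = SENTINEL.split(cleaned)
    fills = {}
    for index, text in zip(pieces[1::2], pieces[2::2]):
        fills[int(index)] = text.strip()
    return fills


def fill_in(masked_text, fills):
    """Substitute each sentinel in the input with its predicted text."""
    return SENTINEL.sub(lambda m: fills.get(int(m.group(1)), m.group(0)), masked_text)


decoded = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"
masked = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
fills = parse_fills(decoded)
print(fill_in(masked, fills))
```

Sentinels that have no prediction are left in place, so an incomplete generation remains visible instead of being silently dropped.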

Running the model on a GPU

# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration
from openmind_hub import snapshot_download

# "npu:0" selects the first Ascend NPU and assumes torch_npu is installed;
# on a CUDA GPU, use "cuda:0" instead.
device = "npu:0"

# Download the model snapshot and get its local path
model_path = snapshot_download("LF_AICC/switch-base-32")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = SwitchTransformersForConditionalGeneration.from_pretrained(model_path, device_map=device)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# The inputs must live on the same device as the model
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU using different precisions

FP16

# pip install accelerate
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration
from openmind_hub import snapshot_download

# "npu:0" selects the first Ascend NPU and assumes torch_npu is installed;
# on a CUDA GPU, use "cuda:0" instead.
device = "npu:0"

# Download the model snapshot and get its local path
model_path = snapshot_download("LF_AICC/switch-base-32")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# torch_dtype=torch.float16 loads the weights in half precision
model = SwitchTransformersForConditionalGeneration.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

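INT8

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration
from openmind_hub import snapshot_download

# "npu:0" selects the first Ascend NPU and assumes torch_npu is installed;
# on a CUDA GPU, use "cuda:0" instead.
device = "npu:0"

# Download the model snapshot and get its local path
model_path = snapshot_download("LF_AICC/switch-base-32")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# As written, this call loads the weights in their default precision. 8-bit loading
# through bitsandbytes additionally needs a quantization option such as
# load_in_8bit=True, and bitsandbytes primarily targets CUDA GPUs.
model = SwitchTransformersForConditionalGeneration.from_pretrained(model_path, device_map=device)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>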
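
The main reason to pick a lower precision is memory. The sketch below estimates how much memory the weights alone would occupy at each precision. The parameter count is a hypothetical placeholder, not the size of this checkpoint, and the estimate ignores activations and any layers a quantization backend keeps in higher precision.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Hypothetical parameter count, used only for illustration; read the real value
# from a loaded model, e.g. sum(p.numel() for p in model.parameters()).
hypothetical_params = 2_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(hypothetical_params, dtype):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which matters for MoE models because every expert must be resident in memory even though only one is used per token.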

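Uses

Direct Use and Downstream Use

See the research paper for further details.

Out-of-Scope Use

More information needed.

Bias, Risks, and Limitations

More information needed.

Ethical considerations and risks

More information needed.

Known Limitations

More information needed.

Sensitive Use:

More information needed.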

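Training Details

Training Data

The model was trained on a Masked Language Modeling task on the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as T5.

Training Procedure

According to the model card from the original paper, the model was trained on TPU v3 or TPU v4 pods, using the t5x codebase together with jax.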

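Evaluation

Testing Data, Factors & Metrics

The authors evaluated the model on various tasks and compared the results against T5. For quantitative results and full details, see the research paper.

Results

For full results for Switch Transformers, see Table 5 of the research paper.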

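Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
Hours used: More information needed
Cloud Provider: GCP
Compute Region: More information needed
Carbon Emitted: More information needed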
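
For readers who want a back-of-the-envelope figure, the sketch below shows the general shape of such an estimate: energy used multiplied by the carbon intensity of the local grid. It is not the calculator's exact method, and every input value is a hypothetical placeholder, since this card does not report the hours used or the compute region.

```python
def estimate_co2_kg(chip_power_watts, num_chips, hours, carbon_intensity_kg_per_kwh, pue=1.0):
    """Estimate emissions as energy used times the grid's carbon intensity.

    `pue` (power usage effectiveness) scales chip energy up to data-center energy.
    """
    energy_kwh = chip_power_watts * num_chips * hours * pue / 1000.0
    return energy_kwh * carbon_intensity_kg_per_kwh


# All inputs are hypothetical placeholders chosen for easy arithmetic.
example = estimate_co2_kg(
    chip_power_watts=200.0,
    num_chips=4,
    hours=100.0,
    carbon_intensity_kg_per_kwh=0.4,
)
print(f"{example:.1f} kg CO2eq")
```

With the real hours, chip count, and regional carbon intensity filled in, the same arithmetic gives a first-order estimate that can be compared against the calculator's result.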

Citation

BibTeX:

@misc{https://doi.org/10.48550/arxiv.2101.03961,
  doi = {10.48550/ARXIV.2101.03961},
  url = {https://arxiv.org/abs/2101.03961},
  author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

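Table of Contents

Summary
Model Details
Model Description
Usage
Using the PyTorch model
Running the model on a CPU
Running the model on a GPU
Running the model on a GPU using different precisions
FP16
INT8
Uses
Direct Use and Downstream Use
Out-of-Scope Use
Bias, Risks, and Limitations
Ethical considerations and risks
Known Limitations
Sensitive Use:
Training Details
Training Data
Training Procedure
Evaluation
Testing Data, Factors & Metrics
Results
Environmental Impact
Citation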
