XLM-RoBERTa-XL (extra-large-sized model)

The XLM-RoBERTa-XL model was pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper Larger-Scale Transformers for Multilingual Masked Language Modeling (https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, and Alexis Conneau, and first released in this repository.

Disclaimer: The team releasing XLM-RoBERTa-XL did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

XLM-RoBERTa-XL is an extra-large multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.

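RoBERTa is a transformers model pre-trained on a large corpus in a self-supervised fashion. This means it was pre-trained on raw text only, with no human labelling of any kind (which is why it can use large amounts of publicly available data), and with an automatic process that generates inputs and labels from that text.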

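More precisely, it was pre-trained with the masked language modeling (MLM) objective. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.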
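
The masking step described above can be illustrated with a minimal sketch. It is not the training code used for this model: it assumes whitespace-split words instead of the model's real tokenizer, replaces every selected position with the `<mask>` string, and uses a hypothetical helper name, `mask_tokens`. The labels keep the original word at masked positions only, which is where a prediction loss would be computed.

```python
import random

MASK_TOKEN = "<mask>"  # same mask string as in the fill-mask example of this card


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one illustrative MLM example.

    labels[i] holds the original token at masked positions and None elsewhere.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in masked_positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels


# Illustrative input only; whitespace splitting stands in for real tokenization.
words = "Europe is a large continent with many different languages".split()
masked, labels = mask_tokens(words, seed=0)
print(masked)
print(labels)
```

Because the model sees the whole masked sequence at once, it can use context on both sides of each `<mask>` position when predicting the original word.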

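In this way, the model learns an inner representation of 100 languages that can be used to extract features useful for downstream tasks: for example, if you have a dataset of labeled sentences, you can train a standard classifier using the features produced by the XLM-RoBERTa-XL model as inputs.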

Intended uses & limitations

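You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions for the task that interests you.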

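Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should look at models like GPT2.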

Usage

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='facebook/xlm-roberta-xxl')
>>> unmasker("Europe is a <mask> continent.")

[{'score': 0.22996895015239716,
  'token': 28811,
  'token_str': 'European',
  'sequence': 'Europe is a European continent.'},
 {'score': 0.14307449758052826,
  'token': 21334,
  'token_str': 'large',
  'sequence': 'Europe is a large continent.'},
 {'score': 0.12239163368940353,
  'token': 19336,
  'token_str': 'small',
  'sequence': 'Europe is a small continent.'},
 {'score': 0.07025063782930374,
  'token': 18410,
  'token_str': 'vast',
  'sequence': 'Europe is a vast continent.'},
 {'score': 0.032869212329387665,
  'token': 6957,
  'token_str': 'big',
  'sequence': 'Europe is a big continent.'}]

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('facebook/xlm-roberta-xxl')
model = AutoModelForMaskedLM.from_pretrained("facebook/xlm-roberta-xxl")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
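
To feed a standard classifier, the per-token vectors usually have to be reduced to one fixed-size vector per sentence. One common choice, shown here as a hedged sketch and not as a method prescribed by the authors, is mean pooling over non-padding tokens. The example uses plain Python lists with made-up numbers in place of real model outputs, and `mean_pool` is a hypothetical helper name. It also assumes per-token hidden states are available; a masked-LM head returns vocabulary logits by default, so hidden states would have to be requested explicitly.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy stand-in for model outputs: 3 token positions, 2 dimensions, last one is padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(mean_pool(toy_vectors, toy_mask))
```

The resulting vector has the same length for every sentence, so it can be passed to any standard classifier as a feature row.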

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2105-00572,
  author    = {Naman Goyal and
               Jingfei Du and
               Myle Ott and
               Giri Anantharaman and
               Alexis Conneau},
  title     = {Larger-Scale Transformers for Multilingual Masked Language Modeling},
  journal   = {CoRR},
  volume    = {abs/2105.00572},
  year      = {2021},
  url       = {https://arxiv.org/abs/2105.00572},
  eprinttype = {arXiv},
  eprint    = {2105.00572},
  timestamp = {Wed, 12 May 2021 15:54:31 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2105-00572.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}