bert-base-romanian-cased-v1

The BERT base, cased model for Romanian, trained on a 15 GB corpus, version v1.0.

How to use

from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Always sanitize your text! Replace the cedilla forms of s and t with their comma-below counterparts using:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

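This is necessary because the model was not trained on cedilla s and t. If you skip this step, performance will drop due to <UNK> tokens and an increased number of tokens per word.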
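As a minimal sketch, the same replacement can be wrapped in a reusable helper that is applied before tokenization. The function name `sanitize` and the constant `_CEDILLA_TO_COMMA` are hypothetical and not part of the model or of transformers; the code points are written as escapes so the four mappings are unambiguous.

```python
# Translation table from cedilla letters to the comma-below letters
# the model was trained on (same four mappings as the replace() chain above).
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t with cedilla -> t with comma below (lowercase)
    "\u015f": "\u0219",  # s with cedilla -> s with comma below (lowercase)
    "\u0162": "\u021a",  # T with cedilla -> T with comma below (uppercase)
    "\u015e": "\u0218",  # S with cedilla -> S with comma below (uppercase)
})


def sanitize(text: str) -> str:
    """Hypothetical helper: normalize cedilla s/t to comma-below s/t."""
    return text.translate(_CEDILLA_TO_COMMA)


# Text without cedilla letters passes through unchanged.
print(sanitize("Acesta este un test."))
```

Calling `sanitize` on every input string before `tokenizer.encode` keeps the cleaning step in one place.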

Evaluation

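Evaluation is performed on Universal Dependencies Romanian RRT for UPOS, XPOS and LAS, and on a named entity recognition (NER) task based on RONEC. Details, as well as more in-depth tests not shown here, are given on the dedicated evaluation page.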

The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, since at the time of writing it was the only available BERT model that works on Romanian.

Model	UPOS	XPOS	NER	LAS
bert-base-multilingual-cased	97.87	96.16	84.13	88.04
bert-base-romanian-cased-v1	98.00	96.46	85.88	89.69
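To make the margin over the baseline explicit, the short sketch below subtracts the multilingual scores from the Romanian model's scores. It only reuses the numbers from the evaluation table; the variable names are illustrative.

```python
# Scores copied from the evaluation table (multilingual baseline vs. this model).
baseline = {"UPOS": 97.87, "XPOS": 96.16, "NER": 84.13, "LAS": 88.04}
romanian = {"UPOS": 98.00, "XPOS": 96.46, "NER": 85.88, "LAS": 89.69}

# Absolute gain in points per task, rounded to the table's two-decimal precision.
gains = {task: round(romanian[task] - baseline[task], 2) for task in baseline}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

The largest gains are on NER and LAS, while the improvements on UPOS and XPOS are smaller.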

Corpus

The model is trained on the following corpora (the statistics in the table below are computed after cleaning):

Corpus	Lines (M)	Words (M)	Chars (B)	Size (GB)
OPUS	55.05	635.04	4.045	3.8
OSCAR	33.56	1725.82	11.411	11
Wikipedia	1.54	60.47	0.411	0.4
Total	90.15	2421.33	15.867	15.2

Citation

If you use this model in a research paper, please cite the following paper:

Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.

or, in BibTeX:

@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan  and
      Avram, Andrei-Marius  and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}

Acknowledgements

We would like to thank Sampo Pyysalo from TurkuNLP for providing the compute needed to pretrain the v1.0 BERT model. He's awesome!