GreekBERT

A Greek version of the BERT pre-trained language model.

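Pre-training corpora

The pre-training corpora of bert-base-greek-uncased-v1 include:

The Greek part of Wikipedia,
The Greek part of the European Parliament Proceedings Parallel Corpus, and
The Greek part of OSCAR, a cleansed version of Common Crawl.

Future releases will also include:

The entire corpus of Greek legislation, as published by the National Publication Office, and
The entire corpus of EU legislation (Greek translation), as published in Eur-Lex.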

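Pre-training details

We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert).* We then used Hugging Face's Transformers conversion script to convert the TF checkpoint and vocabulary into the desired format, so that both PyTorch and TF2 users can load the model in two lines of code.
We released a model similar to the English bert-base-uncased model (12 layers, 768 hidden units, 12 attention heads, 110M parameters).
We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
We were able to use a single Google Cloud TPU v3-8, provided free of charge by the TensorFlow Research Cloud (TFRC), while also utilizing GCP research credits. Huge thanks to both Google programs for supporting us!

* You can still access the original TensorFlow checkpoints from this Google Drive folder.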
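To give a sense of scale, the training settings above can be turned into a rough token budget. The sketch below only multiplies the three quoted numbers; it assumes every sequence in every step is filled to the full length of 512, so the result is an upper bound on token positions processed, not a count of distinct corpus tokens. The function and variable names are illustrative.

```python
# Training settings quoted in the text above.
TRAINING_STEPS = 1_000_000
BATCH_SIZE = 256          # sequences per step
MAX_SEQ_LENGTH = 512      # tokens per sequence

def token_positions_processed(steps, batch_size, seq_length):
    """Upper bound on token positions seen, assuming every sequence is full length."""
    return steps * batch_size * seq_length

tokens_per_step = BATCH_SIZE * MAX_SEQ_LENGTH
total_positions = token_positions_processed(TRAINING_STEPS, BATCH_SIZE, MAX_SEQ_LENGTH)

print(f"tokens per step: {tokens_per_step:,}")        # 131,072
print(f"total token positions: {total_positions:,}")  # 131,072,000,000
```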

Requirements

We published bert-base-greek-uncased-v1 as part of Hugging Face's Transformers repository. So, you need to install the transformers library through pip, along with PyTorch or TensorFlow 2.

pip install transformers
pip install (torch|tensorflow)
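Before loading the model, it can help to confirm that the library and at least one backend are actually installed. The following is a minimal sketch using only the Python standard library; the helper name `installed_backends` is hypothetical, and the check only tests whether each package can be located, not whether its version is compatible.

```python
import importlib.util

def installed_backends():
    """Report which of the relevant packages can be located in this environment."""
    packages = ("transformers", "torch", "tensorflow")
    return {name: importlib.util.find_spec(name) is not None for name in packages}

status = installed_backends()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")

if not status["transformers"]:
    print("Install the library first: pip install transformers")
elif not (status["torch"] or status["tensorflow"]):
    print("Install a backend: pip install torch  (or: pip install tensorflow)")
```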

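Pre-process text (Deaccent - Lower)

NOTICE: Preprocessing is now natively supported by the default tokenizer. There is no need to include the following code.

To use bert-base-greek-uncased-v1, texts must be pre-processed to lowercase letters with all Greek diacritics removed.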


import unicodedata

def strip_accents_and_lowercase(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
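The function above removes every combining mark, so it strips the diaeresis (dialytika) as well as the accent (tonos), and it leaves Latin text lowercased but otherwise intact. The short sketch below repeats the same function so it runs on its own; the sample words are chosen here for illustration and are not taken from the original example.

```python
import unicodedata

def strip_accents_and_lowercase(s):
    # Same approach as above: decompose, drop combining marks (category Mn), lowercase.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn').lower()

samples = ["Αθήνα", "προϊόν", "BERT"]
cleaned = [strip_accents_and_lowercase(s) for s in samples]
print(cleaned)  # ['αθηνα', 'προιον', 'bert']

# Applying the function a second time changes nothing further.
assert cleaned == [strip_accents_and_lowercase(s) for s in cleaned]
```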

Load Pretrained Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

Use Pretrained Model as a Language Model

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
# (AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead)
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
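The examples above hard-code the position of each [MASK] token (5, 3 and 8). A small helper can locate those positions from the token list instead. This is a minimal sketch in plain Python; the name `mask_positions` is hypothetical and not part of the transformers library.

```python
def mask_positions(tokens, mask_token="[MASK]"):
    """Return the index of every mask token in a list of token strings."""
    return [i for i, tok in enumerate(tokens) if tok == mask_token]

# Token list copied from Example 3 above.
tokens_3 = ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και',
            'κανει', 'συχνα', '[MASK]', '.', '[SEP]']

positions = mask_positions(tokens_3)
print(positions)  # [3, 8]
```

With the objects from the examples, calling `mask_positions(tokenizer_greek.convert_ids_to_tokens(input_ids))` would give the indices to use in place of the literal numbers in `outputs[0, 8]`.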

Evaluation on downstream tasks

For detailed results, read the article:

GREEK-BERT: The Greeks visiting Sesame Street. John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion Androutsopoulos. In the Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Held Online. 2020. (https://arxiv.org/abs/2008.12014)

Named Entity Recognition with Greek NER dataset

Model name	Micro F1
BILSTM-CNN-CRF (Ma and Hovy, 2016)	76.4 ± 2.07
M-BERT-UNCASED (Devlin et al., 2019)	81.5 ± 1.77
M-BERT-CASED (Devlin et al., 2019)	82.1 ± 1.35
XLM-R (Conneau et al., 2020)	84.8 ± 1.50
GREEK-BERT (ours)	85.7 ± 1.00

Natural Language Inference with XNLI

Model name	Accuracy
DAM (Parikh et al., 2016)	68.5 ± 1.71
M-BERT-UNCASED (Devlin et al., 2019)	73.9 ± 0.64
M-BERT-CASED (Devlin et al., 2019)	73.5 ± 0.49
XLM-R (Conneau et al., 2020)	77.3 ± 0.41
GREEK-BERT (ours)	78.6 ± 0.62

Author

The model was officially released with the article "GREEK-BERT: The Greeks visiting Sesame Street. John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion Androutsopoulos. In the Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Held Online. 2020" (https://arxiv.org/abs/2008.12014).

If you use the model, please cite the following:

@inproceedings{greek-bert,
author = {Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion},
title = {GREEK-BERT: The Greeks Visiting Sesame Street},
year = {2020},
isbn = {9781450388788},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3411408.3411440},
booktitle = {11th Hellenic Conference on Artificial Intelligence},
pages = {110–117},
numpages = {8},
location = {Athens, Greece},
series = {SETN 2020}
}

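About Us

AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
natural language generation from databases and ontologies, especially Semantic Web ontologies,
text classification, including filtering spam and abusive content,
information extraction and opinion mining, including legal text analytics and sentiment analysis,
natural language processing tools for Greek, for example parsers and named-entity recognizers,
machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.

Ilias Chalkidis on behalf of AUEB's Natural Language Processing Group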

| Github: @ilias.chalkidis | Twitter: @KiddoThe2B |