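A Greek version of the BERT pre-trained language model.
The pre-training corpora of bert-base-greek-uncased-v1 include:
Future releases will also include:
The released model is similar to the bert-base-uncased model (12 layers, 768 hidden units, 12 attention heads, 110M parameters).
* You can still access the original TensorFlow checkpoints from this Google Drive folder.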
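As a rough sanity check on the "110M parameters" figure, the sketch below counts the weights of a BERT-base encoder from its dimensions. It is a back-of-the-envelope calculation, not taken from the model's code: the function name `bert_param_count` is hypothetical, and the vocabulary size (30,522, the English bert-base-uncased vocabulary), the 512 positions, the 2 token types, and the 3,072-unit feed-forward layer are assumptions. The Greek model's vocabulary may differ, which would shift the total slightly.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, ffn=3072,
                     max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder (with pooler).

    All defaults are assumptions matching the usual BERT-base layout.
    The number of attention heads does not change the count, because
    the heads only split the hidden dimension.
    """
    layer_norm = 2 * hidden  # scale + bias
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), followed by a LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] representation.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Assumed vocabulary size: that of the English bert-base-uncased model.
total = bert_param_count(vocab_size=30522)
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the total rounds to the quoted 110M.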
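We published bert-base-greek-uncased-v1 as part of Hugging Face's Transformers repository, so you need to install the transformers library through pip, along with PyTorch or TensorFlow 2.
pip install transformers
pip install torch  # or: pip install tensorflow
NOTICE: Preprocessing is now natively supported by the default tokenizer. You no longer need to include the following code.
bert-base-greek-uncased-v1 expects text that has been lowercased and stripped of all Greek diacritics. The function below performs this pre-processing manually: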
import unicodedata

def strip_accents_and_lowercase(s):
    # Decompose characters (NFD), drop combining marks (category 'Mn'), then lowercase
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)
print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
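To see why the function above removes diacritics, it may help to look at what NFD normalization does to a single accented letter. The sketch below is illustrative only (the helper name `describe` is hypothetical): it shows that an accented Greek letter decomposes into a base letter plus one or more combining marks of Unicode category `Mn`, which are exactly the characters the filter drops.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode category) for each character of the NFD form."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c))
            for c in unicodedata.normalize("NFD", ch)]


# U+03AE is GREEK SMALL LETTER ETA WITH TONOS.
print(describe("\u03ae"))  # [('U+03B7', 'Ll'), ('U+0301', 'Mn')]

# U+0390 is iota with dialytika and tonos: one base letter, two combining marks.
print(describe("\u0390"))  # [('U+03B9', 'Ll'), ('U+0308', 'Mn'), ('U+0301', 'Mn')]
```

Only the `Ll` (lowercase letter) entries survive the `!= 'Mn'` filter, so both inputs reduce to their unaccented base letters.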
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
# Load model and tokenizer
# (AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for BERT-style models)
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"
# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"
# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"

For detailed results, read the article:
GREEK-BERT: The Greeks visiting Sesame Street. John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion Androutsopoulos. In the Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Held Online. 2020. (https://arxiv.org/abs/2008.12014)
| Model name | Micro F1 |
|---|---|
| BILSTM-CNN-CRF (Ma and Hovy, 2016) | 76.4 ± 2.07 |
| M-BERT-UNCASED (Devlin et al., 2019) | 81.5 ± 1.77 |
| M-BERT-CASED (Devlin et al., 2019) | 82.1 ± 1.35 |
| XLM-R (Conneau et al., 2020) | 84.8 ± 1.50 |
| GREEK-BERT (ours) | 85.7 ± 1.00 |
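For readers unfamiliar with the metric in the table above, micro-averaged F1 pools true positives, false positives and false negatives across all labels before computing precision and recall. The sketch below is a minimal illustration, not the evaluation code behind the reported numbers: the function name `micro_f1`, the label names and the counts are all hypothetical.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-label (true pos, false pos, false neg) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not from any real dataset).
example_counts = {
    "PERSON": (8, 2, 1),
    "ORG": (5, 1, 3),
    "LOC": (7, 1, 0),
}
score = micro_f1(example_counts)
print(f"{100 * score:.1f}")  # 83.3
```

Because the counts are pooled, frequent labels weigh more heavily in the final score than rare ones.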
| Model name | Accuracy |
|---|---|
| DAM (Parikh et al., 2016) | 68.5 ± 1.71 |
| M-BERT-UNCASED (Devlin et al., 2019) | 73.9 ± 0.64 |
| M-BERT-CASED (Devlin et al., 2019) | 73.5 ± 0.49 |
| XLM-R (Conneau et al., 2020) | 77.3 ± 0.41 |
| GREEK-BERT (ours) | 78.6 ± 0.62 |
The model was officially released with the article "GREEK-BERT: The Greeks visiting Sesame Street. John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion Androutsopoulos. In the Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020). Held Online. 2020" (https://arxiv.org/abs/2008.12014).
If you use the model, please cite the following:
@inproceedings{greek-bert,
author = {Koutsikakis, John and Chalkidis, Ilias and Malakasiotis, Prodromos and Androutsopoulos, Ion},
title = {GREEK-BERT: The Greeks Visiting Sesame Street},
year = {2020},
isbn = {9781450388788},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3411408.3411440},
booktitle = {11th Hellenic Conference on Artificial Intelligence},
pages = {110–117},
numpages = {8},
location = {Athens, Greece},
series = {SETN 2020}
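}

AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language texts.
The group's current research interests include:
The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.
Ilias Chalkidis on behalf of AUEB's Natural Language Processing Group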
| Github: @ilias.chalkidis | Twitter: @KiddoThe2B |