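indobert-base-uncased: an Indonesian version of the BERT model for Indonesian natural language processing tasks such as text classification. It was trained on over 220 million words, achieves strong results on tasks including part-of-speech tagging and named entity recognition, and supports NPU hardware.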

About

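IndoBERT is the Indonesian version of the BERT model. We trained the model on over 220 million words, aggregated from three main sources:

Indonesian Wikipedia (74 million words)
News articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55 million words in total)
An Indonesian Web Corpus (Medved and Suchomel, 2017) (90 million words)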

We trained the model for 2.4 million steps (180 epochs). The final perplexity on the development set was 3.97, which is comparable to English BERT-base.
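
To make the perplexity figure concrete, the sketch below uses the standard definition: perplexity is the exponential of the mean negative log-likelihood per predicted token. The source does not describe how the evaluation was implemented, so the function name `perplexity` and the toy inputs are illustrative assumptions, not the authors' evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs.

    `token_log_probs` holds the log-probability the model assigned to each
    correct token (for a masked LM, typically only the masked positions).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every correct token probability 1/4
# has a perplexity of 4.
toy_ppl = perplexity([math.log(0.25)] * 4)

# Read in reverse, the reported perplexity of 3.97 corresponds to a mean
# cross-entropy loss of ln(3.97), roughly 1.38 nats per predicted token.
reported_mean_loss = math.log(3.97)
```

Under this definition, a perplexity of 3.97 means the model is on average about as uncertain as if it were choosing uniformly among roughly four candidate tokens at each predicted position.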

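This IndoBERT was used to examine IndoLEM, an Indonesian benchmark comprising seven tasks for the Indonesian language that span morpho-syntax, semantics, and discourse.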

Task	Metric	Bi-LSTM	mBERT	MalayBERT	IndoBERT
POS Tagging	Accuracy	95.4	96.8	96.8	96.8
NER UGM	F1	70.9	71.6	73.2	74.9
NER UI	F1	82.2	82.2	87.4	90.1
Dependency Parsing (UD-Indo-GSD)	UAS/LAS	85.25/80.35	86.85/81.78	86.99/81.87	87.12/82.32
Dependency Parsing (UD-Indo-PUD)	UAS/LAS	84.04/79.01	90.58/85.44	88.91/83.56	89.23/83.95
Sentiment Analysis	F1	71.62	76.58	82.02	84.13
Summarization	R1/R2/RL	67.96/61.65/67.24	68.40/61.66/67.67	68.44/61.38/67.71	69.93/62.86/69.21
Next Tweet Prediction	Accuracy	73.6	92.4	93.1	93.7
Tweet Ordering	Spearman correlation	0.45	0.53	0.51	0.59
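
The tweet ordering row reports a Spearman rank correlation between a predicted ordering and the gold ordering. As a rough illustration of what that number measures, the sketch below implements the textbook formula for rankings without ties. The function name `spearman_rho` and the example orderings are hypothetical, and the benchmark's own scoring script may differ in detail.

```python
def spearman_rho(gold_ranks, predicted_ranks):
    """Spearman rank correlation for two rankings of the same items, no ties.

    Uses rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    difference between the gold and predicted rank of item i.
    """
    n = len(gold_ranks)
    if n != len(predicted_ranks) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    sum_sq_diff = sum((g - p) ** 2 for g, p in zip(gold_ranks, predicted_ranks))
    return 1 - 6 * sum_sq_diff / (n * (n ** 2 - 1))


# Hypothetical thread of five tweets: gold positions vs. predicted positions.
gold = [0, 1, 2, 3, 4]
swapped = [0, 2, 1, 3, 4]    # two adjacent tweets swapped
reversed_ = [4, 3, 2, 1, 0]  # thread predicted in reverse order

rho_swapped = spearman_rho(gold, swapped)
rho_reversed = spearman_rho(gold, reversed_)
```

A value of 1 means the predicted order matches the gold order exactly, 0 means no rank agreement, and -1 means the order is fully reversed.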

The paper was published at the 28th COLING in 2020. For more details about the benchmarks, see https://indolem.github.io.

How to use

Load the model and tokenizer (tested with transformers==3.5.1):

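from transformers import AutoTokenizer, AutoModel
import torch
import torch_npu  # registers the "npu" device type with PyTorch (Ascend NPU)

# Run on the first NPU device
device = torch.device('npu:0')
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
# Load the pretrained weights and move the model to the NPU
model = AutoModel.from_pretrained("indolem/indobert-base-uncased").to(device)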
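
Once the model is loaded, a typical next step is to encode a sentence into contextual embeddings. The following is a minimal sketch rather than part of the original instructions: the helper names `pick_device_name` and `encode_sentence` are hypothetical, and the CPU fallback is an assumption added so the example can also run on machines without an NPU.

```python
import importlib.util

MODEL_NAME = "indolem/indobert-base-uncased"


def pick_device_name():
    """Return "npu:0" if torch and torch_npu are installed, else "cpu".

    Assumption: this only checks that the packages can be found; it does
    not verify that an NPU device is actually present.
    """
    has_torch = importlib.util.find_spec("torch") is not None
    has_npu = importlib.util.find_spec("torch_npu") is not None
    return "npu:0" if (has_torch and has_npu) else "cpu"


def encode_sentence(text, device_name=None):
    """Return the last hidden states for `text` as a tensor.

    Imports are done lazily so the module can be loaded without the
    heavy dependencies installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    device_name = device_name or pick_device_name()
    if device_name.startswith("npu"):
        import torch_npu  # noqa: F401  (needed before using an "npu" device)

    device = torch.device(device_name)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).to(device)
    model.eval()  # disable dropout for inference

    # Tokenize and move the input tensors to the same device as the model.
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Index 0 is the last hidden state for both tuple and ModelOutput returns.
    return outputs[0]
```

For example, `encode_sentence("Saya suka membaca buku.")` (an illustrative Indonesian sentence) should return a tensor of shape (1, sequence_length, hidden_size), where the hidden size is expected to be 768 for a BERT-base architecture.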

Citation

If you use our work, please cite:

@inproceedings{koto2020indolem,
  title={IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP},
  author={Fajri Koto and Afshin Rahimi and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 28th COLING},
  year={2020}
}
