japanese-gpt2-small

rinna-icon

本仓库提供了一个小尺寸的日语GPT-2模型。该模型是使用rinna Co., Ltd.在Github仓库rinnakk/japanese-pretrained-models中的代码进行训练的。

模型使用方法

# coding = utf-8
import torch
import torch_npu
from transformers import T5Tokenizer, RobertaForMaskedLM

import argparse
from openmind import pipeline, is_torch_npu_available
parser = argparse.ArgumentParser(description='manual to this script')
parser.add_argument("--model_name_or_path", type=str, default="./")
args = parser.parse_args()
model_path = args.model_name_or_path
device = None
if is_torch_npu_available():
    device = "npu:0"
else:
    device = "cpu"

# load tokenizer
tokenizer = T5Tokenizer.from_pretrained(model_path)
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

# load model
model = RobertaForMaskedLM.from_pretrained(model_path)
model = model.to(device)
model = model.eval()

# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']']

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
token_tensor = torch.LongTensor([token_ids])

# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)  # output: [0, 1, 2, 3, 4, 5, 6, 7, 8]
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor.to(device), position_ids=position_id_tensor.to(device))
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

模型架构

一个基于 Transformer 的语言模型，包含 12 层，隐藏层大小为 768。

训练

该模型在 Japanese CC-100 和 Japanese Wikipedia 数据集上进行训练，以优化传统语言建模目标。训练使用 8 块 V100 GPU，耗时约 15 天。在从 CC-100 中选取的验证集上，模型的困惑度（perplexity）约为 21。

分词

该模型使用基于 sentencepiece 的分词器，其词汇表是使用官方 sentencepiece 训练脚本在 Japanese Wikipedia 上训练得到的。

引用方式

@misc{rinna-japanese-gpt2-small,
    title = {rinna/japanese-gpt2-small},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://https://hf-mirror.com/rinna/japanese-gpt2-small}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

许可证

MIT 许可证

模型使用方法

# coding = utf-8
import torch
import torch_npu
from transformers import T5Tokenizer, RobertaForMaskedLM

import argparse
from openmind import pipeline, is_torch_npu_available
parser = argparse.ArgumentParser(description='manual to this script')
parser.add_argument("--model_name_or_path", type=str, default="./")
args = parser.parse_args()
model_path = args.model_name_or_path
device = None
if is_torch_npu_available():
    device = "npu:0"
else:
    device = "cpu"

# load tokenizer
tokenizer = T5Tokenizer.from_pretrained(model_path)
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

# load model
model = RobertaForMaskedLM.from_pretrained(model_path)
model = model.to(device)
model = model.eval()

# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']']

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
token_tensor = torch.LongTensor([token_ids])

# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)  # output: [0, 1, 2, 3, 4, 5, 6, 7, 8]
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor.to(device), position_ids=position_id_tensor.to(device))
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

引用方式

@misc{rinna-japanese-gpt2-small,
    title = {rinna/japanese-gpt2-small},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://https://hf-mirror.com/rinna/japanese-gpt2-small}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}