HuggingFace Mirror/japanese-gpt2-small-openmind

japanese-gpt2-small

This repository provides a small-sized Japanese GPT-2 model. The model was trained by rinna Co., Ltd. using code from the rinnakk/japanese-pretrained-models GitHub repository.

How to use the model

import argparse

import torch
import openmind
from openmind import AutoTokenizer, is_torch_npu_available

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/japanese-gpt2-small-openmind",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

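    # Prefer an Ascend NPU when one is available. On CPU, fall back to
    # float32 because float16 inference may be unsupported or slow there.
    if is_torch_npu_available():
        device = "npu:0"
        dtype = torch.float16
    else:
        device = "cpu"
        dtype = torch.float32

    # The tokenizer is loaded only to look up the end-of-sequence token id.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    pipeline = openmind.pipeline(
        "text-generation",
        model=model_path,
        torch_dtype=dtype,
        device=device,
    )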

    sequences = pipeline(
        '簡単にサッカー日本代表を紹介します',
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id,
        max_length=500,
    )
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
    
if __name__ == "__main__":
    main()
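
The generation call above sets `top_k=10` and `repetition_penalty=1.5`. As a rough illustration of what these two knobs do, the following sketch applies them to a toy list of logits in plain Python. It is not the library's implementation: the helper names `apply_repetition_penalty` and `top_k_sample` are hypothetical, and the penalty rule shown (divide positive scores, multiply negative ones) is assumed to mirror the commonly used CTRL-style rule.

```python
import math
import random


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Down-weight tokens that were already generated (assumed CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive scores are divided and negative ones multiplied, so the
        # penalised score always moves toward "less likely".
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Unnormalised softmax weights; random.choices normalises them itself.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # toy scores for a 5-token vocabulary
penalised = apply_repetition_penalty(logits, generated_ids=[0, 3], penalty=1.5)
next_id = top_k_sample(penalised, k=2, rng=random.Random(0))
print(penalised, next_id)
```

A smaller `top_k` restricts sampling to fewer candidates and makes output more conservative, while a `repetition_penalty` above 1 discourages the model from looping on tokens it has already produced.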

Model architecture

A 12-layer, 768-hidden-size Transformer-based language model.
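
To give a sense of scale, the sketch below estimates the parameter count of a GPT-2-style decoder from the two figures stated above (12 layers, hidden size 768). The vocabulary size of 32000 and the 1024 position embeddings are assumptions that this model card does not state, and the function `gpt2_param_count` is a hypothetical helper, so treat the result as an approximation.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings."""
    # Token and position embedding tables.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Fused QKV projection plus output projection, each with a bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two-layer feed-forward block with a 4x expansion, each layer with a bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block (scale and shift), plus one final LayerNorm.
    layer_norms = 2 * (2 * d_model)
    final_norm = 2 * d_model
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


# ASSUMPTION: vocab_size=32000 and n_positions=1024 are not given in this card.
estimate = gpt2_param_count(n_layer=12, d_model=768, vocab_size=32000, n_positions=1024)
print(f"approx. {estimate / 1e6:.1f}M parameters")
```

Most of the total comes from the 12 Transformer blocks; the embedding table is the only part that depends on the assumed vocabulary size.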

Training

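The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a traditional language modelling objective. Training took around 15 days on 8 V100 GPUs. The model reaches a perplexity of around 21 on a validation set chosen from CC-100.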
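
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows the calculation on made-up log-probabilities; the `perplexity` helper and the toy input are illustrative only and are not taken from the model's actual evaluation.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that assigns probability 1/21 to every token behaves
# like a uniform choice among 21 options, i.e. a perplexity of about 21.
uniform = [math.log(1 / 21)] * 100
print(round(perplexity(uniform), 6))
```

Read this way, a perplexity of around 21 means the model is, on average, about as uncertain as if it were choosing uniformly among 21 tokens at each step.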

Tokenization

The model uses a sentencepiece-based tokenizer. Its vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.

How to cite

@misc{rinna-japanese-gpt2-small,
    title = {rinna/japanese-gpt2-small},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-gpt2-small}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

License

The MIT license