afro-xlmr-base

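AfroXLMR-base was created by MLM adaptation of the XLM-R-base model on 17 African languages (Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Oromo, Nigerian Pidgin, Kinyarwanda, Kirundi, Shona, Somali, Sesotho, Swahili, isiXhosa, Yoruba, and isiZulu) that cover the major African language families, together with 3 high-resource languages (Arabic, French, and English).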
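MLM adaptation means continuing to train the model on its masked-language-modelling objective: some input tokens are hidden and the model learns to recover them. The sketch below illustrates only the corruption step, in plain Python. The 15% masking rate and the 80/10/10 split (mask token / random token / unchanged) are the commonly used BERT-style defaults and are an assumption here, not something this card states. `mask_tokens`, the toy vocabulary, and the whitespace tokenisation are hypothetical; the real model operates on SentencePiece subwords.

```python
import random

MASK = "<mask>"  # mask token string, assumed here for illustration


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt one token sequence for MLM training (illustrative only).

    Returns (corrupted, labels). labels[i] is the original token at the
    positions the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with mask
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


tokens = "the model is adapted on monolingual text".split()
toy_vocab = ["text", "model", "language", "data"]
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.3)
print(corrupted)
print(labels)
```

The loss would be computed only at positions where `labels` is not `None`; all other positions are ignored.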

Usage in Openmind

import argparse
import time

import torch
from openmind import pipeline, AutoTokenizer, is_torch_npu_available

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/afro-xlmr-base-openmind",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

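    # Prefer an NPU when one is available; otherwise fall back to CPU.
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    # The timer covers tokenizer and model loading as well as the forward pass.
    start_time = time.time()

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    pipe = pipeline('fill-mask', model=model_path, torch_dtype=torch.bfloat16, device_map=device)
    # Take the mask token from the tokenizer instead of hard-coding it.
    MASK_TOKEN = tokenizer.mask_token
    result = pipe("Hello I'm a {} model.".format(MASK_TOKEN))
    print(result)

    end_time = time.time()
    print(f"Hardware: {device}, load + inference time: {end_time - start_time:.2f} s")
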
if __name__ == "__main__":
    main()
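The script above prints the raw pipeline result. Assuming the Openmind pipeline returns the same structure as the Hugging Face Transformers fill-mask pipeline (a list of dicts with `score`, `token`, `token_str`, and `sequence` keys), a small helper such as the hypothetical `top_predictions` below could reduce it to the most likely fillers. The sample data is made up purely to show the shape; it is not real model output.

```python
def top_predictions(result, k=3):
    """Return the k highest-scoring (token_str, score) pairs."""
    ranked = sorted(result, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Made-up values shaped like a fill-mask result (not real model output).
sample = [
    {"score": 0.10, "token": 11, "token_str": "new",
     "sequence": "Hello I'm a new model."},
    {"score": 0.25, "token": 12, "token_str": "fashion",
     "sequence": "Hello I'm a fashion model."},
    {"score": 0.05, "token": 13, "token_str": "good",
     "sequence": "Hello I'm a good model."},
]

for token_str, score in top_predictions(sample, k=2):
    print(f"{token_str}\t{score}")
```

In the script, `top_predictions(result)` could be called in place of `print(result)` if the assumption about the output format holds.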

MasakhaNER evaluation results (F-score)

Language	XLM-R-miniLM	XLM-R-base	XLM-R-large	afro-xlmr-base	afro-xlmr-small	afro-xlmr-mini
Amharic	69.5	70.6	76.2	76.1	70.1	69.7
Hausa	74.5	89.5	90.5	91.2	91.4	87.7
Igbo	81.9	84.8	84.1	87.4	86.6	83.5
Kinyarwanda	68.6	73.3	73.8	78.0	77.5	74.1
Luganda	64.7	79.7	81.6	82.9	83.2	77.4
Luo	11.7	74.9	73.6	75.1	75.4	17.5
Nigerian Pidgin	83.2	87.3	89.0	89.6	89.0	85.5
Swahili	86.3	87.4	89.4	88.6	88.7	86.0
Wolof	51.7	63.9	67.9	67.4	65.9	59.0
Yoruba	72.0	78.3	78.9	82.1	81.3	75.1
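To compare the models at a glance, the per-language scores can be collapsed into one number per model. The sketch below uses an unweighted mean over the ten languages. That summary is a choice made here for illustration and is not a figure reported in the table; `average_by_model` is a hypothetical helper.

```python
from statistics import mean

MODELS = ["XLM-R-miniLM", "XLM-R-base", "XLM-R-large",
          "afro-xlmr-base", "afro-xlmr-small", "afro-xlmr-mini"]

# F-scores copied from the table above, one row per language,
# in the same column order as MODELS.
SCORES = {
    "Amharic":         [69.5, 70.6, 76.2, 76.1, 70.1, 69.7],
    "Hausa":           [74.5, 89.5, 90.5, 91.2, 91.4, 87.7],
    "Igbo":            [81.9, 84.8, 84.1, 87.4, 86.6, 83.5],
    "Kinyarwanda":     [68.6, 73.3, 73.8, 78.0, 77.5, 74.1],
    "Luganda":         [64.7, 79.7, 81.6, 82.9, 83.2, 77.4],
    "Luo":             [11.7, 74.9, 73.6, 75.1, 75.4, 17.5],
    "Nigerian Pidgin": [83.2, 87.3, 89.0, 89.6, 89.0, 85.5],
    "Swahili":         [86.3, 87.4, 89.4, 88.6, 88.7, 86.0],
    "Wolof":           [51.7, 63.9, 67.9, 67.4, 65.9, 59.0],
    "Yoruba":          [72.0, 78.3, 78.9, 82.1, 81.3, 75.1],
}


def average_by_model(scores, models):
    """Return {model: unweighted mean F-score across all languages}."""
    columns = zip(*scores.values())  # transpose rows into per-model columns
    return {model: mean(col) for model, col in zip(models, columns)}


averages = average_by_model(SCORES, MODELS)
for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:16s} {avg:.2f}")
```

An unweighted mean treats every language equally, so a single outlier such as the Luo column for the two MiniLM-based models pulls their averages down sharply.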

BibTeX entry and citation info

@inproceedings{alabi-etal-2022-adapting,
    title = "Adapting Pre-trained Language Models to {A}frican Languages via Multilingual Adaptive Fine-Tuning",
    author = "Alabi, Jesujoba O.  and
      Adelani, David Ifeoluwa  and
      Mosbach, Marius  and
      Klakow, Dietrich",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.382",
    pages = "4336--4349",
    abstract = "Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) {---} fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning on 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that corresponds to non-African writing scripts before MAFT, thus reducing the model size by around 50{\%}. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter efficient fine-tuning methods.",
}
