
XGLM-564M

XGLM-564M is a multilingual autoregressive language model with 564 million parameters, trained on a balanced corpus covering 30 languages and totaling 500 billion sub-tokens. It was introduced in the paper "Few-shot Learning with Multilingual Language Models" by Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li* (*equal contribution). The original implementation was released in this repository.

Training Data Statistics

The training data statistics of XGLM-564M are shown in the table below.

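| ISO-639-1 | Family | Name | # Tokens | Ratio | Ratio w/ low-resource upsampling |
|---|---|---|---|---|---|
| en | Indo-European | English | 803526736124 | 0.489906 | 0.3259 |
| ru | Indo-European | Russian | 147791898098 | 0.0901079 | 0.0602 |
| zh | Sino-Tibetan | Chinese | 132770494630 | 0.0809494 | 0.0483 |
| de | Indo-European | German | 89223707856 | 0.0543992 | 0.0363 |
| es | Indo-European | Spanish | 87303083105 | 0.0532282 | 0.0353 |
| fr | Indo-European | French | 77419639775 | 0.0472023 | 0.0313 |
| ja | Japonic | Japanese | 66054364513 | 0.040273 | 0.0269 |
| it | Indo-European | Italian | 41930465338 | 0.0255648 | 0.0171 |
| pt | Indo-European | Portuguese | 36586032444 | 0.0223063 | 0.0297 |
| el | Indo-European | Greek (modern) | 28762166159 | 0.0175361 | 0.0233 |
| ko | Koreanic | Korean | 20002244535 | 0.0121953 | 0.0811 |
| fi | Uralic | Finnish | 16804309722 | 0.0102455 | 0.0681 |
| id | Austronesian | Indonesian | 15423541953 | 0.00940365 | 0.0125 |
| tr | Turkic | Turkish | 12413166065 | 0.00756824 | 0.0101 |
| ar | Afro-Asiatic | Arabic | 12248607345 | 0.00746791 | 0.0099 |
| vi | Austroasiatic | Vietnamese | 11199121869 | 0.00682804 | 0.0091 |
| th | Tai-Kadai | Thai | 10842172807 | 0.00661041 | 0.044 |
| bg | Indo-European | Bulgarian | 9703797869 | 0.00591635 | 0.0393 |
| ca | Indo-European | Catalan | 7075834775 | 0.0043141 | 0.0287 |
| hi | Indo-European | Hindi | 3448390110 | 0.00210246 | 0.014 |
| et | Uralic | Estonian | 3286873851 | 0.00200399 | 0.0133 |
| bn | Indo-European | Bengali | 1627447450 | 0.000992245 | 0.0066 |
| ta | Dravidian | Tamil | 1476973397 | 0.000900502 | 0.006 |
| ur | Indo-European | Urdu | 1351891969 | 0.000824241 | 0.0055 |
| sw | Niger-Congo | Swahili | 907516139 | 0.000553307 | 0.0037 |
| te | Dravidian | Telugu | 689316485 | 0.000420272 | 0.0028 |
| eu | Language isolate | Basque | 105304423 | 6.42035e-05 | 0.0043 |
| my | Sino-Tibetan | Burmese | 101358331 | 6.17976e-05 | 0.003 |
| ht | Creole | Haitian, Haitian Creole | 86584697 | 5.27902e-05 | 0.0035 |
| qu | Quechuan | Quechua | 3236108 | 1.97304e-06 | 0.0001 |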
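
The ratio column appears to be each language's token count divided by the total over all 30 languages. The short check below recomputes it under that assumption, using the token counts from the table. The variable names are illustrative, and the upsampled column is not reproduced because the card does not state the upsampling formula.

```python
# Token counts copied from the training data table, keyed by ISO 639-1 code.
token_counts = {
    "en": 803526736124, "ru": 147791898098, "zh": 132770494630,
    "de": 89223707856, "es": 87303083105, "fr": 77419639775,
    "ja": 66054364513, "it": 41930465338, "pt": 36586032444,
    "el": 28762166159, "ko": 20002244535, "fi": 16804309722,
    "id": 15423541953, "tr": 12413166065, "ar": 12248607345,
    "vi": 11199121869, "th": 10842172807, "bg": 9703797869,
    "ca": 7075834775, "hi": 3448390110, "et": 3286873851,
    "bn": 1627447450, "ta": 1476973397, "ur": 1351891969,
    "sw": 907516139, "te": 689316485, "eu": 105304423,
    "my": 101358331, "ht": 86584697, "qu": 3236108,
}

# Assumption: ratio = tokens for one language / tokens for all languages.
total_tokens = sum(token_counts.values())
ratios = {lang: count / total_tokens for lang, count in token_counts.items()}

# Print a few recomputed ratios to compare against the table by eye.
for lang in ("en", "ko", "qu"):
    print(f"{lang}: {ratios[lang]:.6g}")
```

Comparing the printed values with the table is a quick way to confirm that the flattened columns were split at the right digit boundaries.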

Model card

For the intended usage of the model, please refer to the model card released by the XGLM-564M development team.

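Example (COPA)

The following snippet shows how to evaluate our model (GPT-3 style, zero-shot) on the Choice of Plausible Alternatives (COPA) task, using examples in English, Chinese, and Haitian Creole.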

import argparse

import torch
import torch.nn.functional as F
from openmind import AutoModelForCausalLM, AutoTokenizer, is_torch_npu_available

def get_logprobs(prompt):
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Logits at position t predict token t + 1, so the targets are the
    # input IDs shifted left by one.
    output_ids = inputs["input_ids"][:, 1:]
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
    # Gather the log-probability of each target token; the final logit
    # position has no target and is left out by the shorter index tensor.
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs

# Zero-shot evaluation for the Choice of Plausible Alternatives (COPA) task.
# A return value of 0 indicates that the first alternative is more plausible,
# while 1 indicates that the second alternative is more plausible.
def COPA_eval(prompt, alternative1, alternative2):
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    return 0 if lprob1 > lprob2 else 1

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/xglm-564M-openmind",
    )
    args = parser.parse_args()
    return args


args = parse_args()
model_path = args.model_name_or_path

# Use the first NPU if one is available, otherwise fall back to CPU.
if is_torch_npu_available():
    device = "npu:0"
else:
    device = "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# Load the model, move it to the selected device, and switch to eval mode.
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
model.eval()

data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.",
            "choice1": "I swept the floor in the unoccupied room.",
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.",
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。",
            "choice1": "我在空着的房间里扫了地板。",
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。",
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ],
    'ht': [  # Haitian Creole (ISO 639-1 code: ht)
        {
            "premise": "M te vle konsève enèji.",
            "choice1": "Mwen te fin baleye chanm lib la.",
            "choice2": "Mwen te femen limyè nan chanm lib la.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "Flam bouji a te etenn.",
            "choice1": "Mwen te soufle bouji a.",
            "choice2": "Mwen te limen mèch bouji a.",
            "question": "cause",
            "label": "0"
        }
    ]
}



# For each example, print its ID, the predicted alternative (0 or 1),
# and the gold label.
for lang in data_samples:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"], example["choice2"])
        print(f'{lang}-{idx}', predict, example['label'])
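
The scoring in the snippet above rests on one alignment detail: the logits at position t are paired with the token at position t + 1. The sketch below reproduces that step in pure Python on a toy model so it can be run without downloading any weights. The bigram table, `toy_model`, `sequence_logprob`, and `pick_alternative` are hypothetical names introduced only for illustration and are not part of openmind or the XGLM code.

```python
import math

# Hypothetical toy "model": the logits for the next token depend only on
# the current token. The vocabulary has 3 tokens; the numbers are made up.
TOY_BIGRAM_LOGITS = {
    0: [0.0, 2.0, 0.0],  # after token 0, token 1 is most likely
    1: [0.0, 0.0, 2.0],  # after token 1, token 2 is most likely
    2: [1.0, 0.0, 0.0],  # after token 2, token 0 is most likely
}

def toy_model(token_ids):
    """Return one row of logits per input position."""
    return [TOY_BIGRAM_LOGITS[t] for t in token_ids]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def sequence_logprob(token_ids):
    """Sum the log-probabilities of tokens 1..n-1 given their prefixes."""
    logits = toy_model(token_ids)
    total = 0.0
    # logits[t] scores the token at position t + 1; the last row has no
    # target and is skipped.
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total

def pick_alternative(logprob1, logprob2):
    """Return 0 if the first alternative scores higher, else 1."""
    return 0 if logprob1 > logprob2 else 1

score_a = sequence_logprob([0, 1, 2])  # follows the likely transitions
score_b = sequence_logprob([0, 2, 1])  # follows unlikely transitions
choice = pick_alternative(score_a, score_b)
print(score_a, score_b, choice)
```

Because the summed log-probability covers the premise as well as the alternative, and the premise is shared by both candidates, the comparison is driven by the tokens that differ between the two sequences.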