XGLM-564M

XGLM-564M is a multilingual autoregressive language model with 564 million parameters, trained on a balanced corpus of 30 diverse languages totaling 500 billion sub-tokens. It was introduced in the paper "Few-shot Learning with Multilingual Language Models" by Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li* (*equal contribution). The original implementation was released in this repository.

Training Data Statistics

The training data statistics of XGLM-564M are shown in the table below.

ISO-639-1	Family	Name	# Tokens	Ratio	Ratio w/ low-resource upsampling
en	Indo-European	English	803526736124	0.489906	0.3259
ru	Indo-European	Russian	147791898098	0.0901079	0.0602
zh	Sino-Tibetan	Chinese	132770494630	0.0809494	0.0483
de	Indo-European	German	89223707856	0.0543992	0.0363
es	Indo-European	Spanish	87303083105	0.0532282	0.0353
fr	Indo-European	French	77419639775	0.0472023	0.0313
ja	Japonic	Japanese	66054364513	0.040273	0.0269
it	Indo-European	Italian	41930465338	0.0255648	0.0171
pt	Indo-European	Portuguese	36586032444	0.0223063	0.0297
el	Indo-European	Greek (modern)	28762166159	0.0175361	0.0233
ko	Koreanic	Korean	20002244535	0.0121953	0.0811
fi	Uralic	Finnish	16804309722	0.0102455	0.0681
id	Austronesian	Indonesian	15423541953	0.00940365	0.0125
tr	Turkic	Turkish	12413166065	0.00756824	0.0101
ar	Afro-Asiatic	Arabic	12248607345	0.00746791	0.0099
vi	Austroasiatic	Vietnamese	11199121869	0.00682804	0.0091
th	Tai-Kadai	Thai	10842172807	0.00661041	0.044
bg	Indo-European	Bulgarian	9703797869	0.00591635	0.0393
ca	Indo-European	Catalan	7075834775	0.0043141	0.0287
hi	Indo-European	Hindi	3448390110	0.00210246	0.014
et	Uralic	Estonian	3286873851	0.00200399	0.0133
bn	Indo-European	Bengali	1627447450	0.000992245	0.0066
ta	Dravidian	Tamil	1476973397	0.000900502	0.006
ur	Indo-European	Urdu	1351891969	0.000824241	0.0055
sw	Niger-Congo	Swahili	907516139	0.000553307	0.0037
te	Dravidian	Telugu	689316485	0.000420272	0.0028
eu	Language isolate	Basque	105304423	6.42035e-05	0.0043
my	Sino-Tibetan	Burmese	101358331	6.17976e-05	0.003
ht	Creole	Haitian, Haitian Creole	86584697	5.27902e-05	0.0035
qu	Quechuan	Quechua	3236108	1.97304e-06	0.0001
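
As a rough cross-check of the table, the Ratio column appears to be each language's token count divided by the total over all 30 rows. The sketch below recomputes it from the counts listed above. This reading of the column is an assumption inferred from the numbers, and the variable names are illustrative rather than part of the XGLM code base.

```python
# Token counts copied from the table (ISO-639-1 code -> number of tokens).
token_counts = {
    "en": 803526736124, "ru": 147791898098, "zh": 132770494630,
    "de": 89223707856, "es": 87303083105, "fr": 77419639775,
    "ja": 66054364513, "it": 41930465338, "pt": 36586032444,
    "el": 28762166159, "ko": 20002244535, "fi": 16804309722,
    "id": 15423541953, "tr": 12413166065, "ar": 12248607345,
    "vi": 11199121869, "th": 10842172807, "bg": 9703797869,
    "ca": 7075834775, "hi": 3448390110, "et": 3286873851,
    "bn": 1627447450, "ta": 1476973397, "ur": 1351891969,
    "sw": 907516139, "te": 689316485, "eu": 105304423,
    "my": 101358331, "ht": 86584697, "qu": 3236108,
}

total_tokens = sum(token_counts.values())

# Assumption: ratio = tokens for one language / tokens over all 30 languages.
ratios = {lang: count / total_tokens for lang, count in token_counts.items()}

# Compare a few recomputed values against the Ratio column by eye.
for lang in ("en", "ru", "qu"):
    print(f"{lang}: {ratios[lang]:.6g}")
```

The last column (ratio after low-resource upsampling) is not reproduced here, since the table alone does not state how the upsampling was applied.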

Model Card

For the intended use of the model, please refer to the model card released by the XGLM-564M development team.

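Example (COPA)

The following snippet shows how to evaluate our model (GPT-3 style, zero-shot) on the Choice of Plausible Alternatives (COPA) task, using examples in English, Chinese, and Haitian Creole.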

from openmind import AutoTokenizer, AutoModelForCausalLM, is_torch_npu_available
import torch.nn.functional as F
import torch
import argparse

def get_logprobs(prompt):
    # Tokenize the prompt and move the tensors to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # The logits at position t score the token at position t + 1,
    # so the targets are the input ids shifted left by one.
    output_ids = inputs["input_ids"][:, 1:]
    with torch.no_grad():
        logits = model(**inputs).logits
    # Log-probability the model assigns to each actual next token.
    logprobs = torch.gather(F.log_softmax(logits[:, :-1], dim=2), 2, output_ids.unsqueeze(2))
    return logprobs

# Zero-shot evaluation for the Choice of Plausible Alternatives (COPA) task.
# A return value of 0 indicates that the first alternative is more plausible,
# while 1 indicates that the second alternative is more plausible.
def COPA_eval(prompt, alternative1, alternative2):
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    return 0 if lprob1 > lprob2 else 1

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="jeffding/xglm-564M-openmind",
    )
    args = parser.parse_args()
    return args


args = parse_args()
model_path = args.model_name_or_path

# Run on the first NPU when one is available, otherwise fall back to the CPU.
if is_torch_npu_available():
    device = "npu:0"
else:
    device = "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
model.eval()

data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.",
            "choice1": "I swept the floor in the unoccupied room.",
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.",
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。",
            "choice1": "我在空着的房间里扫了地板。",
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。",
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ],
    # These samples are Haitian Creole (ISO-639-1 code "ht").
    'ht': [
        {
            "premise": "M te vle konsève enèji.",
            "choice1": "Mwen te fin baleye chanm lib la.",
            "choice2": "Mwen te femen limyè nan chanm lib la.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "Flam bouji a te etenn.",
            "choice1": "Mwen te soufle bouji a.",
            "choice2": "Mwen te limen mèch bouji a.",
            "question": "cause",
            "label": "0"
        }
    ]
}



# For each example, print "<lang>-<index>", the predicted alternative, and the gold label.
for lang in data_samples:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"], example["choice2"])
        print(f'{lang}-{idx}', predict, example['label'])
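
The decision rule in COPA_eval does not depend on XGLM itself: each candidate is scored by the summed log-probability of its tokens, and the higher score wins. The minimal sketch below illustrates that rule with a hand-written toy bigram table standing in for a real language model. `TOY_BIGRAM_PROBS`, `sequence_logprob`, and `choose_alternative` are hypothetical names introduced only for illustration, and the probabilities are made up.

```python
import math

# Made-up next-token probabilities P(next | previous) for a toy vocabulary.
# These numbers are illustrative assumptions, not outputs of XGLM.
TOY_BIGRAM_PROBS = {
    ("<s>", "dark"): 0.5,
    ("dark", "so"): 0.6,
    ("so", "lights"): 0.4,
    ("so", "floor"): 0.1,
    ("lights", "on"): 0.7,
    ("floor", "swept"): 0.3,
}
UNSEEN_PROB = 1e-6  # fallback probability for pairs missing from the table


def sequence_logprob(tokens):
    """Sum log P(token_t | token_{t-1}) over the sequence, starting from <s>."""
    total = 0.0
    previous = "<s>"
    for token in tokens:
        total += math.log(TOY_BIGRAM_PROBS.get((previous, token), UNSEEN_PROB))
        previous = token
    return total


def choose_alternative(premise, alternative1, alternative2):
    """Return 0 if premise + alternative1 scores higher, otherwise 1."""
    score1 = sequence_logprob(premise + alternative1)
    score2 = sequence_logprob(premise + alternative2)
    return 0 if score1 > score2 else 1


premise = ["dark", "so"]
prediction = choose_alternative(premise, ["floor", "swept"], ["lights", "on"])
print(prediction)  # prints 1: the second alternative has the higher score
```

Because the sum is not length-normalized, longer alternatives accumulate more negative terms. The script above uses the same unnormalized sum, which is worth keeping in mind when the two choices differ greatly in length.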
