XGLM-1.7B

XGLM-1.7B is a multilingual autoregressive language model with 1.7 billion parameters, trained on a balanced corpus covering a diverse set of languages and totaling 500 billion sub-tokens. It was introduced in the paper "Few-shot Learning with Multilingual Language Models" (https://arxiv.org/abs/2112.10668) by Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li* (*equal contribution). The original implementation was released in this repository.

Modifications

Modified the example in README.md and added NPU support.

Training Data Statistics

The training data statistics of XGLM-1.7B are shown in the table below.

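| ISO-639-1 | Family | Name | # tokens | Ratio | Ratio w/ low-resource upsampling |
|---|---|---|---|---|---|
| en | Indo-European | English | 803526736124 | 0.489906 | 0.3259 |
| ru | Indo-European | Russian | 147791898098 | 0.0901079 | 0.0602 |
| zh | Sino-Tibetan | Chinese | 132770494630 | 0.0809494 | 0.0483 |
| de | Indo-European | German | 89223707856 | 0.0543992 | 0.0363 |
| es | Indo-European | Spanish | 87303083105 | 0.0532282 | 0.0353 |
| fr | Indo-European | French | 77419639775 | 0.0472023 | 0.0313 |
| ja | Japonic | Japanese | 66054364513 | 0.040273 | 0.0269 |
| it | Indo-European | Italian | 41930465338 | 0.0255648 | 0.0171 |
| pt | Indo-European | Portuguese | 36586032444 | 0.0223063 | 0.0297 |
| el | Indo-European | Greek (modern) | 28762166159 | 0.0175361 | 0.0233 |
| ko | Koreanic | Korean | 20002244535 | 0.0121953 | 0.0811 |
| fi | Uralic | Finnish | 16804309722 | 0.0102455 | 0.0681 |
| id | Austronesian | Indonesian | 15423541953 | 0.00940365 | 0.0125 |
| tr | Turkic | Turkish | 12413166065 | 0.00756824 | 0.0101 |
| ar | Afro-Asiatic | Arabic | 12248607345 | 0.00746791 | 0.0099 |
| vi | Austroasiatic | Vietnamese | 11199121869 | 0.00682804 | 0.0091 |
| th | Tai-Kadai | Thai | 10842172807 | 0.00661041 | 0.044 |
| bg | Indo-European | Bulgarian | 9703797869 | 0.00591635 | 0.0393 |
| ca | Indo-European | Catalan | 7075834775 | 0.0043141 | 0.0287 |
| hi | Indo-European | Hindi | 3448390110 | 0.00210246 | 0.014 |
| et | Uralic | Estonian | 3286873851 | 0.00200399 | 0.0133 |
| bn | Indo-European | Bengali | 1627447450 | 0.000992245 | 0.0066 |
| ta | Dravidian | Tamil | 1476973397 | 0.000900502 | 0.006 |
| ur | Indo-European | Urdu | 1351891969 | 0.000824241 | 0.0055 |
| sw | Niger-Congo | Swahili | 907516139 | 0.000553307 | 0.0037 |
| te | Dravidian | Telugu | 689316485 | 0.000420272 | 0.0028 |
| eu | Language isolate | Basque | 105304423 | 6.42035e-05 | 0.0043 |
| my | Sino-Tibetan | Burmese | 101358331 | 6.17976e-05 | 0.003 |
| ht | Creole | Haitian, Haitian Creole | 86584697 | 5.27902e-05 | 0.0035 |
| qu | Quechuan | Quechua | 3236108 | 1.97304e-06 | 0.0001 |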
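
The ratio column appears to be each language's share of the summed token counts listed in the table. This reading is inferred from the numbers themselves and is not stated in the source, so treat the sketch below as a sanity check rather than a description of how the authors produced the column. The upsampled ratios are not recomputed here because the source does not give the upsampling rule.

```python
# Token counts copied from the token-count column of the table above.
token_counts = {
    "en": 803526736124, "ru": 147791898098, "zh": 132770494630,
    "de": 89223707856, "es": 87303083105, "fr": 77419639775,
    "ja": 66054364513, "it": 41930465338, "pt": 36586032444,
    "el": 28762166159, "ko": 20002244535, "fi": 16804309722,
    "id": 15423541953, "tr": 12413166065, "ar": 12248607345,
    "vi": 11199121869, "th": 10842172807, "bg": 9703797869,
    "ca": 7075834775, "hi": 3448390110, "et": 3286873851,
    "bn": 1627447450, "ta": 1476973397, "ur": 1351891969,
    "sw": 907516139, "te": 689316485, "eu": 105304423,
    "my": 101358331, "ht": 86584697, "qu": 3236108,
}

# Assumption: ratio = tokens for one language / tokens summed over all listed languages.
total = sum(token_counts.values())
ratios = {lang: count / total for lang, count in token_counts.items()}

# Print a few shares for comparison with the ratio column.
for lang in ("en", "zh", "qu"):
    print(f"{lang}: {ratios[lang]:.6g}")
```

Under this assumption the recomputed shares can be compared row by row with the ratio column; small differences in the last digit would be due to rounding in the table.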

Model card

For the intended usage of the model, please refer to the model card released by the xglm_1.7b development team.

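Example (COPA)

The following snippet shows how to evaluate our model (GPT-3 style, zero-shot) on the Choice of Plausible Alternatives (COPA) task, using examples in English, Chinese, and Haitian Creole.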

import torch
import torch.nn.functional as F
from openmind import is_torch_npu_available, AutoTokenizer
from transformers import XGLMForCausalLM
 

if is_torch_npu_available():
    device = "npu:0"
elif torch.cuda.is_available():
    device = "cuda:0"
else:
    device = "cpu"

model_name_or_path = 'PyTorch-NPU/xglm_1.7b'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = XGLMForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, device_map=device)

data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.",
            "choice1": "I swept the floor in the unoccupied room.",
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.",
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。",
            "choice1": "我在空着的房间里扫了地板。",
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。",
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ],
    'ht': [  # Haitian Creole samples
        {
            "premise": "M te vle konsève enèji.",
            "choice1": "Mwen te fin baleye chanm lib la.",
            "choice2": "Mwen te femen limyè nan chanm lib la.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "Flam bouji a te etenn.",
            "choice1": "Mwen te soufle bouji a.",
            "choice2": "Mwen te limen mèch bouji a.",
            "question": "cause",
            "label": "0"
        }
    ]
}


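def get_logprobs(prompt, device):
    """Return the log-probability of each token in `prompt` given the tokens before it."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Targets are the inputs shifted left by one: logits at position t score token t + 1.
    output_ids = inputs["input_ids"][:, 1:]
    with torch.no_grad():  # inference only, so gradients are not needed
        logits = model(**inputs).logits
    # Drop the last position, which has no target token.
    logprobs = torch.gather(F.log_softmax(logits[:, :-1], dim=2), 2,
                            output_ids.unsqueeze(2))
    return logprobs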


def COPA_eval(prompt, alternative1, alternative2, device):
    """Return 0 if alternative1 has the higher summed log-probability, else 1."""
    lprob1 = get_logprobs(prompt + "\n" + alternative1, device).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2, device).sum()
    return 0 if lprob1 > lprob2 else 1


for lang in data_samples:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"],
                            example["choice2"], device)
        print(f'{lang}-{idx}', predict, example['label'])
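
The scoring rule behind `get_logprobs` and `COPA_eval` can be illustrated without downloading the model. The sketch below is a framework-free stand-in and makes several assumptions: `toy_logits` is a hypothetical function that takes the place of the real model, the token ids are made up, and `sequence_logprob` and `choose` are illustrative names. It shows the two ingredients of the evaluation: the logits at position t score the token at position t + 1, and the alternative with the higher summed log-probability is selected. It also shows one way to turn predictions into an accuracy, given that the sample labels are stored as strings.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def toy_logits(token_ids, vocab_size=5):
    """Hypothetical stand-in for the model: one row of logits per input position.

    This toy "model" simply favours (previous token + 1) % vocab_size as the next token.
    """
    rows = []
    for tok in token_ids:
        row = [0.0] * vocab_size
        row[(tok + 1) % vocab_size] = 2.0
        rows.append(row)
    return rows


def sequence_logprob(token_ids):
    """Sum of log P(token[t + 1] | tokens[0..t]) over the sequence."""
    logits = toy_logits(token_ids)
    total = 0.0
    # The last row of logits has no target, so pair logits[:-1] with token_ids[1:].
    for row, target in zip(logits[:-1], token_ids[1:]):
        total += log_softmax(row)[target]
    return total


def choose(alternative1, alternative2):
    """Return 0 if the first sequence scores higher, else 1 (same convention as COPA_eval)."""
    return 0 if sequence_logprob(alternative1) > sequence_logprob(alternative2) else 1


# Made-up token ids; in each pair one sequence follows the order the toy model prefers.
examples = [
    {"a": [0, 1, 2, 3], "b": [0, 3, 1, 2], "label": "0"},
    {"a": [2, 0, 4, 1], "b": [2, 3, 4, 0], "label": "1"},
]
predictions = [choose(ex["a"], ex["b"]) for ex in examples]

# Labels are strings in the samples, so convert them before comparing with integer predictions.
accuracy = sum(p == int(ex["label"]) for p, ex in zip(predictions, examples)) / len(examples)
print(predictions, accuracy)
```

With the real model, the same comparison is applied to the tokenized premise followed by each alternative. Note that summed log-probabilities are not length-normalized, so alternatives with very different token counts are not scored on equal footing.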