Multilingual MiniLMv2-L6-mnli-xnli

Model description

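This multilingual model can perform natural language inference (NLI) in more than 100 languages and is therefore also suitable for multilingual zero-shot classification. The underlying multilingual-MiniLM-L6 model was created by Microsoft and was distilled from XLM-RoBERTa-large (see the original paper and the updated information in this repo for details). The model was then fine-tuned on the XNLI dataset, which contains hypothesis-premise pairs in 15 languages, and on the English MNLI dataset.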

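The main advantage of distilled models is that they are smaller than their teacher (XLM-RoBERTa-large), which means faster inference and lower memory requirements. The drawback is that they lose some of the performance of the larger teacher.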

For maximal inference speed, this 6-layer model is recommended. For higher performance, mDeBERTa-v3-base-mnli-xnli is recommended (as of February 14, 2023).

How to use the model

Using the model with openMind

import argparse
import time

import torch
from openmind import pipeline, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser(description="Eval the model")
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="path or model",
        default="jeffding/multilingual-MiniLMv2-L6-mnli-xnli-openmind",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Prefer an Ascend NPU when one is available; otherwise fall back to CPU.
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    # The timer starts before the pipeline is built, so it also covers model loading.
    start_time = time.time()

    classifier = pipeline("zero-shot-classification", model=model_path, device_map=device)

    sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
    candidate_labels = ["politics", "economy", "entertainment", "environment"]
    output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
    print(output)

    end_time = time.time()
    print(f"Device: {device}, elapsed time (model loading + inference): {end_time - start_time} s")


if __name__ == "__main__":
    main()

Simple zero-shot classification pipeline

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli")

sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
print(output)
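
A zero-shot pipeline of this kind reduces classification to NLI: each candidate label is turned into a hypothesis, the model scores it against the input text as the premise, and with `multi_label=False` the entailment logits are normalized across labels with a softmax. The sketch below illustrates only that post-processing step in plain Python, without loading the model. The template `"This example is {}."` is the default `hypothesis_template` in `transformers`; the helper names are hypothetical, and the logits are invented for illustration rather than produced by this model.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis string."""
    return [template.format(label) for label in labels]


def single_label_scores(labels, entailment_logits):
    """Softmax over per-label entailment logits (the multi_label=False case)."""
    shift = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in entailment_logits]
    total = sum(exps)
    scores = {label: e / total for label, e in zip(labels, exps)}
    # Order from most to least likely label.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


candidate_labels = ["politics", "economy", "entertainment", "environment"]
hypotheses = build_hypotheses(candidate_labels)

# Illustrative entailment logits, one per (premise, hypothesis) pair.
# These numbers are made up and do not come from the model.
example_logits = [3.2, 0.4, -1.5, -0.8]

scores = single_label_scores(candidate_labels, example_logits)
print(hypotheses[0])
print(scores)
```

Because the scores are normalized across labels, they sum to 1 and compete with each other; this suits tasks where exactly one label applies.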

NLI use case

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move the model to the same device as the inputs.
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

premise = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
hypothesis = "Emmanuel Macron is the President of France"

# Tokenize the premise-hypothesis pair; the name `inputs` avoids shadowing the built-in `input`.
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)

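Training data

The model was trained on the XNLI development set and the MNLI training set. The XNLI development set consists of 2490 texts professionally translated from English into 14 other languages (37350 texts in total) (see this paper). Note that XNLI also contains a training set of machine-translated versions of MNLI in 15 languages, but because of quality issues with these machine translations, this model was trained only on the professional translations from the XNLI development set and the original English MNLI training set (392702 texts). Not using machine-translated texts can avoid overfitting the model to the 15 languages, avoids catastrophic forgetting of the other languages seen during pre-training, and significantly reduces training costs.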
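
The dataset sizes quoted above can be checked with a few lines of arithmetic. The combined total and the XNLI share are derived here from the two figures in the text and are not stated in the source; the variable names are illustrative.

```python
# Figures taken from the paragraph above.
xnli_dev_texts_per_language = 2490
xnli_languages = 15          # English plus 14 professionally translated languages
mnli_train_texts = 392_702

xnli_dev_total = xnli_dev_texts_per_language * xnli_languages

# Derived values (not stated in the source).
combined_total = xnli_dev_total + mnli_train_texts
xnli_share = xnli_dev_total / combined_total

print(f"XNLI dev texts: {xnli_dev_total}")
print(f"Combined training texts: {combined_total}")
print(f"XNLI share of the training mix: {xnli_share:.1%}")
```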

Training procedure

The model was trained with the Hugging Face Trainer using the following hyperparameters. The exact underlying model is mMiniLMv2-L6-H384-distilled-from-XLMR-Large.

from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=3,               # total number of training epochs
    learning_rate=4e-05,              # learning rate (peak value after warmup)
    per_device_train_batch_size=64,   # batch size per device during training
    per_device_eval_batch_size=120,   # batch size for evaluation
    warmup_ratio=0.06,                # fraction of total training steps used for learning-rate warmup
    weight_decay=0.01,                # strength of weight decay
)
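
To make `warmup_ratio` concrete, the following sketch estimates how many optimizer steps the warmup phase would cover with these hyperparameters. It assumes a single device, no gradient accumulation, and a training set of 430,052 texts (392,702 MNLI plus 37,350 XNLI dev texts, the figures given in the training data section); none of these assumptions is confirmed by the source. The rounding is meant to mirror how the Hugging Face Trainer derives warmup steps from a ratio, so treat the result as an estimate.

```python
import math

num_train_texts = 392_702 + 37_350   # assumption: MNLI train + XNLI dev, all used
per_device_train_batch_size = 64
num_devices = 1                      # assumption: single GPU
gradient_accumulation_steps = 1      # assumption: no gradient accumulation
num_train_epochs = 3
warmup_ratio = 0.06

# Batches per epoch; the last, partially filled batch is kept.
batches_per_epoch = math.ceil(
    num_train_texts / (per_device_train_batch_size * num_devices)
)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

# The warmup phase covers this many optimizer steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
```

With more devices or gradient accumulation, the number of optimizer steps shrinks accordingly, and so does the warmup phase.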

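Evaluation results

The model was evaluated on the XNLI test set in 15 languages (5010 texts per language, 75150 in total). Note that multilingual NLI models can classify NLI texts without having received NLI training data in the specific language (cross-lingual transfer). This means the model can also perform NLI in the other languages it was trained on during pre-training, but performance is most likely lower than for the languages included in XNLI.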

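The paper reports an average XNLI performance of 0.68 for multilingual-MiniLM-L6 (see table 11). This reimplementation reaches an average of 0.713. The improvement is probably due to the addition of MNLI to the training data and to the fact that this model was distilled from XLM-RoBERTa-large rather than -base (multilingual-MiniLM-L6-v2).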

Datasets	avg_xnli	ar	bg	de	el	en	es	fr	hi	ru	sw	th	tr	ur	vi	zh
Accuracy	0.713	0.687	0.742	0.719	0.723	0.789	0.748	0.741	0.691	0.714	0.642	0.699	0.696	0.664	0.723	0.721
Speed (texts/sec, A100 GPU, eval_batch=120)	6093.0	6210.0	6003.0	6053.0	5409.0	6531.0	6205.0	5615.0	5734.0	5970.0	6219.0	6289.0	6533.0	5851.0	5970.0	6798.0

Datasets	mnli_m	mnli_mm
Accuracy	0.782	0.8
Speed (texts/sec, A100 GPU, eval_batch=120)	4430.0	4395.0
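
As a quick consistency check, the reported XNLI average can be recomputed from the per-language accuracies in the XNLI table above. The snippet is only a sketch: it copies the table values by hand and compares the mean with the 0.68 baseline quoted from the paper.

```python
# Per-language XNLI test accuracy, copied from the XNLI table above.
xnli_accuracy = {
    "ar": 0.687, "bg": 0.742, "de": 0.719, "el": 0.723, "en": 0.789,
    "es": 0.748, "fr": 0.741, "hi": 0.691, "ru": 0.714, "sw": 0.642,
    "th": 0.699, "tr": 0.696, "ur": 0.664, "vi": 0.723, "zh": 0.721,
}
paper_baseline = 0.68  # average quoted from the paper for multilingual-MiniLM-L6

mean_accuracy = sum(xnli_accuracy.values()) / len(xnli_accuracy)
best_language = max(xnli_accuracy, key=xnli_accuracy.get)
worst_language = min(xnli_accuracy, key=xnli_accuracy.get)

print(f"Mean over {len(xnli_accuracy)} languages: {mean_accuracy:.3f}")
print(f"Best: {best_language}, worst: {worst_language}")
print(f"Difference to the paper baseline: {mean_accuracy - paper_baseline:+.3f}")
```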

Limitations and bias

For potential biases, please consult the original paper and the literature on the different NLI datasets.

Citation

If you use this model, please cite: Laurer, Moritz, Wouter van Atteveldt, Andreu Salleras Casas, and Kasper Welbers. 2022. ‘Less Annotating, More Classifying – Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI’. Preprint, June. Open Science Framework. https://osf.io/74b8k.

Ideas for cooperation or questions?

If you have questions or ideas for cooperation, contact me at m{dot}laurer{at}vu{dot}nl or on LinkedIn.