
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

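DeBERTa improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on the majority of natural language understanding (NLU) tasks when trained on 80GB of data.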

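In DeBERTa V3, we further improved the efficiency of DeBERTa by using ELECTRA-style pre-training combined with gradient-disentangled embedding sharing. Compared to DeBERTa, our V3 version significantly improves model performance on downstream tasks. More technical details about the new model can be found in our paper.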
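As a compact reading of "gradient-disentangled embedding sharing", the sketch below writes the discriminator's token embeddings as the generator's embeddings plus a residual, with gradients blocked on the shared part. The notation ($E_G$, $E_D$, $E_\Delta$, $\mathrm{sg}$) is an assumption taken from how the DeBERTaV3 paper presents the idea and does not appear elsewhere in this card; the paper remains the authoritative statement.

```latex
% E_G:      generator token embeddings (shared)
% E_D:      discriminator token embeddings
% E_\Delta: residual embeddings owned by the discriminator
% sg:       stop-gradient operator
E_D = \mathrm{sg}(E_G) + E_\Delta
```

Under this parameterization the discriminator's loss updates only $E_\Delta$, so it does not pull the shared embeddings $E_G$ away from what the generator's objective favors.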

For more implementation details and updates, please check the official repository.

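The DeBERTa V3 xsmall model has 12 layers and a hidden size of 384. Its backbone has only 22M parameters, and its vocabulary contains 128K tokens, which adds 48M parameters in the embedding layer. This model was trained on the same 160GB of data as DeBERTa V2.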
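The embedding parameter count quoted above follows from multiplying the vocabulary size by the hidden size. The sketch below is a back-of-the-envelope check only: it assumes "128K" means exactly 128,000 tokens and counts just the token embedding matrix. The function name is illustrative and not part of any library.

```python
def embedding_param_count(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size token embedding matrix."""
    return vocab_size * hidden_size


# Assumption: "128K" is treated as exactly 128,000 tokens.
VOCAB_SIZE = 128_000
HIDDEN_SIZE = 384              # hidden size of the xsmall model, from the text above
BACKBONE_PARAMS = 22_000_000   # backbone parameter count, from the text above

embedding_params = embedding_param_count(VOCAB_SIZE, HIDDEN_SIZE)
total_params = BACKBONE_PARAMS + embedding_params

print(f"embedding parameters: {embedding_params / 1e6:.1f}M")
print(f"backbone + embedding: {total_params / 1e6:.1f}M")
```

Under these assumptions the product comes to roughly 49M, in the same range as the 48M quoted above; the small gap presumably comes from how the figures are rounded. Either way, the embedding layer holds about twice as many parameters as the backbone.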

Fine-tuning on NLU tasks

We report dev-set results on the SQuAD 2.0 and MNLI tasks.

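| Model | Vocabulary (K) | Backbone parameters (M) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (accuracy) |
| --- | --- | --- | --- | --- |
| RoBERTa-base | 50 | 86 | 83.7/80.5 | 87.6/- |
| XLNet-base | 32 | 92 | -/80.2 | 86.8/- |
| ELECTRA-base | 30 | 86 | -/80.5 | 88.8/- |
| DeBERTa-base | 50 | 100 | 86.2/83.1 | 88.8/88.5 |
| DeBERTa-v3-large | 128 | 304 | 91.5/89.0 | 91.8/91.9 |
| DeBERTa-v3-base | 128 | 86 | 88.4/85.4 | 90.6/90.7 |
| DeBERTa-v3-small | 128 | 44 | 82.8/80.4 | 88.3/87.7 |
| DeBERTa-v3-xsmall | 128 | 22 | 84.8/82.0 | 88.1/88.3 |
| DeBERTa-v3-xsmall+SiFT | 128 | 22 | -/- | 88.4/88.5 |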

Fine-tuning with HF transformers

#!/bin/bash

cd transformers/examples/pytorch/text-classification/

pip install datasets
export TASK_NAME=mnli

output_dir="ds_results"

num_gpus=8

batch_size=8

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-xsmall \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 1000 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 4.5e-5 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
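To make the scale of this run concrete, the sketch below derives the global batch size, the number of optimizer steps, and the share of training spent in warmup from the script's settings. It assumes no gradient accumulation (the script does not set it) and an MNLI training split of 392,702 examples, a figure that does not come from the script. The helper name is illustrative.

```python
import math


def training_schedule(num_examples, num_gpus, per_device_batch_size, num_epochs, warmup_steps):
    """Derive global batch size, optimizer steps, and warmup share for a data-parallel run."""
    global_batch_size = num_gpus * per_device_batch_size
    steps_per_epoch = math.ceil(num_examples / global_batch_size)
    total_steps = steps_per_epoch * num_epochs
    return {
        "global_batch_size": global_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# Assumption: 392,702 is the size of the MNLI training split; it is not stated in the script.
schedule = training_schedule(
    num_examples=392_702,
    num_gpus=8,                # num_gpus in the script
    per_device_batch_size=8,   # batch_size in the script
    num_epochs=3,              # --num_train_epochs
    warmup_steps=1000,         # --warmup_steps
)
print(schedule)
```

With these inputs, 8 GPUs at 8 examples each give a global batch of 64, and the 1000 warmup steps cover a little over 5% of the run. Changing `num_gpus` without adjusting `batch_size` changes the global batch size, which may call for retuning the learning rate.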

Example inference code

from openmind import AutoModelForSequenceClassification, AutoTokenizer, is_torch_npu_available
import torch
import argparse


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/deberta-v3-xsmall",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
        
    
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

    premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
    hypothesis = "The movie was good."

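    # Tokenize the premise/hypothesis pair and move every tensor to the target device.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**inputs)
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    # These label names assume a checkpoint fine-tuned on a 3-class NLI task such as MNLI;
    # the pretrained backbone alone has a randomly initialized classification head.
    label_names = ["entailment", "neutral", "contradiction"]
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    print(prediction)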


if __name__ == "__main__":
    main()
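The post-processing step in the script above (softmax over the logits, then pairing each probability with a label name) can be tried without loading a model. The sketch below uses only the standard library; the logits are made up for illustration and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_percentages(logits, label_names):
    """Map each label name to its softmax probability, as a percentage rounded to 0.1."""
    if len(logits) != len(label_names):
        raise ValueError("need exactly one label name per logit")
    return {name: round(p * 100, 1) for name, p in zip(label_names, softmax(logits))}


# Made-up logits for illustration; they are not output from the model.
example_logits = [-1.2, 0.3, 2.5]
label_names = ["entailment", "neutral", "contradiction"]
scores = to_percentages(example_logits, label_names)
print(scores)
```

The explicit length check is a deliberate choice: `zip` silently truncates to the shorter sequence, so a classification head with a different number of outputs than label names would otherwise produce a misleading dictionary.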

Citation

If you find DeBERTa useful for your work, please cite the following papers:

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}