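deberta-v3-xsmall: suitable for natural language understanding tasks such as text classification and question answering. This project provides the DeBERTa V3 xsmall model, which has 12 layers, a hidden size of 384, and 22M backbone parameters. It is pre-trained in the ELECTRA style and performs well on tasks such as SQuAD 2.0 and MNLI.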

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

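DeBERTa improves on the BERT and RoBERTa models with disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on most natural language understanding (NLU) tasks when trained on 80GB of data.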

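In DeBERTa V3, we further improved the efficiency of DeBERTa by using ELECTRA-style pre-training with gradient-disentangled embedding sharing. Compared to DeBERTa, our V3 version significantly improves model performance on downstream tasks. More technical details about the new model can be found in our paper.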
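
As a rough sketch of what gradient-disentangled embedding sharing means, based on our reading of the DeBERTaV3 paper: the discriminator reuses the generator's token embeddings through a stop-gradient and adds its own residual embeddings. The symbols $E_G$, $E_D$, $E_\Delta$ and $\mathrm{sg}$ are notation assumed here for illustration; see the paper for the exact formulation.

```latex
E_D = \mathrm{sg}(E_G) + E_\Delta
```

Here $E_G$ is the generator's token embedding matrix, $\mathrm{sg}(\cdot)$ blocks gradients from flowing back into $E_G$, and $E_\Delta$ is a residual embedding matrix that starts at zero and is updated only by the discriminator's replaced-token-detection loss. The generator's masked-language-modeling loss is then the only signal that updates $E_G$.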

Please check the official repository for more implementation details and updates.

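The DeBERTa V3 xsmall model has 12 layers and a hidden size of 384. Its backbone has only 22M parameters, and its vocabulary contains 128K tokens, which adds 48M parameters in the embedding layer. The model was trained on the same 160GB of data as DeBERTa V2.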
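
The embedding figure follows from multiplying the vocabulary size by the hidden size. The snippet below is a back-of-the-envelope check, not an exact count: it assumes "128K" means exactly 128,000 tokens, which gives roughly 49M, in the same range as the 48M quoted above. The variable names are illustrative only.

```python
# Back-of-the-envelope parameter count for the token embedding layer.
# Assumption: "128K" is taken as exactly 128,000 tokens; the real tokenizer
# vocabulary size may differ slightly.
vocab_size = 128_000
hidden_size = 384
backbone_params = 22_000_000  # backbone figure quoted in the text

# One hidden_size-dimensional vector per vocabulary entry.
embedding_params = vocab_size * hidden_size
total_params = backbone_params + embedding_params

print(f"embedding parameters: {embedding_params / 1e6:.1f}M")
print(f"backbone + embedding: {total_params / 1e6:.1f}M")
print(f"embedding share of the total: {embedding_params / total_params:.0%}")
```

Under this assumption the embedding layer holds well over half of all parameters, which is why the backbone size is reported separately.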

Fine-tuning on NLU tasks

We report dev set results on SQuAD 2.0 and MNLI.

Model	Vocabulary (K)	Backbone #Params (M)	SQuAD 2.0 (F1/EM)	MNLI-m/mm (ACC)
RoBERTa-base	50	86	83.7/80.5	87.6/-
XLNet-base	32	92	-/80.2	86.8/-
ELECTRA-base	30	86	-/80.5	88.8/-
DeBERTa-base	50	100	86.2/83.1	88.8/88.5
DeBERTa-v3-large	128	304	91.5/89.0	91.8/91.9
DeBERTa-v3-base	128	86	88.4/85.4	90.6/90.7
DeBERTa-v3-small	128	44	82.8/80.4	88.3/87.7
DeBERTa-v3-xsmall	128	22	84.8/82.0	88.1/88.3
DeBERTa-v3-xsmall+SiFT	128	22	-/-	88.4/88.5
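
To make the size/quality trade-off in the table easier to read, the following sketch compares the xsmall model against two larger baselines. All numbers are copied from the table above; the `results` dictionary and the `compare` helper are illustrative names, not part of any library.

```python
# Dev-set scores copied from the table above:
# (backbone params in M, SQuAD 2.0 F1, MNLI-m accuracy).
results = {
    "RoBERTa-base": (86, 83.7, 87.6),
    "DeBERTa-base": (100, 86.2, 88.8),
    "DeBERTa-v3-base": (86, 88.4, 90.6),
    "DeBERTa-v3-xsmall": (22, 84.8, 88.1),
}


def compare(model, baseline):
    """Return (parameter ratio, SQuAD F1 delta, MNLI-m delta) of model vs. baseline."""
    params, f1, acc = results[model]
    base_params, base_f1, base_acc = results[baseline]
    return (
        round(params / base_params, 2),
        round(f1 - base_f1, 1),
        round(acc - base_acc, 1),
    )


for baseline in ("RoBERTa-base", "DeBERTa-base"):
    ratio, d_f1, d_acc = compare("DeBERTa-v3-xsmall", baseline)
    print(f"vs {baseline}: {ratio:.2f}x backbone params, F1 {d_f1:+.1f}, MNLI-m {d_acc:+.1f}")
```

With about a quarter of RoBERTa-base's backbone parameters, the xsmall model scores higher on both tasks, while it trails the original DeBERTa-base by about a point.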

Fine-tuning with HF transformers

#!/bin/bash

cd transformers/examples/pytorch/text-classification/

pip install datasets
export TASK_NAME=mnli

output_dir="ds_results"

num_gpus=8

batch_size=8

# torchrun supersedes the deprecated torch.distributed.launch entry point
torchrun --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-xsmall \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 1000 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 4.5e-5 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
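
The script's settings imply a global batch size and a training schedule that are worth checking before launching a run. The sketch below derives them under two assumptions that are not stated in this document: that the MNLI training split has 392,702 examples, and that gradient accumulation is left at 1.

```python
import math

# Values taken from the fine-tuning script.
num_gpus = 8
per_device_batch_size = 8
num_train_epochs = 3
warmup_steps = 1000

# Assumption: size of the MNLI training split (not stated in this document).
mnli_train_examples = 392_702

# Assumption: no gradient accumulation, so one optimizer step per global batch.
global_batch_size = num_gpus * per_device_batch_size
steps_per_epoch = math.ceil(mnli_train_examples / global_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_fraction = warmup_steps / total_steps

print(f"global batch size: {global_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup covers {warmup_fraction:.1%} of training")
```

If you run on fewer GPUs, scaling `per_device_train_batch_size` so that the global batch size stays the same is one way to keep the schedule comparable.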

Example inference code

import argparse

import torch
from openmind import AutoModelForSequenceClassification, AutoTokenizer, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/deberta-v3-xsmall",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use an NPU when one is available, otherwise fall back to the CPU.
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map=device,
        trust_remote_code=True,
    )

    premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
    hypothesis = "The movie was good."

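    # Encode the sentence pair and move every tensor (including the attention mask) to the device.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(**inputs)
    prediction = torch.softmax(output["logits"][0].float(), -1).tolist()
    # These label names assume a checkpoint fine-tuned for three-way NLI (e.g. MNLI);
    # the label order depends on that checkpoint's configuration.
    label_names = ["entailment", "neutral", "contradiction"]
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    print(prediction)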


if __name__ == "__main__":
    main()
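
The last step of the script turns raw logits into per-label percentages with a softmax. The dependency-free sketch below shows that step in isolation so it can be checked without loading a model. The helper name `logits_to_percentages` and the logit values are made up for illustration; they are not model output.

```python
import math


def logits_to_percentages(logits, label_names):
    """Softmax over raw logits, reported as percentages rounded to one decimal."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {name: round(e / total * 100, 1) for name, e in zip(label_names, exps)}


# Hypothetical logits for illustration only.
example_logits = [-1.2, 0.3, 2.5]
label_names = ["entailment", "neutral", "contradiction"]

scores = logits_to_percentages(example_logits, label_names)
print(scores)
print("predicted label:", max(scores, key=scores.get))
```

Note that `zip` silently stops at the shorter sequence, so a checkpoint whose classification head has a different number of outputs than `label_names` would yield an incomplete dictionary rather than an error.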

Citation

If you find DeBERTa useful for your work, please cite the following papers:

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}