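DeBERTa improves on the BERT and RoBERTa models with disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on a majority of natural language understanding (NLU) tasks when trained on 80GB of data.
In DeBERTa V3, we further improved the efficiency of DeBERTa by using ELECTRA-style pre-training with gradient-disentangled embedding sharing. Compared to DeBERTa, the V3 version significantly improves model performance on downstream tasks. More technical details about the new model are in our paper (arXiv:2111.09543).
Please check the official repository for more implementation details and updates.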
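
To make the disentangled attention mentioned above more concrete, here is a minimal single-head NumPy sketch based on the description in the DeBERTa paper: the attention score is the sum of a content-to-content, a content-to-position, and a position-to-content term, scaled by `sqrt(3d)`. The function names, tensor shapes, and random inputs are illustrative assumptions; this is not the model's actual implementation, which is multi-head and lives in the official repository.

```python
import numpy as np


def relative_positions(seq_len: int, k: int) -> np.ndarray:
    """Bucketed relative distance delta(i, j) in [0, 2k), clipped at +/- k."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)


def disentangled_attention(q_c, k_c, v_c, q_r, k_r, k):
    """Single-head sketch. q_c, k_c, v_c: (seq_len, d) content projections;
    q_r, k_r: (2k, d) projections of the relative-position embeddings."""
    seq_len, d = q_c.shape
    delta = relative_positions(seq_len, k)

    c2c = q_c @ k_c.T                                        # content -> content
    c2p = np.take_along_axis(q_c @ k_r.T, delta, axis=1)     # content -> position
    p2c = np.take_along_axis(k_c @ q_r.T, delta, axis=1).T   # position -> content

    scores = (c2c + c2p + p2c) / np.sqrt(3 * d)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_c, weights


# Toy inputs (hypothetical sizes, random values).
rng = np.random.default_rng(0)
seq_len, d, k = 6, 8, 4
q_c, k_c, v_c = (rng.normal(size=(seq_len, d)) for _ in range(3))
q_r, k_r = (rng.normal(size=(2 * k, d)) for _ in range(2))

output, weights = disentangled_attention(q_c, k_c, v_c, q_r, k_r, k)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` is a probability distribution over the sequence, and the position terms depend only on the clipped relative distance between tokens, not on their absolute positions.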
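The DeBERTa V3 xsmall model has 12 layers and a hidden size of 384. Its backbone has only 22M parameters, and its 128K-token vocabulary adds 48M parameters in the embedding layer. The model was trained on the same 160GB of data as DeBERTa V2.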
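
As a rough sanity check on the embedding figure above, the size of a token-embedding matrix is simply vocabulary size times hidden size. The sketch below assumes "128K" means exactly 128,000 tokens, which is an assumption; the tokenizer's real vocabulary size may differ slightly, so the result only lands in the same range as the quoted 48M rather than matching it exactly.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a (vocab_size x hidden_size) token-embedding matrix."""
    return vocab_size * hidden_size


# Assumption: "128K" is taken as exactly 128,000 tokens.
vocab_size = 128_000
hidden_size = 384
backbone_params = 22_000_000  # rounded backbone figure quoted above

emb = embedding_params(vocab_size, hidden_size)
total = emb + backbone_params

print(f"embedding parameters: {emb / 1e6:.1f}M")
print(f"embedding share of all parameters: {emb / total:.0%}")
```

Under these assumptions the embedding table holds roughly two thirds of all weights, which is why the table below reports backbone parameters separately.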
We report dev-set results on the SQuAD 2.0 and MNLI tasks.
| Model | Vocabulary (K) | Backbone #Params (M) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (ACC) |
|---|---|---|---|---|
| RoBERTa-base | 50 | 86 | 83.7/80.5 | 87.6/- |
| XLNet-base | 32 | 92 | -/80.2 | 86.8/- |
| ELECTRA-base | 30 | 86 | -/80.5 | 88.8/- |
| DeBERTa-base | 50 | 100 | 86.2/83.1 | 88.8/88.5 |
| DeBERTa-v3-large | 128 | 304 | 91.5/89.0 | 91.8/91.9 |
| DeBERTa-v3-base | 128 | 86 | 88.4/85.4 | 90.6/90.7 |
| DeBERTa-v3-small | 128 | 44 | 82.8/80.4 | 88.3/87.7 |
| DeBERTa-v3-xsmall | 128 | 22 | 84.8/82.0 | 88.1/88.3 |
| DeBERTa-v3-xsmall+SiFT | 128 | 22 | -/- | 88.4/88.5 |

Fine-tuning with Hugging Face Transformers (MNLI example):

```bash
#!/bin/bash
# Fine-tune DeBERTa-v3-xsmall on MNLI with the run_glue.py example script.

cd transformers/examples/pytorch/text-classification/

pip install datasets

export TASK_NAME=mnli

output_dir="ds_results"
num_gpus=8
batch_size=8  # per-device training batch size

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-xsmall \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 1000 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 4.5e-5 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
```
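
To see what the settings in the script above imply for the training schedule, the following sketch works out the effective batch size, the number of optimizer steps, and the share of training spent in warmup. It assumes the MNLI training split has 392,702 examples and that gradient accumulation stays at its default of 1 (the script does not set it); both are assumptions, not values stated in this document.

```python
import math

# Values taken from the fine-tuning script.
num_gpus = 8
per_device_batch_size = 8
num_train_epochs = 3
warmup_steps = 1000

# Assumption: size of the MNLI training split in GLUE.
mnli_train_examples = 392_702

effective_batch_size = num_gpus * per_device_batch_size
steps_per_epoch = math.ceil(mnli_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup share of training: {warmup_fraction:.1%}")
```

When running on fewer GPUs, raising the per-device batch size so that the product stays at 64 keeps the schedule comparable, memory permitting.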

Inference example with openMind (runs on an NPU if one is available, otherwise on CPU):

```python
import argparse

import torch
from openmind import AutoModelForSequenceClassification, AutoTokenizer, is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/deberta-v3-xsmall",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use an NPU when one is available, otherwise fall back to CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
    )

    premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
    hypothesis = "The movie was good."

    # Encode the premise/hypothesis pair and move every tensor to the model's device.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    with torch.no_grad():
        output = model(**inputs)

    prediction = torch.softmax(output["logits"][0], -1).tolist()
    # This label order assumes a checkpoint fine-tuned for NLI with this mapping;
    # check model.config.id2label for the checkpoint you actually load.
    label_names = ["entailment", "neutral", "contradiction"]
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    print(prediction)


if __name__ == "__main__":
    main()
```

If you find DeBERTa useful for your work, please cite the following papers:

```bibtex
@misc{he2021debertav3,
  title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
  author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
  year={2021},
  eprint={2111.09543},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{he2021deberta,
  title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
  author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=XPZIaotutsD}
}
```