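In recent years, pre-trained language models (PLMs) have achieved state-of-the-art performance across a wide range of natural language processing (NLP) tasks. Although the first models were trained on general-domain data, specialized models have since emerged to handle specific domains more effectively. This paper presents an original study of PLMs for the French medical domain. For the first time, we compare the performance of PLMs trained on public data from the web with that of PLMs trained on private data from healthcare institutions. We also evaluate different learning strategies on a set of biomedical tasks. Finally, we release DrBERT, the first specialized PLMs for the French biomedical domain, together with the largest freely licensed corpus of medical data on which these models were trained.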
Load the model and tokenizer:
import argparse

from openmind import pipeline, is_torch_npu_available


def parse_args():
    # Read the model location from the command line.
    parser = argparse.ArgumentParser(description="Eval the model")
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="path or model",
        default="ChongqingAscend/DrBERT_7GB",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path
    # Use an Ascend NPU when one is available, otherwise fall back to the CPU.
    device = "npu" if is_torch_npu_available() else "cpu"
    # The same path is used for both the model weights and the tokenizer.
    fill_mask = pipeline("fill-mask", model=model_path, tokenizer=model_path, device=device)
    # Predict the token hidden behind <mask> in a French clinical sentence.
    results = fill_mask("La patiente est atteinte d'une <mask>")
    print(results)


if __name__ == "__main__":
    main()
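The script above prints the raw pipeline output. The sketch below shows one way to turn that output into a short ranked list. It assumes the openmind `fill-mask` pipeline returns the same structure as the Hugging Face Transformers one (a list of dicts with `score`, `token_str`, and `sequence` keys); this has not been confirmed against the openmind documentation. The helper `top_predictions` and the `example_results` entries are hypothetical, written by hand for illustration, and are not real DrBERT predictions.

```python
from typing import Any, Dict, List, Tuple


def top_predictions(results: List[Dict[str, Any]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output.

    Assumes each entry carries "score" and "token_str" keys, as in the
    Hugging Face Transformers fill-mask pipeline.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), float(r["score"])) for r in ranked[:k]]


# Hand-written placeholder entries, NOT actual model output.
example_results = [
    {"score": 0.10, "token_str": " tumeur",
     "sequence": "La patiente est atteinte d'une tumeur"},
    {"score": 0.25, "token_str": " infection",
     "sequence": "La patiente est atteinte d'une infection"},
    {"score": 0.05, "token_str": " pneumonie",
     "sequence": "La patiente est atteinte d'une pneumonie"},
]

# Print the two most likely fillers with their scores.
for token, score in top_predictions(example_results, k=2):
    print(f"{token}\t{score:.2f}")
```

In the script, `top_predictions(results)` could be called in place of `print(results)` if the returned structure matches the assumption above.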
@inproceedings{labrak2023drbert,
title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
month = jul,
year = 2023,
address = {Toronto, Canada},
publisher = {Association for Computational Linguistics}
}