
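DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical Domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized models have emerged to handle specific domains more effectively. In this paper, we present an original study of PLMs for the medical domain in French. We compare, for the first time, the performance of PLMs trained on public data from the web with that of PLMs trained on private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. Finally, we release DrBERT, the first specialized PLMs for the French biomedical domain, together with the largest freely licensed medical corpus on which these models are trained.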

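1. DrBERT models

DrBERT is a French RoBERTa model trained on NACHOS, an open-source corpus of French medical text crawled from the web. Using the Jean Zay French supercomputer of the French National Centre for Scientific Research (CNRS), we trained models on varying amounts of data drawn from different public and private sources. To prevent any leakage of personal information and to comply with the European GDPR, only the weights of the models trained on purely open-source data are publicly released:

Model name	Corpus	Number of layers	Attention heads	Embedding dimension	Sequence length	Model URL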
`DrBERT-7-GB-cased-Large`	NACHOS 7 GB	24	16	1024	512	HuggingFace
`DrBERT-7-GB-cased`	NACHOS 7 GB	12	12	768	512	HuggingFace
`DrBERT-4-GB-cased`	NACHOS 4 GB	12	12	768	512	HuggingFace
`DrBERT-4-GB-cased-CP-CamemBERT`	NACHOS 4 GB	12	12	768	512	HuggingFace
`DrBERT-4-GB-cased-CP-PubMedBERT`	NACHOS 4 GB	12	12	768	512	HuggingFace

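2. Using DrBERT

You can use DrBERT with the Hugging Face Transformers library as follows.

Load the model and tokenizer (the script below imports the Auto classes from the openmind package and selects an NPU when one is available):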

from openmind import AutoModelForSequenceClassification, AutoTokenizer, is_torch_npu_available
import torch
import argparse
import time


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/DrBERT-7GB",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    model_path = args.model_name_or_path

    # Use an NPU when one is available, otherwise fall back to the CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"

    start_time = time.time()
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
    )

    premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
    hypothesis = "The movie was good."

    # Tokenize the sentence pair and move all tensors (input IDs and attention mask) to the device.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(**inputs)

    # NOTE: these label names only make sense for a checkpoint fine-tuned for NLI.
    # If model_path has no trained classification head, the head is newly
    # initialized, its size may not match this list, and the scores are not meaningful.
    label_names = ["entailment", "neutral", "contradiction"]
    probabilities = torch.softmax(output["logits"][0].float(), -1).tolist()
    prediction = {name: round(float(prob) * 100, 1) for prob, name in zip(probabilities, label_names)}
    print(prediction)

    end_time = time.time()
    # The timer starts before loading, so it covers model loading as well as inference.
    print(f"Hardware: {device}, load and inference time: {end_time - start_time:.2f} seconds")


if __name__ == "__main__":
    main()

Run a fill-mask task:

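from transformers import pipeline

# Build a fill-mask pipeline backed by the DrBERT checkpoint and its tokenizer.
fill_mask = pipeline("fill-mask", model="Dr-BERT/DrBERT-7GB", tokenizer="Dr-BERT/DrBERT-7GB")
# Each entry in results is one candidate completion for the <mask> token.
results = fill_mask("La patiente est atteinte d'une <mask>")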
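The `fill-mask` pipeline returns a list of candidate completions, each a dictionary with keys such as `score`, `token_str`, and `sequence`. As a minimal sketch, the hypothetical helper `format_predictions` below (not part of the DrBERT repository) turns such a list into readable lines. The sample data is made up to show the shape of the output and is not real model output.

```python
def format_predictions(results, top_k=3):
    """Return one 'token (probability)' line per candidate, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [f"{r['token_str'].strip()} ({r['score']:.1%})" for r in ranked[:top_k]]


# Made-up entries in the shape returned by the fill-mask pipeline (not real model output).
sample = [
    {"score": 0.12, "token_str": " infection", "sequence": "La patiente est atteinte d'une infection"},
    {"score": 0.30, "token_str": " pneumonie", "sequence": "La patiente est atteinte d'une pneumonie"},
]

for line in format_predictions(sample):
    print(line)
```

With a real pipeline, `format_predictions(results)` can be called directly on the `results` variable from the snippet above.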

3. Pre-training the DrBERT tokenizer and model from scratch with the HuggingFace Transformers library

3.1 Install dependencies

accelerate @ git+https://github.com/huggingface/accelerate@66edfe103a0de9607f9b9fdcf6a8e2132486d99b
datasets==2.6.1
sentencepiece==0.1.97
protobuf==3.20.1
evaluate==0.2.2
tensorboard==2.11.0
torch >= 1.3
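Because most of these packages are pinned to exact versions, it can help to confirm what is actually installed before launching a long job. The sketch below is one possible check using only the standard library; the helper name `check_pins` is hypothetical, and the `accelerate` Git pin and the `torch >= 1.3` range are left out because they are not exact version pins.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the dependency list above.
PINNED = {
    "datasets": "2.6.1",
    "sentencepiece": "0.1.97",
    "protobuf": "3.20.1",
    "evaluate": "0.2.2",
    "tensorboard": "2.11.0",
}


def check_pins(pinned, get_version=version):
    """Map each package to 'ok', 'missing', or 'found X, expected Y'."""
    report = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == expected else f"found {found}, expected {expected}"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINNED).items():
        print(f"{name}: {status}")
```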

3.2 Download the NACHOS dataset text file

Download the full NACHOS dataset from Zenodo and place it in the from_scratch or continued_pretraining directory.

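3.3 Build your own tokenizer from scratch based on NACHOS

Note: this step is required only for from-scratch pre-training. For continued pre-training, you only need to download the model and tokenizer that correspond to the model you want to continue training from. In that case, go to the HuggingFace Hub and select a model (for example RoBERTa-base). Then click the Use In Transformers button to get the Git link, git clone https://huggingface.co/roberta-base, and download the entire model/tokenizer repository.

Run ./build_tokenizer.sh to build the tokenizer from scratch from the data in the ./corpus.txt file.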

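3.4 Preprocessing and tokenization of the dataset

First, replace the tokenizer_path field in the shell script so that it matches the path of your tokenizer directory, either downloaded earlier through HuggingFace Git or built yourself.

Run ./preprocessing_dataset.sh to generate the tokenized dataset with the given tokenizer.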
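The contents of preprocessing_dataset.sh are not shown here. As an illustration only: preprocessing for masked language modeling commonly concatenates the tokenized documents and cuts them into fixed-length blocks, as the `group_texts` step of the HuggingFace `run_mlm.py` example does. The sketch below shows that idea in plain Python on toy token IDs. The tiny block size and the toy IDs are assumptions for readability, and the actual script may work differently.

```python
from itertools import chain


def group_texts(tokenized_docs, block_size):
    """Concatenate token-ID lists and split them into full blocks of block_size.

    Trailing tokens that do not fill a complete block are dropped.
    """
    flat = list(chain.from_iterable(tokenized_docs))
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Toy token IDs standing in for three tokenized documents (illustrative values only).
docs = [[5, 101, 102, 6], [5, 201, 6], [5, 301, 302, 303, 6]]
blocks = group_texts(docs, block_size=4)
print(blocks)
```

In a real run the block size would be tied to the model's maximum sequence length rather than 4.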

3.5 Model training

First, in the shell script run_training.sh, change the number of GPUs, --ntasks=128, to match your computing capabilities. In our case, we used 128 V100 32 GB GPUs from 32 nodes of 4 GPUs each (--ntasks-per-node=4 and --gres=gpu:4) for 20 hours (--time=20:00:00).
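The flag values above follow from simple arithmetic: one task per GPU, so the task count is the number of nodes times the GPUs per node. The sketch below, with the hypothetical helper `slurm_resources`, derives the four flags quoted in this section for a different cluster size. The GPU-hour figure assumes the job runs for the full time limit.

```python
def slurm_resources(nodes, gpus_per_node, hours):
    """Derive SLURM flags and total GPU-hours for a job with one task per GPU."""
    ntasks = nodes * gpus_per_node
    return {
        "flags": [
            f"--ntasks={ntasks}",
            f"--ntasks-per-node={gpus_per_node}",
            f"--gres=gpu:{gpus_per_node}",
            f"--time={hours:02d}:00:00",
        ],
        # Upper bound: assumes the whole time limit is consumed.
        "gpu_hours": ntasks * hours,
    }


# The configuration described above: 32 nodes with 4 GPUs each, for 20 hours.
described_setup = slurm_resources(nodes=32, gpus_per_node=4, hours=20)
print(" ".join(described_setup["flags"]))
print(described_setup["gpu_hours"])
```

For example, `slurm_resources(nodes=2, gpus_per_node=4, hours=20)` gives `--ntasks=8` for a smaller allocation.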

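If you are using Jean Zay, you also need to change the -A flag to match one of your @gpu profiles that is allowed to run jobs. You also need to move all datasets, tokenizers, scripts, and outputs to the $SCRATCH disk space to avoid causing IO issues for other users.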

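3.5.1 Pre-training from scratch

Once the SLURM parameters are updated, change the name of the model architecture in the --model_type="camembert" flag and update --config_overrides= according to the specifications of the architecture you are trying to train. In our case, RoBERTa had a sequence length of 514 and a vocabulary of 32005 tokens (32K tokens from the tokenizer and 5 from the model architecture), and the identifiers of the beginning-of-sentence (BOS) and end-of-sentence (EOS) tokens are 5 and 6 respectively.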
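To make the numbers above concrete, the sketch below assembles a --config_overrides value from them. The key names (`vocab_size`, `max_position_embeddings`, `bos_token_id`, `eos_token_id`) are standard Transformers configuration fields, but whether run_training.sh sets exactly these keys is an assumption, as is reading "32K" as exactly 32,000. The helper `build_config_overrides` is hypothetical.

```python
def build_config_overrides(overrides):
    """Join key=value pairs into the comma-separated form taken by --config_overrides."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


tokenizer_vocab = 32000   # "32K" tokenizer tokens, read here as exactly 32,000 (assumption)
model_special_tokens = 5  # tokens added by the model architecture

overrides = {
    "vocab_size": tokenizer_vocab + model_special_tokens,  # 32005
    "max_position_embeddings": 514,                        # sequence length
    "bos_token_id": 5,
    "eos_token_id": 6,
}

print(f'--config_overrides="{build_config_overrides(overrides)}"')
```

If you train a different architecture, replace these values with the ones from its own configuration before submitting the job.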

Then, go to the ./from_scratch/ directory.

Run sbatch ./run_training.sh to submit the training job to the SLURM queue.

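3.5.2 Continued pre-training

Once the SLURM parameters are updated, change the paths of the model and tokenizer you want to start from, --model_name_or_path= / --tokenizer_name=, to the path of the model downloaded from HuggingFace's Git in section 3.3.

Then, go to the ./continued_pretraining/ directory.

Run sbatch ./run_training.sh to submit the training job to the SLURM queue.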

4. Fine-tuning on a downstream task

You only need to change the model name to Dr-BERT/DrBERT-7GB in any of the fine-tuning examples provided by the HuggingFace team.

Citation BibTeX

@inproceedings{labrak2023drbert,
    title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
    author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
    booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
    month = jul,
    year = 2023,
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics}
}