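RAD-DINO is a vision transformer model trained to encode chest X-rays using the self-supervised learning method DINOv2.
The model is described in detail in the paper Exploring scalable medical image encoders beyond text supervision (F. Pérez-García, H. Sharma, S. Bond-Taylor, et al., 2024), published in Nature Machine Intelligence.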
Base model: dinov2-base.
RAD-DINO is shared for research purposes only. It is not meant to be used for clinical diagnosis.
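RAD-DINO is a vision backbone that can be plugged into models for downstream tasks. Typical uses include image classification (with a classifier trained on top of the CLS token).
Fine-tuning is typically not needed to obtain good performance on most downstream tasks.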
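Training a classifier on frozen CLS embeddings is often called linear probing. The following is a minimal sketch of the idea, not the evaluation code used for RAD-DINO: it fits a binary logistic-regression probe with plain gradient descent on toy 2-dimensional features that stand in for the 768-dimensional CLS embeddings. The function names, the toy data, and the hyperparameters are all assumptions made for illustration.

```python
import math
import random


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe on frozen features (batch gradient descent)."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            error = prob - y  # gradient of the log loss with respect to the logit
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Only the probe parameters are updated; the features stay fixed.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if logit > 0 else 0


# Toy stand-ins for frozen CLS embeddings (hypothetical data, two well-separated classes).
rng = random.Random(0)
class0 = [[rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)] for _ in range(20)]
class1 = [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(20)]
features = class0 + class1
labels = [0] * 20 + [1] * 20

weights, bias = train_linear_probe(features, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(features, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

In practice the features would be the CLS embeddings produced by the encoder, and a library implementation of logistic regression or a single linear layer would replace the hand-written loop.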
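RAD-DINO was trained with data from only three countries, so it may be biased towards the population in the training data. Underlying biases of the training datasets have not been well characterized.
First, let us write an auxiliary function to download a chest X-ray.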
>>> import requests
>>> from PIL import Image
>>> def download_sample_image() -> Image.Image:
...     """Download chest X-ray with CC license."""
...     base_url = "https://upload.wikimedia.org/wikipedia/commons"
...     image_url = f"{base_url}/2/20/Chest_X-ray_in_influenza_and_Haemophilus_influenzae.jpg"
...     headers = {"User-Agent": "RAD-DINO"}
...     response = requests.get(image_url, headers=headers, stream=True)
...     return Image.open(response.raw)
...
Now let us download the model and encode an image.
>>> import torch
>>> from transformers import AutoModel
>>> from transformers import AutoImageProcessor
>>>
>>> # Download the model
>>> repo = "microsoft/rad-dino"
>>> rad_dino = AutoModel.from_pretrained(repo)
>>>
>>> # The processor takes a PIL image, performs resizing, center-cropping, and
>>> # intensity normalization using stats from MIMIC-CXR, and returns a
>>> # dictionary with a PyTorch tensor ready for the encoder
>>> processor = AutoImageProcessor.from_pretrained(repo)
>>>
>>> # Download and preprocess a chest X-ray
>>> image = download_sample_image()
>>> image.size # (width, height)
(2765, 2505)
>>> inputs = processor(images=image, return_tensors="pt")
>>>
>>> # Encode the image!
>>> with torch.inference_mode():
...     outputs = rad_dino(**inputs)
...
>>>
>>> # Look at the CLS embeddings
>>> cls_embeddings = outputs.pooler_output
>>> cls_embeddings.shape # (batch_size, num_channels)
torch.Size([1, 768])
If we are interested in the feature maps, we can reshape the patch embeddings into a grid.
We will use einops (install with pip install einops) for this.
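The grid size that this reshape relies on can be checked by hand first. The sketch below assumes a 518-pixel square crop and 14-pixel patches (the ViT-B/14 configuration); both numbers are assumptions here, and in real code they should be read from the processor and the model configuration. The helper name is hypothetical.

```python
def patch_grid_side(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square input image."""
    return image_size // patch_size


# Assumed values: 518-pixel square crop, 14-pixel patches.
side = patch_grid_side(518, 14)
num_patch_tokens = side * side      # tokens that form the feature map
num_tokens = num_patch_tokens + 1   # plus one CLS token at the front

print(side, num_patch_tokens, num_tokens)  # 37 1369 1370
```

The einops-based helper that follows performs the same split of the flat token axis into a height and a width of this size.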
>>> def reshape_patch_embeddings(flat_tokens: torch.Tensor) -> torch.Tensor:
...     """Reshape flat list of patch tokens into a nice grid."""
...     from einops import rearrange
...     image_size = processor.crop_size["height"]
...     patch_size = rad_dino.config.patch_size  # the encoder loaded above
...     embeddings_size = image_size // patch_size
...     patches_grid = rearrange(flat_tokens, "b (h w) c -> b c h w", h=embeddings_size)
...     return patches_grid
...
>>> flat_patch_embeddings = outputs.last_hidden_state[:, 1:] # first token is CLS
>>> reshaped_patch_embeddings = reshape_patch_embeddings(flat_patch_embeddings)
>>> reshaped_patch_embeddings.shape # (batch_size, num_channels, height, width)
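torch.Size([1, 768, 37, 37])
We have released checkpoints compatible with the original DINOv2 code to help researchers fine-tune our model.
First, let us write code to load a safetensors checkpoint.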
>>> from safetensors import safe_open
>>> def safetensors_to_state_dict(checkpoint_path: str) -> dict[str, torch.Tensor]:
...     state_dict = {}
...     with safe_open(checkpoint_path, framework="pt") as ckpt_file:
...         for key in ckpt_file.keys():
...             state_dict[key] = ckpt_file.get_tensor(key)
...     return state_dict
...
We can now use the hub model and load the RAD-DINO weights. Let us clone the DINOv2 repository so that we can import the code for the head.
git clone https://github.com/facebookresearch/dinov2.git
cd dinov2
>>> import torch
>>> rad_dino_gh = torch.hub.load(".", "dinov2_vitb14")
>>> backbone_state_dict = safetensors_to_state_dict("backbone_compatible.safetensors")
>>> rad_dino_gh.load_state_dict(backbone_state_dict, strict=True)
<All keys matched successfully>
The weights of the head are also released:
>>> from dinov2.layers import DINOHead
>>> rad_dino_head_gh = DINOHead(
...     in_dim=768,
...     out_dim=65536,
...     hidden_dim=2048,
...     bottleneck_dim=256,
...     nlayers=3,
... )
>>> head_state_dict = safetensors_to_state_dict("dino_head.safetensors")
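>>> rad_dino_head_gh.load_state_dict(head_state_dict, strict=True)
<All keys matched successfully>
The configuration files ssl_default_config.yaml and vitb14_cxr.yaml, together with the augmentations module, are also released to help researchers reproduce the training procedure with our hyperparameters.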
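The RAD-DINO checkpoint released here was trained on five public, de-identified chest X-ray datasets.
Images in the validation and test sets used to train MAIRA were excluded from the RAD-DINO training set. The list of image files used for training is available at ./training_images.csv.
Note that this checkpoint differs from the one in the paper, which was trained with some private data (and fewer GPUs). The checkpoint released here was trained for 35,000 iterations. The full run had 100,000 iterations, and we selected this checkpoint using linear probing on the validation sets of the evaluation datasets described in the paper. Training used 16 nodes with four A100 GPUs each and a batch size of 40 images per GPU.
See the paper for a detailed description of the training procedure.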
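The training figures above imply a global batch size, which can be worked out as follows. This is a back-of-the-envelope sketch that assumes one optimizer step per iteration with no gradient accumulation; it counts image samples drawn during training, not unique images or augmented crops.

```python
# Figures stated for the released checkpoint.
nodes = 16
gpus_per_node = 4
batch_size_per_gpu = 40
iterations = 35_000

num_gpus = nodes * gpus_per_node
global_batch_size = num_gpus * batch_size_per_gpu
# Assumption: every iteration consumes exactly one global batch.
samples_drawn = iterations * global_batch_size

print(num_gpus, global_batch_size, samples_drawn)  # 64 2560 89600000
```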
All DICOM files were resized with B-spline interpolation so that their shorter side was 518 pixels, min-max scaled to [0, 255], and stored as PNG files.
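The arithmetic behind these two steps can be sketched in plain Python. This is an illustration only, not the preprocessing code used for RAD-DINO: it assumes the aspect ratio is preserved and that values are rounded to the nearest integer, and it leaves out the B-spline interpolation itself. Both function names are hypothetical.

```python
def resized_size(width: int, height: int, shorter_side: int = 518) -> tuple[int, int]:
    """Target (width, height) so that the shorter side equals `shorter_side`."""
    scale = shorter_side / min(width, height)
    # Assumption: aspect ratio is preserved and sizes are rounded to the nearest pixel.
    return round(width * scale), round(height * scale)


def min_max_to_uint8(values: list[float]) -> list[int]:
    """Linearly map intensities so the minimum becomes 0 and the maximum becomes 255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant image has no dynamic range; map everything to 0.
        return [0 for _ in values]
    return [round((v - lo) / (hi - lo) * 255) for v in values]


# Example dimensions borrowed from the sample image shown earlier in this document.
print(resized_size(2765, 2505))           # (572, 518)
# Example 12-bit intensities (hypothetical values).
print(min_max_to_uint8([0, 2048, 4095]))  # [0, 128, 255]
```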
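Our evaluation is described in detail in the manuscript.
RAD-DINO was trained on Azure Machine Learning.
We used 16 Standard_NC96ads_A100_v4 nodes, each with four NVIDIA A100 (80 GB) GPUs.
The training code is based on DINOv2. DICOM files were processed with SimpleITK and Pydicom.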
BibTeX:
@article{perez-garcia_exploring_2025,
title = {Exploring scalable medical image encoders beyond text supervision},
issn = {2522-5839},
url = {https://doi.org/10.1038/s42256-024-00965-w},
doi = {10.1038/s42256-024-00965-w},
journal = {Nature Machine Intelligence},
author = {P{\'e}rez-Garc{\'i}a, Fernando and Sharma, Harshita and Bond-Taylor, Sam and Bouzid, Kenza and Salvatelli, Valentina and Ilse, Maximilian and Bannur, Shruthi and Castro, Daniel C. and Schwaighofer, Anton and Lungren, Matthew P. and Wetscherek, Maria Teodora and Codella, Noel and Hyland, Stephanie L. and Alvarez-Valle, Javier and Oktay, Ozan},
month = jan,
year = {2025},
}
APA:
Pérez-García, F., Sharma, H., Bond-Taylor, S., Bouzid, K., Salvatelli, V., Ilse, M., Bannur, S., Castro, D. C., Schwaighofer, A., Lungren, M. P., Wetscherek, M. T., Codella, N., Hyland, S. L., Alvarez-Valle, J., & Oktay, O. (2025). Exploring scalable medical image encoders beyond text supervision. In Nature Machine Intelligence. Springer Science and Business Media LLC. https://doi.org/10.1038/s42256-024-00965-w
Fernando Pérez-García (fperezgarcia@microsoft.com).