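Model card for RAD-DINO

Model description

RAD-DINO is a vision transformer model trained to encode chest X-rays using the self-supervised learning method DINOv2.

RAD-DINO is described in detail in "Exploring Scalable Medical Image Encoders Beyond Text Supervision" (F. Pérez-García, H. Sharma, S. Bond-Taylor, et al., 2024), published in Nature Machine Intelligence.

Developed by: Microsoft Health Futures
Model type: Vision transformer
License: MSRLA
Finetuned from model: dinov2-base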

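Uses

RAD-DINO is shared for research purposes only. It is not meant to be used for clinical practice.

The model is a vision backbone that can be plugged into other models for downstream tasks. Some potential uses are:

Image classification, with a classifier trained on top of the CLS token
Image segmentation, with a decoder trained using the patch tokens
Clustering, using the image embeddings directly
Image retrieval, using nearest neighbors of the CLS token
Report generation, with a language model to decode text

Fine-tuning RAD-DINO is typically not necessary to obtain good performance in downstream tasks.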
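
As an illustration of the image retrieval use case, the following is a minimal sketch of nearest-neighbor search over CLS embeddings using cosine similarity. The function name nearest_neighbors and the random tensors are hypothetical stand-ins for illustration; in practice the embeddings would come from the encoder's pooled output, and the choice of cosine similarity is an assumption, not something this model card prescribes.

```python
import torch
import torch.nn.functional as F


def nearest_neighbors(query: torch.Tensor, gallery: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return indices of the k gallery embeddings most similar to each query.

    query:   (num_queries, dim) CLS embeddings
    gallery: (num_gallery, dim) CLS embeddings
    """
    # L2-normalize so that the dot product equals cosine similarity
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery, dim=-1)
    similarity = query @ gallery.T  # (num_queries, num_gallery)
    return similarity.topk(k, dim=-1).indices


# Random stand-ins for real embeddings (768 channels, like the CLS output)
torch.manual_seed(0)
gallery = torch.randn(10, 768)
# The query is a slightly perturbed copy of gallery item 4
query = gallery[[4]] + 0.01 * torch.randn(1, 768)
indices = nearest_neighbors(query, gallery, k=3)
print(indices.shape)  # (num_queries, k)
```

For large galleries, a dedicated approximate nearest-neighbor index would likely be preferable to the dense matrix product used here.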

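Biases, risks, and limitations

RAD-DINO was trained with data from three countries, so it may be biased towards the populations represented in the training data. Underlying biases of the training datasets may not be well characterized.

Getting started

Getting some data

Let us first write an auxiliary function to download a chest X-ray.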

>>> import requests
>>> from PIL import Image
>>> def download_sample_image() -> Image.Image:
...     """Download chest X-ray with CC license."""
...     base_url = "https://upload.wikimedia.org/wikipedia/commons"
...     image_url = f"{base_url}/2/20/Chest_X-ray_in_influenza_and_Haemophilus_influenzae.jpg"
...     headers = {"User-Agent": "RAD-DINO"}
...     response = requests.get(image_url, headers=headers, stream=True)
...     return Image.open(response.raw)
...

Loading the model

Now let us download the model and encode an image.

>>> import torch
>>> from transformers import AutoModel
>>> from transformers import AutoImageProcessor
>>>
>>> # Download the model
>>> repo = "microsoft/rad-dino"
>>> rad_dino = AutoModel.from_pretrained(repo)
>>>
>>> # The processor takes a PIL image, performs resizing, center-cropping, and
>>> # intensity normalization using stats from MIMIC-CXR, and returns a
>>> # dictionary with a PyTorch tensor ready for the encoder
>>> processor = AutoImageProcessor.from_pretrained(repo)

Encoding an image

>>> # Download and preprocess a chest X-ray
>>> image = download_sample_image()
>>> image.size  # (width, height)
(2765, 2505)
>>> inputs = processor(images=image, return_tensors="pt")
>>>
>>> # Encode the image!
>>> with torch.inference_mode():
...     outputs = rad_dino(**inputs)
>>>
>>> # Look at the CLS embeddings
>>> cls_embeddings = outputs.pooler_output
>>> cls_embeddings.shape  # (batch_size, num_channels)
torch.Size([1, 768])

If we are interested in the feature maps, we can reshape the patch embeddings into a grid. We will use einops (install with pip install einops) for this.

>>> def reshape_patch_embeddings(flat_tokens: torch.Tensor) -> torch.Tensor:
...     """Reshape flat list of patch tokens into a nice grid."""
...     from einops import rearrange
...     image_size = processor.crop_size["height"]
...     patch_size = rad_dino.config.patch_size
...     embeddings_size = image_size // patch_size
...     patches_grid = rearrange(flat_tokens, "b (h w) c -> b c h w", h=embeddings_size)
...     return patches_grid
...
>>> flat_patch_embeddings = outputs.last_hidden_state[:, 1:]  # first token is CLS
>>> reshaped_patch_embeddings = reshape_patch_embeddings(flat_patch_embeddings)
>>> reshaped_patch_embeddings.shape  # (batch_size, num_channels, height, width)
torch.Size([1, 768, 37, 37])
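
The 37 × 37 grid above follows from simple arithmetic on the input size and the patch size. The sketch below reproduces it under two assumptions that are not read from the model here: a square 518-pixel crop and 14-pixel patches (as the name ViT-B/14 suggests), with a single CLS token and no register tokens. The helper patch_grid_shape is hypothetical.

```python
def patch_grid_shape(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (grid_side, num_tokens_including_cls) for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid_side = image_size // patch_size
    num_patch_tokens = grid_side * grid_side
    return grid_side, num_patch_tokens + 1  # +1 for the CLS token


# Assumed values: 518-pixel crop and 14-pixel patches
grid_side, num_tokens = patch_grid_shape(image_size=518, patch_size=14)
print(grid_side, num_tokens)  # 37 1370
```

Under these assumptions, the encoder's last hidden state would hold 1370 tokens per image, of which the 1369 patch tokens are reshaped into the grid.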

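Weights for fine-tuning

We have released a checkpoint compatible with the original DINOv2 code to help researchers fine-tune our model.

First, let us write some code to load a safetensors checkpoint.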

>>> from safetensors import safe_open
>>> def safetensors_to_state_dict(checkpoint_path: str) -> dict[str, torch.Tensor]:
...     state_dict = {}
...     with safe_open(checkpoint_path, framework="pt") as ckpt_file:
...         for key in ckpt_file.keys():
...             state_dict[key] = ckpt_file.get_tensor(key)
...     return state_dict
...

We can now use the hub model and load the RAD-DINO weights into it. Clone the DINOv2 repository so that the code for the head can be imported.

git clone https://github.com/facebookresearch/dinov2.git
cd dinov2

>>> import torch
>>> rad_dino_gh = torch.hub.load(".", "dinov2_vitb14")
>>> backbone_state_dict = safetensors_to_state_dict("backbone_compatible.safetensors")
>>> rad_dino_gh.load_state_dict(backbone_state_dict, strict=True)
<All keys matched successfully>

The weights of the head are also released:

>>> from dinov2.layers import DINOHead
>>> rad_dino_head_gh = DINOHead(
...    in_dim=768,
...    out_dim=65536,
...    hidden_dim=2048,
...    bottleneck_dim=256,
...    nlayers=3,
... )
>>> head_state_dict = safetensors_to_state_dict("dino_head.safetensors")
>>> rad_dino_head_gh.load_state_dict(head_state_dict, strict=True)
<All keys matched successfully>

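Configs and augmentation

The configuration files ssl_default_config.yaml and vitb14_cxr.yaml, and the augmentations module, are also available to help researchers reproduce the training procedure with our hyperparameters.

Training details

Training data

The RAD-DINO checkpoint in this release was trained on images from five public, deidentified chest X-ray datasets:

Dataset	Number of images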
MIMIC-CXR	368,960
CheXpert	223,648
NIH-CXR	112,120
PadChest	136,787
BRAX	41,260
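Total	882,775

Images in the validation and test sets used to train MAIRA were excluded from the training set of RAD-DINO. The list of image files used for training is available at ./training_images.csv.

Note that this checkpoint differs from the one in the paper, which was trained with some private data (and fewer GPUs). The checkpoint released here was trained for 35,000 iterations out of a total of 100,000; it was selected by maximizing linear probing performance on the validation sets of the evaluation datasets described in the paper. Training used 16 nodes with 4 A100 GPUs each and a batch size of 40 images per GPU.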
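
The training scale implied by these figures can be worked out with a few lines of arithmetic. This is a rough sketch that assumes each iteration consumes exactly one global batch (no gradient accumulation) and that all 882,775 listed images were available for sampling; neither assumption is stated in this card.

```python
NUM_NODES = 16
GPUS_PER_NODE = 4
BATCH_PER_GPU = 40
ITERATIONS = 35_000
NUM_TRAINING_IMAGES = 882_775

# Images processed per iteration across all GPUs
global_batch = NUM_NODES * GPUS_PER_NODE * BATCH_PER_GPU
# Total images seen by the released checkpoint
images_seen = global_batch * ITERATIONS
# Approximate number of passes over the training set
approx_epochs = images_seen / NUM_TRAINING_IMAGES

print(global_batch)  # 2560
print(images_seen)  # 89600000
print(round(approx_epochs, 1))  # 101.5
```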

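Training procedure

We refer to the manuscript for a detailed description of the training procedure.

Preprocessing

All DICOM files were resized using B-spline interpolation so that their shorter side was 518 pixels, min-max scaled to [0, 255], and stored as PNG files.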
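
The two numeric steps of this preprocessing can be sketched in plain Python. This is not the released pipeline: the helper names are hypothetical, the rounding of the resized dimensions is an assumption, and the B-spline interpolation itself (which would be done with an imaging library) is omitted.

```python
def shorter_side_resize_shape(width: int, height: int, target: int = 518) -> tuple[int, int]:
    """Return the (width, height) that makes the shorter side equal to target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def min_max_scale(values: list[float], new_max: float = 255.0) -> list[float]:
    """Linearly map intensities so the minimum becomes 0 and the maximum new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant image: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * new_max for v in values]


# Dimensions of a 2765 x 2505 image after resizing
new_size = shorter_side_resize_shape(2765, 2505)
scaled = min_max_scale([0.0, 50.0, 100.0])
print(new_size)  # (572, 518)
print(scaled)  # [0.0, 127.5, 255.0]
```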

Training hyperparameters

Training regime: fp16 mixed precision using PyTorch-FSDP

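Evaluation

Our evaluation is best described in the manuscript.

Environmental impact

Hardware type: NVIDIA A100 GPUs
Hours used: 40 hours/GPU × 16 nodes × 4 GPUs/node = 2560 GPU-hours
Cloud provider: Azure
Compute region: West US 2
Carbon emitted: 222 kg CO₂ eq.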
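
The GPU-hour total above can be checked directly, and dividing the reported emissions by it gives an average emission rate. The per-GPU-hour figure is derived here for illustration only and is not reported by the authors.

```python
HOURS_PER_GPU = 40
NUM_NODES = 16
GPUS_PER_NODE = 4
CARBON_KG_CO2_EQ = 222

# Total compute time summed over all GPUs
gpu_hours = HOURS_PER_GPU * NUM_NODES * GPUS_PER_NODE
# Average emissions per GPU-hour (derived, not reported)
kg_per_gpu_hour = CARBON_KG_CO2_EQ / gpu_hours

print(gpu_hours)  # 2560
print(round(kg_per_gpu_hour, 4))  # 0.0867
```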

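Compute infrastructure

RAD-DINO was trained on Azure Machine Learning.

Hardware

We used 16 Standard_NC96ads_A100_v4 nodes with four NVIDIA A100 (80 GB) GPUs each.

Software

We leveraged the code in DINOv2 for training. We used SimpleITK and Pydicom for processing of DICOM files.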

Citation

BibTeX:

@article{perez-garcia_exploring_2025,
	title = {Exploring scalable medical image encoders beyond text supervision},
	issn = {2522-5839},
	url = {https://doi.org/10.1038/s42256-024-00965-w},
	doi = {10.1038/s42256-024-00965-w},
	journal = {Nature Machine Intelligence},
	author = {P{\'e}rez-Garc{\'i}a, Fernando and Sharma, Harshita and Bond-Taylor, Sam and Bouzid, Kenza and Salvatelli, Valentina and Ilse, Maximilian and Bannur, Shruthi and Castro, Daniel C. and Schwaighofer, Anton and Lungren, Matthew P. and Wetscherek, Maria Teodora and Codella, Noel and Hyland, Stephanie L. and Alvarez-Valle, Javier and Oktay, Ozan},
	month = jan,
	year = {2025},
}

APA:

Pérez-García, F., Sharma, H., Bond-Taylor, S., Bouzid, K., Salvatelli, V., Ilse, M., Bannur, S., Castro, D. C., Schwaighofer, A., Lungren, M. P., Wetscherek, M. T., Codella, N., Hyland, S. L., Alvarez-Valle, J., & Oktay, O. (2025). Exploring scalable medical image encoders beyond text supervision. In Nature Machine Intelligence. Springer Science and Business Media LLC. https://doi.org/10.1038/s42256-024-00965-w

Model card contact

Fernando Pérez-García (fperezgarcia@microsoft.com).