Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov
[BibTex](#citing-radio)
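You can pull the model from a Python script:

    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    hf_repo = "nvidia/RADIO-B"

    # Load the preprocessor and the model (the model code is fetched from the hub).
    image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
    model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
    model.eval().cuda()

    image = Image.open('./assets/radio.png').convert('RGB')
    pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
    pixel_values = pixel_values.cuda()

    # The model returns a (summary, spatial_features) tuple.
    summary, spatial_features = model(pixel_values)

RADIO returns a tuple of two tensors. `summary` is similar to the `cls_token` in ViT and represents the general concept of the entire image. It has shape `(B, C)`, where `B` is the batch dimension and `C` is the number of channels. `spatial_features` carries more localized content and is suited to dense prediction tasks such as semantic segmentation, or to integration with a large language model (LLM). It has shape `(B, T, D)`, where `T` is the number of flattened spatial tokens and `D` is the number of channels of the spatial features. Note that `C` and `D` are generally not equal.
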
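To convert to a spatial tensor format, combine the model's downsampling size with the shape of the input tensor. For 'radio_v1', the patch size is 14.

    from einops import rearrange

    patch_size = 14  # value quoted for 'radio_v1'; use the patch size of the model you loaded
    spatial_features = rearrange(
        spatial_features, 'b (h w) d -> b d h w',
        h=pixel_values.shape[-2] // patch_size,
        w=pixel_values.shape[-1] // patch_size,
    )

The resulting tensor has shape `(B, D, H, W)`, as is typical for computer vision models.
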
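The relationship between the input size, the patch size, and the flattened token axis can be checked without loading the model. The following is a minimal sketch, not part of the RADIO API: `spatial_grid` and `token_to_cell` are hypothetical helper names, the 224x224 input size is an illustrative assumption, and the patch size of 14 is the value quoted above for 'radio_v1'. It assumes the input height and width are multiples of the patch size and that tokens are flattened row by row, as the `'b (h w) d'` pattern implies.

```python
def spatial_grid(height, width, patch_size):
    """Return (rows, cols) of the patch grid for an input of the given size.

    Hypothetical helper: assumes height and width are multiples of patch_size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width should be multiples of patch_size")
    return height // patch_size, width // patch_size


def token_to_cell(index, grid_cols):
    """Map a flattened token index to (row, col), following the '(h w)' order."""
    return divmod(index, grid_cols)


# Illustrative numbers only: a 224x224 input and a patch size of 14.
rows, cols = spatial_grid(224, 224, 14)
num_tokens = rows * cols  # corresponds to T in the (B, T, D) shape
last_cell = token_to_cell(num_tokens - 1, cols)
print(rows, cols, num_tokens, last_cell)  # prints: 16 16 256 (15, 15)
```

If the token count computed this way does not match `spatial_features.shape[1]` for your model, the patch size or the input size assumed here is probably not the one in use.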
The RADIO code and weights are released under the NSCLv1 License.

If you find this repository useful for your research, please consider giving it a star and citing:

    @InProceedings{Ranzinger_2024_CVPR,
        author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
        title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2024},
        pages     = {12490-12500}
    }

    @misc{ranzinger2024phisdistributionbalancinglabelfree,
        title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation},
        author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
        year={2024},
        eprint={2410.01680},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2410.01680},
    }