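A 700M-parameter Vision Transformer (ViT-H) trained with masked autoencoder (MAE) self-supervised learning on web-scale image data without language supervision. The model was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).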
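Web-SSL MAE ViT-H is a 700M-parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images without language supervision. The model shows that, when scaled appropriately, purely visual learning can match or exceed language-supervised models such as CLIP across a range of vision tasks. Web-MAE performs especially well on OCR and chart understanding tasks while remaining competitive on traditional vision benchmarks and multimodal tasks.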
from transformers import AutoImageProcessor, ViTModel
import torch
from PIL import Image
# Adjust the size, crop_size, etc. fields to your liking
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae700m-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae700m-full2b-224').cuda().eval()
# Process an image
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt").to('cuda')
with torch.no_grad():
    outputs = model(**inputs)
# Extract features from the encoder
encoder_hidden_states = outputs.last_hidden_state

@article{fan2025scaling,
title={Scaling Language-Free Visual Representation Learning},
author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
year={2025},
eprint={2504.01017},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
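
To sanity-check the shape of `outputs.last_hidden_state` from the usage snippet, it can help to work out how many tokens the encoder should produce. The helper below is a minimal sketch, not part of the released code: `vit_sequence_length` is a hypothetical name, and the patch size of 14 is an assumption (it is the common choice for ViT-H) that should be confirmed against `model.config.patch_size`. The image size of 224 is taken from the checkpoint name. `ViTModel` prepends one [CLS] token to the patch tokens, which is why the count is the number of patches plus one.

```python
def vit_sequence_length(image_size: int, patch_size: int, has_cls_token: bool = True) -> int:
    """Return the number of tokens a ViT encoder emits for a square image.

    The image is split into non-overlapping patch_size x patch_size patches,
    and an optional [CLS] token is prepended to the patch sequence.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_patches + (1 if has_cls_token else 0)


# Assumed values: 224 comes from the checkpoint name; 14 is an assumed patch
# size and should be replaced with model.config.patch_size.
expected_tokens = vit_sequence_length(image_size=224, patch_size=14)
print(expected_tokens)
```

Under these assumptions, `encoder_hidden_states.shape[1]` should equal `expected_tokens`; a mismatch usually means the processor's `size` or `crop_size` fields, or the patch size, differ from the values assumed here.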