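Web-SSL MAE ViT-H (700M): 2B MetaCLIP data, 224 resolution

A 700M-parameter Vision Transformer (ViT-H) trained with masked autoencoder (MAE) self-supervised learning on web-scale image data without language supervision. The model was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).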

Model Details

Architecture: ViT-H (Huge)
Parameters: 700M
Resolution: 224×224 pixels
Training: Self-supervised Web-MAE on 2B image samples from MetaCLIP web data

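Model Description

Web-SSL MAE ViT-H is a 700M-parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images without language supervision. The model shows that visual-only learning, when scaled appropriately, can match or exceed language-supervised models such as CLIP across a range of vision tasks. Web-MAE performs especially well on OCR and chart understanding tasks while remaining competitive on traditional vision benchmarks and multimodal tasks.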
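For readers unfamiliar with masked autoencoding, the sketch below illustrates the general random-masking step that MAE-style training relies on: most patches are hidden, and only the visible ones are passed to the encoder. It is not the training code for this model. The function name `random_masking`, the 0.75 mask ratio (the value used in the original MAE paper), and the toy tensor sizes are all assumptions made for illustration.

```python
import torch


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75, generator=None):
    """Randomly hide a fraction of patches (illustrative helper, not this model's code).

    patches: (batch, num_patches, dim)
    Returns the visible patches, a binary mask (1 = masked, 0 = visible),
    and the indices needed to restore the original patch order.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # One random score per patch; sorting the scores yields a random permutation.
    noise = torch.rand(batch, num_patches, generator=generator)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # Keep the first num_keep patches of the permutation as the visible set.
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Build the mask in shuffled order, then unshuffle it back to patch order.
    mask = torch.ones(batch, num_patches)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore


# Toy input: 2 images, 16 patches each, 8 features per patch (assumed sizes).
patches = torch.randn(2, 16, 8)
visible, mask, ids_restore = random_masking(patches, mask_ratio=0.75)
print(visible.shape, mask.sum(dim=1))
```

In a full MAE pipeline a lightweight decoder would then reconstruct the masked patches from the encoder output; only the encoder is what the usage snippet in this card loads.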

Usage

from transformers import AutoImageProcessor, ViTModel
import torch
from PIL import Image

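# Use the GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Adjust the size, crop_size, etc. fields to your liking
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae700m-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae700m-full2b-224').to(device).eval()

# Process an image (convert to RGB so grayscale or RGBA files also work)
image = Image.open('path/to/image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt").to(device)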
with torch.no_grad():
    outputs = model(**inputs)

# Extract features from the encoder
encoder_hidden_states = outputs.last_hidden_state
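
The snippet above stops at per-token hidden states. One common way to turn them into a single image embedding is sketched below, assuming the usual `ViTModel` output layout of `(batch, 1 + num_patches, hidden_size)` with the CLS token at index 0. The helper `pool_features` is hypothetical, the random tensor is a stand-in for `outputs.last_hidden_state`, and which pooling works best for this model is not stated in this card.

```python
import torch
import torch.nn.functional as F


def pool_features(last_hidden_state: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Reduce (batch, 1 + num_patches, dim) hidden states to (batch, dim) embeddings.

    Hypothetical helper: "cls" takes the first token, "mean" averages the patch tokens.
    """
    if mode == "cls":
        pooled = last_hidden_state[:, 0]
    elif mode == "mean":
        pooled = last_hidden_state[:, 1:].mean(dim=1)
    else:
        raise ValueError(f"unknown pooling mode: {mode!r}")
    # L2-normalize so that dot products between embeddings are cosine similarities.
    return F.normalize(pooled, dim=-1)


# Stand-in for outputs.last_hidden_state: 2 images, 1 CLS + 16 patch tokens, 8 features.
hidden = torch.randn(2, 17, 8)
embeddings = pool_features(hidden, mode="mean")
similarity = embeddings @ embeddings.T  # pairwise cosine similarity, shape (2, 2)
print(embeddings.shape, similarity)
```

With the real model, passing `encoder_hidden_states` in place of `hidden` should give one embedding per input image that can be fed to a linear probe or a nearest-neighbor search.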

Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning}, 
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
