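VersaViT: a general-purpose vision encoder for building multimodal systems, suited to both language-mediated reasoning and pixel-level understanding tasks. It is optimized with a multi-task collaborative post-training method and supports vision-language understanding, segmentation, and depth estimation. [This summary was AI-generated]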

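🌟 Model Overview

VersaViT is a refined Vision Transformer designed to serve as an efficient, general-purpose vision encoder in multimodal systems. It is tuned with a multi-task collaborative post-training method. VersaViT is suited both to language-mediated reasoning (for example, vision-language understanding when paired with an LLM) and to pixel-level understanding (for example, segmentation and depth estimation).

Quick Start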

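import torch
from PIL import Image
from transformers import AutoImageProcessor
# Local module path (models/versavit); this class is not part of transformers.
from models.versavit import VersaViTPretrainedModel


# Load the image processor and the model weights in bfloat16 on the GPU.
# attn_implementation="flash_attention_2" requires the flash-attn package.
model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map='cuda')

# Preprocess a single image and move the resulting tensors to the GPU.
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
# Encode the image; the processor supplies both the pixel values and the patch grid.
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])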
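
The `image_grid_thw` input passed to `forward_wt_merger` above is not described in this README. As a rough sketch only, assuming it follows the Qwen2-VL convention (one `(t, h, w)` row per image, giving the patch grid along the temporal, height, and width axes) and assuming the merger fuses each `merge_size x merge_size` block of patches into one token, the number of visual tokens per image could be estimated as follows. The helper name `count_visual_tokens` and the default `merge_size=2` are hypothetical and are not part of VersaViT.

```python
from typing import Iterable, List, Tuple


def count_visual_tokens(
    grid_thw: Iterable[Tuple[int, int, int]],
    merge_size: int = 2,  # assumption: 2x2 spatial merge, as in Qwen2-VL
) -> List[int]:
    """Estimate the merged token count for each (t, h, w) patch grid."""
    counts = []
    for t, h, w in grid_thw:
        # Under this assumption the grid must divide evenly into merge blocks.
        if h % merge_size != 0 or w % merge_size != 0:
            raise ValueError(f"grid {(t, h, w)} is not divisible by {merge_size}")
        counts.append(t * (h // merge_size) * (w // merge_size))
    return counts


# One image whose patch grid is 1 x 32 x 32.
print(count_visual_tokens([(1, 32, 32)]))  # prints [256]
```

With real inputs, the rows could be taken from `inputs['image_grid_thw'].tolist()`; comparing the estimate against the actual shape of `outputs` is a quick way to check whether these assumptions hold for this model.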

Citation

If you use this model in your research or projects, please cite:

@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}
