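VersaViT is an optimized vision Transformer designed to serve as an efficient, general-purpose vision encoder in multimodal systems. It is refined with a multi-task collaborative post-training method. VersaViT is suited to both language-mediated reasoning (e.g., vision-language understanding when paired with an LLM) and pixel-level understanding (e.g., segmentation and depth probing).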
import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel
model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map='cuda')
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])

If you use this model in your research or project, please cite:
@article{liu2026versavit,
title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
journal={arXiv preprint arXiv:2602.09934},
year={2026}
}