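TIPSv2 (Text-Image Pretraining with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This release is the Base variant, with 86M vision parameters and 110M text parameters. Try the code snippets below, or see the GitHub repository for more use cases and visualizations, including zero-shot segmentation.

| Variant | Vision params | Text params | Embed dim | DPT Heads |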
|---|---|---|---|---|
| B/14 | 86M | 110M | 768 | B/14-dpt |
| L/14 | 303M | 184M | 1024 | L/14-dpt |
| SO400m/14 | 412M | 448M | 1152 | SO400m/14-dpt |
| g/14 | 1.1B | 389M | 1536 | g/14-dpt |
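
To make the table concrete, the sketch below estimates the shapes returned by `encode_image` for each variant. It assumes that the `/14` suffix denotes a 14-pixel patch size (consistent with the 448×448 input and 1024 patch tokens reported for B/14 in this card) and that the other variants follow the same output layout. `EMBED_DIM`, `PATCH_SIZE`, and `expected_shapes` are hypothetical names used for illustration, not part of the model API.

```python
# Hypothetical helper, not part of the TIPSv2 API.
# Embedding dimensions are copied from the variant table above.
EMBED_DIM = {"B/14": 768, "L/14": 1024, "SO400m/14": 1152, "g/14": 1536}
PATCH_SIZE = 14  # assumption: inferred from the "/14" suffix


def expected_shapes(variant, image_size=448, batch=1):
    """Return the assumed token grid and output shapes for a square input."""
    if image_size % PATCH_SIZE != 0:
        raise ValueError("image_size should be a multiple of the patch size")
    grid = image_size // PATCH_SIZE  # patches per side
    dim = EMBED_DIM[variant]
    return {
        "grid": (grid, grid),
        "cls_token": (batch, 1, dim),
        "patch_tokens": (batch, grid * grid, dim),
    }


shapes = expected_shapes("B/14")
print(shapes["grid"])          # (32, 32)
print(shapes["patch_tokens"])  # (1, 1024, 768)
```

The grid size is what allows the patch tokens to be reshaped into a square spatial map, for example 32×32 for a 448×448 input.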
pip install transformers torch torchvision sentencepiece scikit-learn

from transformers import AutoModel
model = AutoModel.from_pretrained("google/tipsv2-b14", trust_remote_code=True)
model.eval()

Images should be tensors in the [0, 1] range (only `ToTensor()` is needed, no ImageNet normalization).
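
As a rough illustration of the preprocessing note above, the sketch below contrasts plain [0, 1] scaling (what `ToTensor()` does for 8-bit images, dividing by 255) with ImageNet-style normalization, which this model does not expect. The helper names are hypothetical, and the mean and standard deviation are the commonly used ImageNet values, shown only for contrast.

```python
# Illustrative only: pure-Python stand-ins for the two preprocessing styles.
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # commonly used values, for contrast
IMAGENET_STD = (0.229, 0.224, 0.225)


def to_unit_range(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], as ToTensor() does."""
    return tuple(c / 255.0 for c in rgb)


def imagenet_normalize(unit_rgb):
    """Apply per-channel mean/std normalization (NOT used for TIPSv2)."""
    return tuple(
        (c - m) / s for c, m, s in zip(unit_rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


pixel = (255, 128, 0)
unit = to_unit_range(pixel)
normed = imagenet_normalize(unit)

print(unit)    # every channel stays within [0, 1]
print(normed)  # values fall outside [0, 1]
```

If an existing pipeline already applies mean/std normalization, that step would need to be removed before passing images to `encode_image`.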
from torchvision import transforms
from PIL import Image
import requests
transform = transforms.Compose([
transforms.Resize((448, 448)),
transforms.ToTensor(),
])
url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = transform(image).unsqueeze(0)
out = model.encode_image(pixel_values)
print(out.cls_token.shape)  # (1, 1, 768): global image embedding
print(out.patch_tokens.shape)  # (1, 1024, 768): per-patch spatial features

text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape)  # (2, 768): one embedding per query

import torch.nn.functional as F
classes = ["bus", "car", "dog", "cat"]
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(classes), dim=-1)
similarity = cls @ text_emb.T
print(classes[similarity.argmax()])  # predicted class: bus

import numpy as np
from sklearn.decomposition import PCA
spatial = out.patch_tokens.reshape(1, 32, 32, 768)
feat = spatial[0].detach().cpu().numpy().reshape(-1, 768)
rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb)) # sigmoid for [0, 1] range with good contrast
print(rgb.shape)  # (32, 32, 3): PCA of patch features as RGB

model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])

Image inputs are in the [0, 1] range (no ImageNet normalization).

License: Apache 2.0

@inproceedings{cao2026tipsv2,
title = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
author = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}