TIPSv2 — B/14

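TIPSv2 (Text-Image Pretraining with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base variant, with 86M vision parameters and 110M text parameters. Try the code snippets below, or see the GitHub repository for more use cases and visualizations, including zero-shot segmentation.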

Variant	Vision params	Text params	Embedding dim	DPT heads
B/14	86M	110M	768	B/14-dpt
L/14	303M	184M	1024	L/14-dpt
SO400m/14	412M	448M	1152	SO400m/14-dpt
g/14	1.1B	389M	1536	g/14-dpt

Usage

pip install transformers torch torchvision sentencepiece scikit-learn

Load the model

from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-b14", trust_remote_code=True)
model.eval()

Encode images

Images should be tensors in the [0, 1] range (only ToTensor() is needed, no ImageNet normalization).

from torchvision import transforms
from PIL import Image
import requests

transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # force 3 channels in case the PNG has alpha or a palette
pixel_values = transform(image).unsqueeze(0)
out = model.encode_image(pixel_values)

print(out.cls_token.shape)     # (1, 1, 768) — global image embedding
print(out.patch_tokens.shape)  # (1, 1024, 768) — per-patch spatial features
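
The patch-token count follows from the input resolution and the 14x14 patch size: a 448x448 input gives a 32x32 grid, hence 1024 tokens. A minimal sketch of that arithmetic, where `patch_grid` is a hypothetical helper (not part of the model's API) and each side is assumed to be an exact multiple of the patch size:

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int]:
    """Return the (rows, cols) of the patch grid for an input of the given size.

    Hypothetical helper for illustration; assumes both sides are exact
    multiples of the patch size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return height // patch_size, width // patch_size


rows, cols = patch_grid(448, 448)
print(rows, cols, rows * cols)  # 32 32 1024
```

The same helper can be used to pick the reshape dimensions when working with a different input resolution.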

Encode text

text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape)  # (2, 768) — one embedding per query

Zero-shot classification

import torch.nn.functional as F

classes = ["bus", "car", "dog", "cat"]
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(classes), dim=-1)
similarity = cls @ text_emb.T
print(classes[similarity.argmax()])  # bus — predicted class
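
Raw cosine similarities can be hard to read on their own. One common way to turn them into a distribution over classes is a temperature-scaled softmax. The sketch below uses plain Python and made-up similarity values; the `temperature` of 0.01 is an illustrative assumption, not a value taken from the model.

```python
import math


def softmax(scores, temperature=0.01):
    """Temperature-scaled softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["bus", "car", "dog", "cat"]
similarity = [0.21, 0.15, 0.05, 0.04]  # made-up cosine similarities
probs = softmax(similarity)
best = max(range(len(classes)), key=lambda i: probs[i])
print(classes[best])  # bus
```

A lower temperature sharpens the distribution toward the top class; a higher one flattens it.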

Visualize spatial features

import numpy as np
from sklearn.decomposition import PCA

spatial = out.patch_tokens.reshape(1, 32, 32, 768)
feat = spatial[0].detach().cpu().numpy().reshape(-1, 768)
rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb))  # sigmoid for [0, 1] range with good contrast
print(rgb.shape)  # (32, 32, 3) — PCA of patch features as RGB
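
Because patch features live in the same space as the text embeddings, the zero-shot classification recipe can also be applied per patch to get a coarse label map. The sketch below is not the repository's zero-shot segmentation implementation; it uses random NumPy arrays as stand-ins for `out.patch_tokens` and the class text embeddings, purely to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(1024, 768))  # stand-in for out.patch_tokens[0]
text_feats = rng.normal(size=(4, 768))      # stand-in for 4 class embeddings


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


sim = l2_normalize(patch_feats) @ l2_normalize(text_feats).T  # (1024, 4)
label_map = sim.argmax(axis=-1).reshape(32, 32)  # class index per patch
print(label_map.shape)  # (32, 32)
```

With real features, `label_map` could be upsampled to the input resolution to overlay on the image.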

GPU inference

model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])

Model details

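Architecture: ViT vision encoder (12 layers) + Transformer text encoder (12 layers)
Image preprocessing: resize to any resolution, scale to [0, 1] (no ImageNet normalization)
Text preprocessing: SentencePiece tokenizer, lowercased, up to 64 tokens
Patch size: 14x14 pixels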
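
For pipelines that do not use torchvision, the image preprocessing listed above amounts to scaling 8-bit pixel values into [0, 1] and moving channels first, with no mean/std normalization afterwards. A minimal NumPy sketch, where `to_unit_range` is a hypothetical helper and the input is assumed to be an H x W x 3 `uint8` array that has already been resized:

```python
import numpy as np


def to_unit_range(image_u8):
    """Convert an HxWx3 uint8 image to a 3xHxW float32 array in [0, 1]."""
    if image_u8.dtype != np.uint8:
        raise TypeError("expected a uint8 image")
    scaled = image_u8.astype(np.float32) / 255.0  # no ImageNet mean/std step
    return np.transpose(scaled, (2, 0, 1))  # channels-first, like ToTensor()


dummy = np.full((448, 448, 3), 255, dtype=np.uint8)  # all-white test image
pixels = to_unit_range(dummy)
print(pixels.shape, pixels.min(), pixels.max())
```

The result would still need a batch dimension and conversion to a torch tensor before being passed to `encode_image`.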

License

Apache 2.0

Citation

@inproceedings{cao2026tipsv2,
  title     = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author    = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}