CLIP

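[Blog] [Paper] [Model Card] [Colab]

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. We found that CLIP matches the performance of the original ResNet50 on ImageNet "zero-shot", without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.

Approach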

CLIP

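Full migration guide

See https://ai.atomgit.com/Ascend-SACT/clip-vit-base-patch32/blob/main/reports/CLIP_%E9%80%82%E9%85%8D%E6%8C%87%E5%AF%BC%E4%B9%A6.md

Usage

First, install PyTorch 1.7.1 (or later) and torchvision, along with a few small additional dependencies, and then install this repo as a Python package. On a machine with a CUDA GPU, the following commands will do this: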

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the CUDA version appropriate for your machine, or with cpuonly when installing on a machine without a GPU.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

API

The CLIP module clip provides the following methods:

`clip.available_models()`

Returns the names of the available CLIP models.

`clip.load(name, device=..., jit=False)`

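Returns the model and the TorchVision transform needed by the model, specified by a model name returned by clip.available_models(). The model is downloaded if necessary. The name argument can also be a path to a local checkpoint.

The device on which to run the model can optionally be specified. By default, the first CUDA device is used if one is available, otherwise the CPU. When jit is False, a non-JIT version of the model is loaded.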

`clip.tokenize(text: Union[str, List[str]], context_length=77)`

Returns a LongTensor containing the tokenized sequences of the given text input(s). This can be used as the input to the model.
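To make the fixed-width output concrete, here is a minimal sketch of how tokenized sequences could be packed into rows of context_length entries. It is not the library's tokenizer: the byte-pair encoding step is omitted, the helper name pad_token_ids is hypothetical, the short id lists are made up, and the special-token ids below are assumptions used only for illustration.

```python
from typing import List, Sequence

# Assumed special-token ids, for illustration only; the real values come
# from the tokenizer's vocabulary.
SOT_ID = 49406  # start-of-text marker (assumption)
EOT_ID = 49407  # end-of-text marker (assumption)
PAD_ID = 0      # padding value (assumption)


def pad_token_ids(sequences: Sequence[Sequence[int]],
                  context_length: int = 77) -> List[List[int]]:
    """Wrap each sequence in start/end markers and right-pad to context_length."""
    batch = []
    for ids in sequences:
        row = [SOT_ID, *ids, EOT_ID]
        if len(row) > context_length:
            raise ValueError(
                f"input needs {len(row)} tokens, limit is {context_length}")
        batch.append(row + [PAD_ID] * (context_length - len(row)))
    return batch


# Two made-up id sequences of different lengths become two rows of equal width.
batch = pad_token_ids([[320, 1125], [320, 1929, 537]])
print(len(batch), len(batch[0]))
```

Every row has the same width, which is what lets a batch of texts of different lengths be stacked into a single integer tensor.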

The model returned by clip.load() supports the following methods:

`model.encode_image(image: Tensor)`

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

`model.encode_text(text: Tensor)`

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

`model(image: Tensor, text: Tensor)`

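Given a batch of images and a batch of text tokens, returns two Tensors containing the logit scores corresponding to each image and text input. The values are the cosine similarities between the corresponding image and text features, multiplied by 100.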
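The relationship between features and logits can be sketched in plain Python. This is an illustration of "cosine similarity times 100" on toy 2-D vectors, not the model's own code: the function names and the feature values are made up, and real features have many more dimensions.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def scaled_cosine_logits(image_features, text_features, scale: float = 100.0):
    """Return one row of logits per image: scale * cos(image, text)."""
    images = [normalize(v) for v in image_features]
    texts = [normalize(v) for v in text_features]
    return [[scale * sum(a * b for a, b in zip(i, t)) for t in texts]
            for i in images]


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features, invented for illustration: one image, three texts.
image_features = [[3.0, 4.0]]
text_features = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]

logits_per_image = scaled_cosine_logits(image_features, text_features)
# The per-text logits are the transpose of the per-image logits.
logits_per_text = [list(col) for col in zip(*logits_per_image)]
probs = [softmax(row) for row in logits_per_image]
print(logits_per_image, probs)
```

Because cosine similarity lies in [-1, 1], the factor of 100 spreads the logits out enough for the softmax to produce confident probabilities.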

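More Examples

Zero-Shot Prediction

The code below performs zero-shot prediction using CLIP, as shown in Appendix B of the paper. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the dataset's 100 textual labels.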

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output will look like the following (the exact numbers may differ slightly depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.
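The two bookkeeping steps around the model calls, turning class names into prompts and ranking the resulting probabilities, can be sketched without any model at all. The helper names build_prompts and top_k are hypothetical, and the label subset and probabilities are invented stand-ins for the softmax over similarities.

```python
from typing import List, Sequence, Tuple


def build_prompts(class_names: Sequence[str],
                  template: str = "a photo of a {}") -> List[str]:
    """Turn each class name into a natural-language prompt."""
    return [template.format(name) for name in class_names]


def top_k(probs: Sequence[float], labels: Sequence[str],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


classes = ["snake", "turtle", "lizard"]  # made-up subset of labels
prompts = build_prompts(classes)

# Hypothetical probabilities standing in for the softmax over similarities.
probs = [0.7, 0.2, 0.1]
for label, p in top_k(probs, classes, k=2):
    print(f"{label:>16s}: {100 * p:.2f}%")
```

Changing the template is the only edit needed to try a different prompt wording, since the ranking step depends only on the order of the labels.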

Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
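One possible shape for such a sweep is sketched below, under stated assumptions: the helper names split_indices and sweep_c are hypothetical, the candidate range is an arbitrary choice, and toy_score is a stand-in for fitting the classifier with a given C on the training split and measuring accuracy on the held-out validation split.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def split_indices(n: int, val_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle 0..n-1 and hold out the last val_fraction as a validation split."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n * val_fraction))
    return indices[:-n_val], indices[-n_val:]


def sweep_c(candidates: Sequence[float],
            fit_and_score: Callable[[float], float]) -> Tuple[float, float]:
    """Return (best_c, best_score), where the score is validation accuracy."""
    scored = [(fit_and_score(c), c) for c in candidates]
    best_score, best_c = max(scored)
    return best_c, best_score


def toy_score(c: float) -> float:
    # Stand-in for: fit logistic regression with this C on the training
    # split, then return accuracy on the validation split.
    return 1.0 / (1.0 + abs(math.log10(c)))


# Log-spaced candidates; the range is an arbitrary choice for illustration.
candidates = [10.0 ** e for e in range(-3, 4)]

train_idx, val_idx = split_indices(10)
best_c, best_score = sweep_c(candidates, toy_score)
print(best_c, best_score)
```

The test set stays untouched during the sweep. Once a value of C is chosen, the classifier can be refit on the full training set and evaluated on the test set once.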

See Also

OpenCLIP: includes larger and independently trained CLIP models, up to ViT-G/14
Hugging Face implementation of CLIP: for easier integration with the HF ecosystem
