Model Card: OWLv2

Model Details

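The OWLv2 model (short for Open-World Localization) was proposed in the paper "Scaling Open-Vocabulary Object Detection" by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Like OWL-ViT, OWLv2 is a zero-shot, text-conditioned object detection model that can be used to query an image with one or more text queries.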

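The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to extract text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and a box prediction head to each Transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end, together with the classification and box prediction heads, on standard detection datasets using a bipartite matching loss. One or more text queries per image can be used to perform zero-shot, text-conditioned object detection.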
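
The open-vocabulary classification step described above can be illustrated with a small sketch. This is not the model's actual implementation: the embedding values are made up, the helper names `l2_normalize` and `open_vocab_logits` are hypothetical, and any learned projection or scaling that the real heads may apply is omitted. It only shows the core idea that text-query embeddings take the place of a fixed classifier weight matrix, so every image token is scored against every query.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) + 1e-6
    return [x / norm for x in vec]


def open_vocab_logits(token_embeddings, query_embeddings):
    """Score every image token against every text query.

    token_embeddings: one vector per Transformer output token (toy values).
    query_embeddings: one vector per text query, standing in for the
        text encoder's output (toy values).
    Returns a [num_tokens][num_queries] nested list of cosine similarities.
    """
    tokens = [l2_normalize(t) for t in token_embeddings]
    queries = [l2_normalize(q) for q in query_embeddings]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


# Toy data: 3 image tokens, 2 text queries, embedding dimension 4.
token_embeddings = [
    [0.9, 0.1, 0.0, 0.0],  # token that resembles the first query
    [0.0, 0.2, 0.9, 0.1],  # token that resembles the second query
    [0.1, 0.1, 0.1, 0.1],  # token that matches neither query strongly
]
query_embeddings = [
    [1.0, 0.0, 0.0, 0.0],  # stands in for "a photo of a cat"
    [0.0, 0.0, 1.0, 0.0],  # stands in for "a photo of a dog"
]

logits = open_vocab_logits(token_embeddings, query_embeddings)
for row in logits:
    print([round(v, 3) for v in row])
```

Adding a new class in this setup only requires a new query embedding, which is why no retraining of a classification layer is needed at inference time.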

Model Date

June 2023

Model Type

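The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as the image encoder and a masked self-attention Transformer as the text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and then fine-tuned together with the box and class prediction heads using an object detection objective.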
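
As a rough illustration of the contrastive objective mentioned above, the sketch below computes a symmetric cross-entropy over an image-text similarity matrix, which is how CLIP-style losses are commonly described. It is not the training code used for this model: the function names and the temperature value are assumptions chosen for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] is the similarity of image i and text j; matching
    pairs sit on the diagonal. The temperature is an illustrative value.
    """
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    # Image-to-text direction: each row should select its diagonal entry.
    loss_i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*scaled)]
    loss_t2i = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Matching pairs score high on the diagonal in `aligned`, low in `misaligned`.
aligned = [[0.9, 0.1], [0.2, 0.8]]
misaligned = [[0.1, 0.9], [0.8, 0.2]]
print(contrastive_loss(aligned) < contrastive_loss(misaligned))
```

The loss is small when each image is most similar to its own caption and grows when mismatched pairs score higher, which is the behavior "maximizing the similarity of (image, text) pairs" refers to.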

Documents

OWLv2 Paper

Use with Transformers

import requests
from PIL import Image
import numpy as np
import torch
from transformers import AutoProcessor, Owlv2ForObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Note: boxes need to be visualized on the padded, unnormalized image
# hence we'll set the target image sizes (height, width) based on that

def get_preprocessed_image(pixel_values):
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    unnormalized_image = Image.fromarray(unnormalized_image)
    return unnormalized_image

unnormalized_image = get_preprocessed_image(inputs.pixel_values)

target_sizes = torch.Tensor([unnormalized_image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to final bounding boxes and scores
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
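
Because the boxes printed above are expressed in the coordinates of the padded, preprocessed image, one more step is needed to draw them on the original image. The helper below is a hedged sketch and not part of the Transformers API: `boxes_to_original` is a hypothetical name, and it assumes the preprocessing pads the image to a square on the bottom and right only before resizing, so that the top-left origin is preserved. If that assumption does not hold for a given processor version, the mapping will be wrong.

```python
def boxes_to_original(boxes, padded_size, original_width, original_height):
    """Map (x_min, y_min, x_max, y_max) boxes from the padded square image
    back to the original image.

    Assumes the original was padded to a square on the bottom/right only
    and then resized to padded_size x padded_size (an assumption, see text).
    """
    # The padded square has side max(width, height) in original pixels.
    scale = max(original_width, original_height) / padded_size
    rescaled = []
    for x_min, y_min, x_max, y_max in boxes:
        # Scale to original pixels, then clip anything that falls in the padding.
        rescaled.append((
            min(max(x_min * scale, 0.0), original_width),
            min(max(y_min * scale, 0.0), original_height),
            min(max(x_max * scale, 0.0), original_width),
            min(max(y_max * scale, 0.0), original_height),
        ))
    return rescaled


# Toy example: a 640x480 original, padded to a square and resized to 960x960.
example = boxes_to_original([(96.0, 48.0, 480.0, 960.0)], 960, 640, 480)
print(example)
```

With the variables from the snippet above, a call might look like `boxes_to_original(boxes.tolist(), unnormalized_image.size[0], image.width, image.height)`.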

Model Use

Intended Use

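The model is intended as a research output for research communities. We hope that it will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose labels are unavailable during training.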

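Primary Intended Uses

The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand the robustness, generalization, and other capabilities of computer vision models, as well as their biases and constraints.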

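Data

The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet, which means the data is more representative of the people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

(to be updated for v2)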

BibTeX Entry and Citation Info

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection}, 
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}