OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers (https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot, text-conditioned object detection model that can be used to query an image with one or more text queries.
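OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to extract text features. To adapt CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and box head to each Transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end, together with the classification and box heads, on standard detection datasets using a bipartite matching loss. One or more text queries per image can be used to perform zero-shot, text-conditioned object detection.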
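The idea of swapping fixed classifier weights for text embeddings can be illustrated with a minimal sketch. It assumes that each output token's image embedding is scored against each query embedding by cosine similarity; the function names and the toy 3-dimensional vectors are hypothetical, and any learned projection or logit scaling that the actual classification head may apply is left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) + 1e-12
    return [x / norm for x in vec]


def open_vocab_logits(token_embeds, query_embeds):
    """Score every image token against every text query by cosine similarity.

    token_embeds: per-token image embeddings (one per Transformer output token).
    query_embeds: text-query embeddings; these take the place of the weight
                  rows of a fixed classification layer.
    Returns a [num_tokens][num_queries] nested list of similarities.
    """
    tokens = [l2_normalize(t) for t in token_embeds]
    queries = [l2_normalize(q) for q in query_embeds]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


# Toy embeddings, made up for illustration only.
token_embeds = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
query_embeds = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # stand-ins for two text queries
logits = open_vocab_logits(token_embeds, query_embeds)
# Index of the best-matching query for each token.
best_query_per_token = [row.index(max(row)) for row in logits]
```

Because the "class weights" are just text embeddings, changing the set of detectable classes only requires encoding a different list of queries; no classifier retraining is involved in this sketch.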
May 2022
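The model uses a CLIP backbone with a ViT-B/32 Transformer architecture as the image encoder and a masked self-attention Transformer as the text encoder. These encoders are trained with a contrastive loss to maximize the similarity of (image, text) pairs. The CLIP backbone is trained from scratch and then fine-tuned together with the box and class prediction heads using an object detection objective.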
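As a rough illustration of the contrastive objective described above, the sketch below computes a symmetric cross-entropy over an N x N image-text similarity matrix in which matching pairs are assumed to lie on the diagonal. This is a generic formulation, not code from the model; the function names are hypothetical and the learned temperature commonly used with such losses is omitted.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(similarity):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] is the score between image i and text j; matching
    pairs are assumed to sit on the diagonal.
    """
    n = len(similarity)
    # Image-to-text: each row should put its probability mass on the diagonal.
    image_to_text = -sum(log_softmax(similarity[i])[i] for i in range(n)) / n
    # Text-to-image: the same computation over the columns.
    columns = [[similarity[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy similarity matrices, made up for illustration only.
aligned = [[5.0, 0.0], [0.0, 5.0]]      # matching pairs score highest
misaligned = [[0.0, 5.0], [5.0, 0.0]]   # mismatched pairs score highest
loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
```

Minimizing this quantity pushes the similarity of matching (image, text) pairs up relative to all mismatched pairs in the batch, which is why the aligned matrix yields the lower loss.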
# Copyright (c) Huawei Technologies Co., Ltd. 2024, All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import requests
from PIL import Image
import mindspore
from mindnlp.transformers import OwlViTProcessor, OwlViTForObjectDetection


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        default="zhouhui/owlvit-base-patch32",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    model_path = args.model_name_or_path
    processor = OwlViTProcessor.from_pretrained(model_path)
    model = OwlViTForObjectDetection.from_pretrained(model_path)

    # The image is read from a local file; the original is available at:
    # url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image_path = "000000039769.jpg"
    image = Image.open(image_path)
    texts = [["a photo of a cat", "a photo of a dog"]]
    inputs = processor(text=texts, images=image, return_tensors="ms")
    outputs = model(**inputs)

    # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
    target_sizes = mindspore.Tensor([image.size[::-1]])
    # Post-process outputs (bounding boxes and class logits) into per-image
    # boxes, scores, and labels, keeping detections above the threshold
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=0.1, target_sizes=target_sizes
    )

    i = 0  # Retrieve predictions for the first image for the corresponding text queries
    text = texts[i]
    boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

    # Print detected objects and rescaled box coordinates
    for box, score, label in zip(boxes, scores, labels):
        box = [round(coord, 2) for coord in box.tolist()]
        print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
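

if __name__ == "__main__":
    main()

The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose labels are unavailable during training.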
The primary intended users of these models are AI researchers.
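We primarily imagine the model will be used by researchers to better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.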
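The CLIP backbone of the model was trained on publicly available image-text data. This was done through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet, which means the data is more representative of the people and societies most connected to the internet. The prediction heads of OWL-ViT, together with the CLIP backbone, were fine-tuned on publicly available object detection datasets such as COCO and OpenImages.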
@article{minderer2022simple,
  title={Simple Open-Vocabulary Object Detection with Vision Transformers},
  author={Matthias Minderer and Alexey Gritsenko and Austin Stone and Maxim Neumann and Dirk Weissenborn and Alexey Dosovitskiy and Aravindh Mahendran and Anurag Arnab and Mostafa Dehghani and Zhuoran Shen and Xiao Wang and Xiaohua Zhai and Thomas Kipf and Neil Houlsby},
  journal={arXiv preprint arXiv:2205.06230},
  year={2022},
}