ImageGPT (small-sized model)

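ImageGPT (iGPT) model pretrained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 32x32. It was introduced in the paper "Generative Pretraining from Pixels" by Chen et al. and first released in this repository. See also the official blog post.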

Disclaimer: The team releasing ImageGPT did not write a model card for this model, so this model card has been written by the Hugging Face team.

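Model description

ImageGPT (iGPT) is a transformer decoder model (GPT-like) pretrained in a self-supervised fashion on a large collection of images, namely ImageNet-21k, at a resolution of 32x32 pixels.

The training objective is simple: predict the next pixel value given the previous ones.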
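Written out, this is the standard autoregressive negative log-likelihood. The notation is introduced here for illustration and is not taken from this card: x_i is the i-th pixel token in raster order, n is the sequence length, θ denotes the model parameters, and X is the training set.

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}; \theta),
\qquad
L_{AR} = \mathbb{E}_{x \sim X}\left[ -\log p(x) \right]
```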

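Through pretraining, the model learns an inner representation of images that can then be used to:

extract features useful for downstream tasks: ImageGPT can produce fixed image features on which to train a linear model (such as a sklearn logistic regression model or an SVM). This is also referred to as "linear probing".
perform (un)conditional image generation.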
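The linear-probing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the hidden states are synthetic stand-ins (in practice they would come from the frozen model, for example by requesting `output_hidden_states=True`), the helper names `pool_features`, `fit_linear_probe` and `predict` are hypothetical, and a hand-written softmax regression replaces sklearn so the block needs only NumPy.

```python
import numpy as np


def pool_features(hidden_states):
    """Average over the sequence axis: (N, T, D) -> (N, D)."""
    return hidden_states.mean(axis=1)


def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=300):
    """Fit softmax regression on frozen features with plain gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the most likely class index for each feature vector."""
    return (features @ weights + bias).argmax(axis=1)


# Synthetic stand-in for model activations (an assumption for this sketch):
# 200 "images", 16 sequence positions, 8 hidden dimensions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
hidden_states = rng.normal(size=(200, 16, 8))
# Inject a class-dependent signal into the first hidden dimension.
hidden_states[:, :, 0] += np.where(labels == 1, 2.0, -2.0)[:, None]

features = pool_features(hidden_states)
weights, bias = fit_linear_probe(features, labels, num_classes=2)
accuracy = (predict(features, weights, bias) == labels).mean()
```

Only the linear classifier is trained; the features themselves stay fixed, which is what makes this a probe of the representation rather than fine-tuning. With real features, sklearn's `LogisticRegression` would be the more usual choice for the classifier.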

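Intended uses & limitations

You can use the raw model as either a feature extractor or an (un)conditional image generator. See the model hub for all ImageGPT variants.

How to use

Here is how to use this model in PyTorch to perform unconditional image generation: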

from transformers import ImageGPTImageProcessor, ImageGPTForCausalImageModeling
import torch
import matplotlib.pyplot as plt
import numpy as np
import argparse
from openmind import is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        required=True,  # from_pretrained() cannot load from None
        help="Path to model",
    )
    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    model_name_or_path = args.model_name_or_path
    if is_torch_npu_available():
        device = "npu:0"
    else:
        device = "cpu"
    image_processor = ImageGPTImageProcessor.from_pretrained(model_name_or_path)
    model = ImageGPTForCausalImageModeling.from_pretrained(model_name_or_path)
    model.to(device)

    # unconditional generation of 8 images
    batch_size = 8
    # initialize every sequence with the SOS token (the last ID in the vocabulary)
    context = torch.full((batch_size, 1), model.config.vocab_size - 1, dtype=torch.long, device=device)
    output = model.generate(
        input_ids=context,
        max_length=model.config.n_positions + 1,
        temperature=1.0,
        do_sample=True,
        top_k=40,
    )

    clusters = image_processor.clusters
    height = image_processor.size["height"]
    width = image_processor.size["width"]


    samples = output[:,1:].cpu().detach().numpy()
    samples_img = [np.reshape(np.rint(127.5 * (clusters[s] + 1.0)), [height, width, 3]).astype(np.uint8) for s in samples] # convert color cluster tokens back to pixels

    f, axes = plt.subplots(1, batch_size, dpi=300)
    for img, ax in zip(samples_img, axes):
        ax.axis('off')
        ax.imshow(img)
    plt.show()  # display the generated samples

if __name__ == "__main__":
    main()
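For conditional generation, one common approach is to prime the model with the first part of an existing image's token sequence and let it sample the rest. The sketch below only builds that primed context. It is an illustration under assumptions: `build_conditional_context` is a hypothetical helper, the tokens are random stand-ins for the IDs the image processor would produce, and the SOS ID of 512 assumes a vocabulary of 512 clusters plus one SOS token (the script above reads it from `model.config.vocab_size - 1`).

```python
import numpy as np


def build_conditional_context(token_ids, sos_token_id, num_primed_tokens):
    """Prepend SOS and keep the first `num_primed_tokens` tokens of each sequence.

    token_ids: integer array of shape (batch, seq_len).
    Returns an array of shape (batch, 1 + num_primed_tokens).
    """
    token_ids = np.asarray(token_ids)
    if not 0 <= num_primed_tokens <= token_ids.shape[1]:
        raise ValueError("num_primed_tokens must lie between 0 and the sequence length")
    sos = np.full((token_ids.shape[0], 1), sos_token_id, dtype=token_ids.dtype)
    return np.concatenate([sos, token_ids[:, :num_primed_tokens]], axis=1)


# Stand-in data (an assumption): 2 images of 32x32 = 1024 cluster tokens each.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 512, size=(2, 1024))

# Keep the top half of each image (16 rows x 32 columns = 512 tokens).
context = build_conditional_context(tokens, sos_token_id=512, num_primed_tokens=512)
```

To sample, the context would be converted with `torch.tensor(context, device=device)` and passed as `input_ids` to `model.generate` with the same `max_length` as in the script above, so that generation fills in the remaining 512 tokens.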

Training data

The ImageGPT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes.

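Training procedure

Preprocessing

Images are first resized/rescaled to the same resolution (32x32) and normalized across the RGB channels. Next, color clustering is performed: every pixel is mapped to one of 512 possible cluster values. This yields a sequence of 32x32 = 1024 values rather than 32x32x3 = 3072, which would be prohibitively long for Transformer-based models.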
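The color-clustering step can be sketched as a nearest-centre lookup. This is a minimal illustration, not the released preprocessing code: the palette is random rather than the one shipped with the model, the helper names are hypothetical, and the [-1, 1] normalization is assumed because it is the inverse of the `127.5 * (clusters[s] + 1.0)` decoding used in the generation script.

```python
import numpy as np


def quantize_to_clusters(pixels, clusters):
    """Map each RGB pixel to the index of its nearest cluster centre.

    pixels: (num_pixels, 3), clusters: (num_clusters, 3) -> (num_pixels,)
    """
    # Squared Euclidean distance between every pixel and every centre.
    distances = ((pixels[:, None, :] - clusters[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=1)


def tokens_to_image(tokens, clusters, height, width):
    """Look up each token's centre and rescale from [-1, 1] to [0, 255]."""
    pixels = np.rint(127.5 * (clusters[tokens] + 1.0))
    return pixels.reshape(height, width, 3).astype(np.uint8)


rng = np.random.default_rng(0)
height = width = 32
# Hypothetical palette of 512 RGB centres in [-1, 1); NOT the released clusters.
clusters = rng.uniform(-1.0, 1.0, size=(512, 3))

# Random stand-in for a resized 32x32 RGB image with values in [0, 255].
image = rng.integers(0, 256, size=(height, width, 3)).astype(np.float64)
normalized = image / 127.5 - 1.0  # assumed normalization to [-1, 1]

tokens = quantize_to_clusters(normalized.reshape(-1, 3), clusters)
reconstruction = tokens_to_image(tokens, clusters, height, width)
```

Each image becomes 1024 integer tokens instead of 3072 channel values, and the reconstruction is lossy because every pixel is snapped to its nearest palette entry.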

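Pretraining

Training details can be found in section 3.4 of v2 of the paper.

Evaluation results

For evaluation results on several image classification benchmarks, we refer to the original paper.

BibTeX entry and citation info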

@InProceedings{pmlr-v119-chen20s,
  title = 	 {Generative Pretraining From Pixels},
  author =       {Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {1691--1703},
  year = 	 {2020},
  editor = 	 {III, Hal Daumé and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/chen20s/chen20s.pdf},
  url = 	 {https://proceedings.mlr.press/v119/chen20s.html}
}

@inproceedings{deng2009imagenet,
  title={Imagenet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE conference on computer vision and pattern recognition},
  pages={248--255},
  year={2009},
  organization={IEEE}
}