[!IMPORTANT] This version of ColPali should be loaded with the transformers 🤗 release, not with colpali-engine. It was converted from the vidore/colpali-v1.3-merged checkpoint using the convert_colpali_weights_to_hf.py script.

ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

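ColPali is a model built on a novel architecture and training strategy that uses Vision Language Models (VLMs) to index documents efficiently from their visual features. It extends PaliGemma-3B to generate ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.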

The HuggingFace transformers 🤗 implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan).

Model Description

Read the transformers 🤗 model card: https://huggingface.co/docs/transformers/en/model_doc/colpali.

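Model Training

Dataset

Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset (37%) made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions. The training set is fully English by design, which lets us study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used in both ViDoRe and the training set, to prevent evaluation contamination. We also create a validation set with 2% of the samples to tune hyperparameters.

Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and may also occur during PaliGemma-3B's multimodal training.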

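Parameters

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in bfloat16 format, use low-rank adapters (LoRA) with alpha=32 and r=32 on the transformer layers of the language model and on the final randomly initialized projection layer, and use a paged_adamw_8bit optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.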
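
To make the schedule above concrete, here is a minimal sketch of linear warmup followed by linear decay, using the stated learning rate, batch size, and warmup fraction. Two points are assumptions rather than facts from this card: that the 2% validation split is held out of the 127,460 pairs, and that 32 is the global batch size. The function names are illustrative and do not come from the training code.

```python
import math

# Values stated in the text above.
NUM_PAIRS = 127_460       # query-page pairs in the training dataset
BATCH_SIZE = 32
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.025   # 2.5% warmup steps

# Assumption: the 2% validation split is taken out of the 127,460 pairs,
# and 32 is the global (not per-device) batch size.
VALIDATION_FRACTION = 0.02


def schedule_lengths(num_pairs=NUM_PAIRS):
    """Return (total_steps, warmup_steps) for a single epoch."""
    train_pairs = num_pairs - round(num_pairs * VALIDATION_FRACTION)
    total_steps = math.ceil(train_pairs / BATCH_SIZE)
    warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)
    return total_steps, warmup_steps


def learning_rate(step, total_steps, warmup_steps, peak_lr=PEAK_LR):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps, warmup_steps = schedule_lengths()
halfway_lr = learning_rate(total_steps // 2, total_steps, warmup_steps)
```

Under these assumptions one epoch is 3,904 optimizer steps, of which 98 are warmup. A different reading of the batch size or of the validation split would change both numbers.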

Usage

import torch
from PIL import Image

from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.3-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings.embeddings, image_embeddings.embeddings)
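
The scoring step above is understood to follow ColBERT-style late interaction (often called MaxSim): each query token vector is matched to its most similar page vector, and those maxima are summed. The sketch below illustrates that rule in plain Python on toy 2-dimensional vectors. The names maxsim_score and score_matrix and all embedding values are hypothetical, and the sketch ignores padding and batching, which a real implementation has to handle.

```python
def maxsim_score(query_vectors, page_vectors):
    """Late-interaction score between one query and one page.

    Each argument is a list of vectors (lists of floats). For every
    query vector, take its largest dot product over all page vectors,
    then sum those maxima.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(
        max(dot(q, p) for p in page_vectors)
        for q in query_vectors
    )


def score_matrix(queries, pages):
    """Score every query against every page: result[i][j] is query i vs page j."""
    return [[maxsim_score(q, p) for p in pages] for q in queries]


# Toy embeddings, hypothetical values for illustration only.
toy_queries = [
    [[1.0, 0.0], [0.0, 1.0]],               # query 0: two token vectors
    [[0.0, 1.0]],                           # query 1: one token vector
]
toy_pages = [
    [[1.0, 0.0], [0.5, 0.5]],               # page 0: two patch vectors
    [[0.0, 1.0], [0.0, 0.5], [0.2, 0.9]],   # page 1: three patch vectors
]

toy_scores = score_matrix(toy_queries, toy_pages)
# Index of the highest-scoring page for each query.
best_page_per_query = [row.index(max(row)) for row in toy_scores]
```

Each row of the resulting matrix corresponds to one query and each column to one page, so ranking pages for a query amounts to sorting that query's row in descending order.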

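Resources

The ColPali arXiv paper is available at https://arxiv.org/abs/2407.01449. 📄
The official blog post detailing ColPali can be found here. 📝
The original model implementation code for the ColPali model and for the colpali-engine package can be found here. 🌎
Cookbooks for learning to use the transformers-native version of ColPali, fine-tuning, and similarity map generation can be found here. 📚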

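Limitations

Focus: The model primarily focuses on PDF-type documents and high-resource languages, which may limit its generalization to other document types or less represented languages.
Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.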
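
As one illustration of the engineering effort mentioned above, the sketch below emulates late-interaction scoring on top of a store that only holds single vectors: every page vector becomes its own row tagged with a page id, and per-page maxima are summed across query vectors. This is an assumption-laden toy, not a recommendation from the ColPali authors. The flat_index layout, the page ids, and the late_interaction_search function are hypothetical, and the search is exhaustive rather than approximate.

```python
from collections import defaultdict

# Hypothetical flat index: one row per page vector, tagged with its page id.
# It mimics a store that holds single vectors plus a metadata field.
flat_index = [
    ("page_a", [1.0, 0.0]),
    ("page_a", [0.5, 0.5]),
    ("page_b", [0.0, 1.0]),
    ("page_b", [0.2, 0.9]),
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def late_interaction_search(query_vectors, index):
    """Rank pages by summing, per query vector, the best-matching row of each page."""
    totals = defaultdict(float)
    for q in query_vectors:
        best_per_page = {}
        for page_id, vector in index:
            similarity = dot(q, vector)
            if page_id not in best_per_page or similarity > best_per_page[page_id]:
                best_per_page[page_id] = similarity
        # Add this query vector's best match for every page to the running total.
        for page_id, best in best_per_page.items():
            totals[page_id] += best
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical query token vectors.
ranking = late_interaction_search([[0.0, 1.0], [0.1, 0.9]], flat_index)
```

A production store would typically replace the inner loop with a top-k nearest-neighbor lookup per query vector, which only approximates the exhaustive score because pages missing from a top-k list contribute nothing for that query vector.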

License

ColPali's vision language backbone model (PaliGemma) is under the gemma license, as specified in its model card. ColPali inherits this gemma license.

Contact

Manuel Faysse: manuel.faysse@illuin.tech
Hugues Sibille: hugues.sibille@illuin.tech
Tony Wu: tony.wu@illuin.tech

Citation

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}