Transformers version: until the next Transformers PyPI release, please install Transformers from source and use this PR to be able to use Idefics3. TODO: update when the new version is released.

Idefics3

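Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, significantly enhancing capabilities in optical character recognition (OCR), document understanding, and visual reasoning.

We release the checkpoints under the Apache 2.0 license.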

Model Summary

Developed by: Hugging Face
Model type: Multimodal model (image + text)
Language(s) (NLP): en
License: Apache 2.0
Parent models: google/siglip-so400m-patch14-384 and meta-llama/Meta-Llama-3.1-8B-Instruct
Resources for more information:
- Idefics1 paper: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- Idefics2 paper: What matters when building vision-language models?
- Idefics3 paper: Building and better understanding vision-language models: insights and future directions

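Uses

Idefics3-8B can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query along with one (or multiple) images. Text and images can be arbitrarily interleaved. This includes image captioning, visual question answering, and similar tasks. The model does not support image generation.

The post-training of Idefics3-8B consists only of a supervised fine-tuning stage, with no RLHF alignment. As a result, the model may produce short answers or need several prompting iterations to fully address the user's request. Prefixing the assistant's response with text such as "Let's fix this step by step" has been found to effectively influence the generated output.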
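The prefix trick described above can be sketched as follows. This is only an illustration, not the authors' code: the helper name `add_assistant_prefix` is hypothetical, and the sketch assumes the rendered prompt ends with the generation prompt (as `processor.apply_chat_template(messages, add_generation_prompt=True)` is expected to produce), so that any appended text is read by the model as the beginning of its own answer. The literal template text in the stand-in string is also an assumption.

```python
def add_assistant_prefix(prompt: str, prefix: str) -> str:
    """Append `prefix` to a rendered chat prompt so generation continues from it."""
    # Hypothetical helper: drop trailing whitespace, then add one separating space.
    return prompt.rstrip() + " " + prefix


# Stand-in for the string returned by
# processor.apply_chat_template(messages, add_generation_prompt=True);
# the exact template wording is an assumption made for this example.
rendered = "User: How many apples are in the picture?\nAssistant:"
prefixed = add_assistant_prefix(rendered, "Let's fix this step by step.")
print(prefixed)
```

The prefixed string would then be passed as `text=` to the processor in place of the unmodified prompt.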

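To fine-tune Idefics3-8B on a specific task, we provide a fine-tuning tutorial. Other resources for fine-tuning Idefics2 (which can easily be adapted to Idefics3):

With the TRL library: Script
With the Hugging Face Trainer: Tutorial notebook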

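Technical summary

Idefics3 shows a substantial improvement over Idefics2, especially on document understanding tasks. It serves as a strong foundation for fine-tuning on various specific use cases.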

Model	MMMU (val)	MathVista (test)	MMStar (val)	DocVQA (test)	TextVQA (val)
Idefics2-8B	45.2	52.2	49.5	74.0	73.0
Idefics3-8B	46.6	58.4	55.9	87.7	74.9

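Idefics3 introduces several changes compared to Idefics2:

We use 169 visual tokens to encode an image of size 364x364. Each image is divided into several sub-images of size at most 364x364, which are then encoded separately.
For the fine-tuning datasets, we have extended The Cauldron and added several datasets, including Docmatix. We will push these datasets to the same repo as The Cauldron soon (TODO).

More details about the training of the model are available in our technical report.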
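The image-splitting change listed above implies a simple relationship between image size and visual token count. The sketch below is a rough estimate under stated assumptions, not the processor's actual logic: it assumes the image has already been resized, that it is cut into a grid of tiles of at most 364x364, and that each tile costs 169 tokens. It ignores separator tokens and any additional overview image the processor may add. The function name `estimate_visual_tokens` is hypothetical.

```python
import math

TILE_SIZE = 364        # maximum sub-image side, taken from the text
TOKENS_PER_TILE = 169  # visual tokens per 364x364 image, taken from the text


def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough visual-token estimate for an image already resized to width x height."""
    tiles_x = math.ceil(width / TILE_SIZE)   # tiles along the horizontal axis
    tiles_y = math.ceil(height / TILE_SIZE)  # tiles along the vertical axis
    return tiles_x * tiles_y * TOKENS_PER_TILE


print(estimate_visual_tokens(364, 364))    # a single tile
print(estimate_visual_tokens(1456, 1092))  # a 4 x 3 grid of tiles
```

Under these assumptions the token count grows with the number of tiles, which is why the resolution setting discussed later in this card has a direct effect on memory use.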

How to Get Started

This section shows code snippets for generation with Idefics3-8B.

import torch

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3", torch_dtype=torch.bfloat16
).to(DEVICE)

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },       
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
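The decoded strings printed above contain the whole conversation, not only the new answer. A minimal sketch for pulling out the final reply is shown below. It assumes the chat template labels assistant turns with the literal marker `Assistant:`, which should be checked against your own decoded output; the helper name `last_assistant_reply` and the sample string are hypothetical.

```python
def last_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last `marker`, or the whole string if it is absent."""
    # rpartition splits on the LAST occurrence of the marker.
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Made-up decoded output in the assumed format, for illustration only.
sample = (
    "User: What do we see in this image?\n"
    "Assistant: The Statue of Liberty.\n"
    "User: And how about this image?\n"
    "Assistant: The Chicago skyline."
)
print(last_assistant_reply(sample))
```

An alternative that avoids string matching is to slice `generated_ids` past the prompt length before calling `batch_decode`, assuming the returned ids begin with the prompt tokens.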

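Text generation inference

TODO.

Model optimizations

If your GPU allows, we first recommend loading (and running inference with) the model in half precision (torch.float16 or torch.bfloat16).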

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
+    torch_dtype=torch.bfloat16,    
).to(DEVICE)
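Whether bfloat16 is usable depends on the GPU. The sketch below shows one possible way to choose a dtype at runtime. The selection policy (bfloat16 if supported, otherwise float16, otherwise float32 on CPU) is an assumption of this example rather than a recommendation from the model authors, and both helper names are hypothetical.

```python
def pick_dtype_name(cuda_available: bool, bf16_supported: bool) -> str:
    """Choose a dtype name from hardware capabilities (policy is an assumption)."""
    if not cuda_available:
        return "float32"  # conservative CPU fallback
    return "bfloat16" if bf16_supported else "float16"


def pick_torch_dtype():
    """Resolve the dtype name to a torch dtype on the current machine."""
    import torch  # imported lazily so the pure helper above works without torch

    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return getattr(torch, pick_dtype_name(cuda, bf16))


print(pick_dtype_name(True, True))
print(pick_dtype_name(True, False))
print(pick_dtype_name(False, False))
```

The result of `pick_torch_dtype()` could be passed as `torch_dtype=` in the `from_pretrained` call shown above.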

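Vision encoder efficiency

When initializing the processor (AutoProcessor.from_pretrained), you can choose the default resolution the images will be rescaled to by adding size={"longest_edge": N*364}, where N is your desired value. In practice N=4 works best (it is the default), but for very large images passing N=5 could be a good choice. This setting affects the number of visual tokens passed to the language model. If you are GPU-memory-constrained, you can decrease N, for example to N=3 or N=2, especially for low-resolution images.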
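As a concrete illustration of the `size` argument described above, the sketch below builds the dictionary for a given N and passes it to the processor. The helper names `processor_size` and `load_processor` are hypothetical; only the `size={"longest_edge": N*364}` form and the checkpoint name come from this card.

```python
TILE_EDGE = 364  # edge length used in the text


def processor_size(n: int) -> dict:
    """Build the `size` argument for a given N (hypothetical helper)."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    return {"longest_edge": n * TILE_EDGE}


def load_processor(n: int = 4):
    """Load the processor with a custom resolution; downloads files on first use."""
    from transformers import AutoProcessor  # lazy import keeps the helper above light

    return AutoProcessor.from_pretrained(
        "HuggingFaceM4/Idefics3-8B-Llama3", size=processor_size(n)
    )


# Longest edge, in pixels, for the values of N mentioned in the text.
for n in (2, 3, 4, 5):
    print(n, processor_size(n))
```

Calling `load_processor(2)` would be one way to try the lower-memory setting on small images before moving to larger values of N.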

Using Flash-attention 2 to speed up generation

First, make sure flash-attn is installed. Refer to the original Flash Attention repository for package installation. Then simply change the snippet above to:

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
+    torch_dtype=torch.bfloat16,    
+    _attn_implementation="flash_attention_2",
).to(DEVICE)
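If the same script has to run on machines with and without flash-attn, one option is to detect the package before choosing the attention implementation. The sketch below is an assumption-laden illustration: the helper name `attention_kwargs` is hypothetical, it only checks that the `flash_attn` package is importable (not that the GPU supports it), and the `"eager"` fallback is a choice made for this example.

```python
import importlib.util


def attention_kwargs() -> dict:
    """Return from_pretrained kwargs selecting an attention implementation."""
    # find_spec returns None when the package cannot be found.
    has_flash = importlib.util.find_spec("flash_attn") is not None
    impl = "flash_attention_2" if has_flash else "eager"
    # The keyword name mirrors the snippet above.
    return {"_attn_implementation": impl}


print(attention_kwargs())
```

The returned dictionary could be unpacked into the `from_pretrained` call, for example `AutoModelForVision2Seq.from_pretrained(..., **attention_kwargs())`.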

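Misuse and Out-of-scope Use

Using the model in high-stakes settings is out of scope for this model. The model is not designed for critical decisions, nor for uses with any material consequences on an individual's livelihood or wellbeing. The model may output content that appears factual but is not correct. Out-of-scope uses include:

Usage for evaluating or scoring individuals, such as for employment, education, or credit
Applying the model to critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct

Intentionally using the model to cause harm, violate human rights, or carry out other kinds of malicious activities is a misuse of this model. This includes:

Spam generation
Disinformation and influence operations
Disparagement and defamation
Harassment and abuse
Deception
Unconsented impersonation and imitation
Unconsented surveillance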

License

The model is built on top of two pre-trained models: google/siglip-so400m-patch14-384 and meta-llama/Meta-Llama-3.1-8B-Instruct. We release the Idefics3 checkpoints under the Apache 2.0 license.

Citation

BibTeX:

@misc{laurençon2024building,
      title={Building and better understanding vision-language models: insights and future directions.}, 
      author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
      year={2024},
      eprint={2408.12637},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

We thank @andito and @amyeroberts for their help with the integration in Transformers.