Granite 4.0 3B Vision

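Model summary: Granite-4.0-3B-Vision is a vision-language model (VLM) for enterprise-grade document data extraction. It focuses on specialized, complex extraction tasks that ultra-compact models often struggle with:

Chart extraction: converts charts into structured, machine-readable formats (Chart2CSV, Chart2Summary, and Chart2Code)
Table extraction: accurately extracts tables with complex layouts from document images and outputs them as JSON, HTML, or OTSL
Semantic key-value pair (KVP) extraction: extracts values from diverse document layouts based on key names and descriptions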

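The model is delivered as a LoRA adapter built on Granite 4.0 Micro, comprising a 3.5B-parameter base LLM and a 0.5B-parameter LoRA adapter. A single deployment can therefore serve both multimodal document understanding and text-only workloads: the base model handles text-only requests without loading the adapter. See Model Architecture for details.

The methods and data (ChartNet) used for this model are described in the paper "ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding" (https://huggingface.co/papers/2603.27064).

Although our focus is on specialized document extraction tasks, the current model retains and extends the capabilities of Granite-Vision-3.3 2B, so existing users can adopt it seamlessly without changing their workflows. It continues to support vision-language tasks such as generating detailed natural-language descriptions from images (image-to-text). The model can be used standalone, and it integrates seamlessly with Docling to enhance document processing pipelines with deep visual understanding.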

Developed by: IBM Research
GitHub repository: https://github.com/ibm-granite
Release date: March 27, 2026
License: Apache 2.0

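Supported Tasks

The model supports specialized extraction tasks, each activated by a simple task tag in the user message. The chat template automatically expands the tag into the full prompt, so there is no need to write lengthy instructions.

Tag	Task	Output
`<chart2csv>`	Chart to CSV	CSV table with headers and values
`<chart2code>`	Chart to Python code	Python code that recreates the chart
`<chart2summary>`	Chart to summary	Natural-language description of the chart
`<tables_json>`	Table extraction (JSON)	Structured JSON with dimensions and cells
`<tables_html>`	Table extraction (HTML)	HTML `<table>` markup
`<tables_otsl>`	Table extraction (OTSL)	OTSL markup with cell/merge tags
KVP (see the prompt notes below)	Schema-based key-value pair extraction	JSON with nested dictionaries and arrays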
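
As a small illustration of how a tag slots into a request, the sketch below builds the user message for a given task. The `TASK_TAGS` mapping and the `build_user_message` helper are hypothetical names introduced here for illustration; only the tag strings and the message layout (an image placeholder followed by the tag as text) are taken from this card.

```python
# Hypothetical helper: these names are illustrative, not part of the model's API.
TASK_TAGS = {
    "chart2csv": "<chart2csv>",
    "chart2code": "<chart2code>",
    "chart2summary": "<chart2summary>",
    "tables_json": "<tables_json>",
    "tables_html": "<tables_html>",
    "tables_otsl": "<tables_otsl>",
}


def build_user_message(task: str) -> list[dict]:
    """Return a one-turn conversation whose text part is just the task tag."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_TAGS)}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},  # placeholder; the image itself is passed separately
                {"type": "text", "text": TASK_TAGS[task]},
            ],
        }
    ]
```

KVP extraction is deliberately absent from the mapping, since it is driven by a full schema prompt rather than a tag.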

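Model Performance

Benchmark Results

We compared Granite-4.0-3B-Vision against leading small vision-language models (VLMs) on several extraction benchmarks.

Chart Extraction

We evaluate chart extraction on the human-verified test set from ChartNet. Model predictions are compared against the ground truth using an LLM as a judge (GPT-4o). We report the average score, on a 0-100 scale, for the chart2csv and chart2summary extraction tasks.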

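Table Extraction

To benchmark table extraction, we built a unified evaluation suite spanning multiple datasets and settings to assess the end-to-end table extraction capability of vision-language models:

TableVQA-Extract: converts the original visual table question-answering benchmark into a cropped-table extraction task.
OmniDocBench-tables: a document parsing benchmark covering many PDF types, with detailed annotations for layout, text, formulas, and tables. We use the subset of pages that contain one or more tables and evaluate table extraction in the full-page setting.
PubTablesV2: a large-scale table extraction benchmark, evaluated in both the cropped-table and full-page document settings.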

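To unify evaluation, we replace each dataset's original annotations (for example, question-answer pairs) with a single instruction: extract the tables from the image in HTML format, using the corresponding HTML as ground truth. For full-page inputs, only table elements are considered; when multiple tables are present, they are aggregated into a Python list.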
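
The aggregation step for full-page inputs can be pictured with a short sketch. The card does not show the actual evaluation code, so the `collect_tables` helper and its regular expression are assumptions made for illustration; in particular, the non-greedy pattern does not handle nested tables.

```python
import re

# Matches <table ...> ... </table>, case-insensitively, across line breaks.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def collect_tables(page_html: str) -> list[str]:
    """Return every table element found in the page, in document order.

    Non-table content (paragraphs, headings, and so on) is ignored, which
    mirrors the "only table elements are considered" rule for full pages.
    """
    return TABLE_RE.findall(page_html)
```

Each element of the returned list can then be scored against the matching ground-truth table.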

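We report results using TEDS (Tree-Edit-Distance-based Similarity), which measures the structural and content similarity between the predicted HTML table and the ground-truth HTML table.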
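
For reference, TEDS is commonly defined as shown here. This is the usual definition from the table-recognition literature, included as background; the card does not state which variant or edit-cost settings were used in this evaluation.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Here $T_a$ and $T_b$ are the predicted and ground-truth tables parsed as HTML trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in $T$. Identical trees score 1, and the score falls toward 0 as more edits are needed.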

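Results are presented separately for the cropped-table and full-page settings to highlight the model's performance in controlled and real-world document scenarios.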

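Key-Value Pair (KVP) Extraction

We evaluate on the VAREX benchmark, which targets multimodal structured extraction from documents. Granite-4.0-3B-Vision achieves 85.5% exact-match accuracy (zero-shot), ranking third among 2-4B-parameter models as of March 2026 (see the benchmark's published results).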
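
To make "exact-match accuracy" concrete, here is a minimal sketch of one plausible field-level scoring rule. It is an assumption for illustration only: VAREX's own scorer may normalize values or aggregate differently, and `exact_match_accuracy` is a hypothetical helper name.

```python
def exact_match_accuracy(predictions: list[dict], references: list[dict]) -> float:
    """Fraction of reference fields whose predicted value equals the reference exactly.

    Assumes flat dictionaries; a field missing from a prediction counts as wrong
    unless the reference value is also None.
    """
    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        for key, expected in ref.items():
            total += 1
            correct += int(pred.get(key) == expected)
    return correct / total if total else 0.0
```

Under this rule a document with two reference fields, one of which is predicted correctly, contributes one hit out of two.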

Setup

Tested with python=3.11

pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.57.6 peft==0.18.1 tokenizers==0.22.2 pillow==12.1.1
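
After installing, it can be handy to confirm that the environment matches the pinned versions. The sketch below uses only the standard library; the `PINNED` mapping copies the versions from the commands above, and `check_versions` is a hypothetical helper name, not something shipped with the model.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip commands in this section.
PINNED = {
    "torch": "2.10.0",
    "transformers": "4.57.6",
    "peft": "0.18.1",
    "tokenizers": "0.22.2",
    "pillow": "12.1.1",
}


def check_versions(pinned: dict[str, str]) -> dict[str, str]:
    """Map each package to 'ok', 'missing', or 'found <installed version>'."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build suffixes such as "+cu128" are ignored for the comparison.
        report[name] = "ok" if found.split("+")[0] == wanted else f"found {found}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(PINNED).items():
        print(f"{package}: {status}")
```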

Using with Transformers

import re
from io import StringIO

import pandas as pd
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
from huggingface_hub import hf_hub_download

model_id = "ibm-granite/granite-4.0-3b-vision"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map=device
).eval()

# Optional: merge LoRA adapters into base weights for faster inference.
# Prefer to skip when using text-only tasks, as the LoRA adapters are vision-specific.
model.merge_lora_adapters()

def run_inference(model, processor, images, prompts):
    """Run batched inference on image+prompt pairs (one image per prompt)."""
    conversations = [
        [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ]}]
        for prompt in prompts
    ]
    texts = [
        processor.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
        for conv in conversations
    ]
    inputs = processor(
        text=texts, images=images, return_tensors="pt", padding=True, do_pad=True
    ).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096, 
        use_cache=True
    )
    results = []
    for i in range(len(prompts)):
        gen = outputs[i, inputs["input_ids"].shape[1]:]
        results.append(processor.decode(gen, skip_special_tokens=True))
    return results


def display_table(text):
    """Pretty-print CSV (possibly wrapped in ```csv```) or HTML table content via pandas."""
    m = re.search(r"```csv\s*\n(.*?)```", text, re.DOTALL)
    if m:
        df = pd.read_csv(StringIO(m.group(1)))
        print(df.to_string(index=False))
    elif "<table" in text.lower():
        df = pd.read_html(StringIO(text))[0]
        print(df.to_string(index=False))
    else:
        print(text)

Chart and Table Tasks

Pass the tag, and the chat template handles the rest:

chart_path = hf_hub_download(repo_id=model_id, filename="chart.jpg")
table_path = hf_hub_download(repo_id=model_id, filename="table.png")
chart_img = Image.open(chart_path).convert("RGB")
table_img = Image.open(table_path).convert("RGB")

# Batched chart tasks
chart_prompts = ["<chart2csv>", "<chart2summary>", "<chart2code>"]
chart_results = run_inference(model, processor, [chart_img] * len(chart_prompts), chart_prompts)
for prompt, result in zip(chart_prompts, chart_results):
    print(f"{prompt}:")
    display_table(result)
    print()

# Batched table tasks
table_prompts = ["<tables_html>", "<tables_otsl>"]
table_results = run_inference(model, processor, [table_img] * len(table_prompts), table_prompts)
for prompt, result in zip(table_prompts, table_results):
    print(f"{prompt}:")
    display_table(result)
    print()

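Key-Value Pair (KVP) Extraction

For key-value pair extraction, use the VAREX prompt format. Provide a JSON schema describing the fields to extract, and the model returns a JSON object containing the extracted values.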

import json

invoice_path = hf_hub_download(repo_id=model_id, filename="invoice.png")
invoice_img = Image.open(invoice_path).convert("RGB")
schema = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string", "description": "The date the invoice was issued"},
        "order_number": {"type": "string", "description": "The unique identifier for the order"},
        "seller_tax_id": {"type": "string", "description": "The tax identification number of the seller"},
    }
}

prompt = f"""Extract structured data from this document.
Return a JSON object matching this schema:

{json.dumps(schema, indent=2)}

Return null for fields you cannot find.
Return ONLY valid JSON.
Return an instance of the JSON with extracted values, not the schema itself."""

result = run_inference(model, processor, [invoice_img], [prompt])[0]
print(result)
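
Since the reply is plain text, downstream code usually needs to turn it back into a dictionary. The sketch below is one way to do that, under two assumptions not confirmed by this card: that the reply may optionally be wrapped in a Markdown code fence, and that fields absent from the reply should be treated as null. `parse_kvp_output` is a hypothetical helper name.

```python
import json
import re

# The fence marker is built indirectly so this snippet contains no literal fence.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_kvp_output(text: str, schema: dict) -> dict:
    """Parse the model's reply as JSON and keep only the fields named in the schema."""
    fenced = FENCED_JSON.search(text)
    payload = fenced.group(1) if fenced else text
    data = json.loads(payload)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Missing fields become None, mirroring the "Return null" line in the prompt.
    return {key: data.get(key) for key in schema.get("properties", {})}
```

With the example above, `parse_kvp_output(result, schema)` would yield a dictionary with exactly the three schema keys, provided the reply is valid JSON.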

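Using with vLLM

The model ships with a custom vLLM implementation (granite4_vision.py) and a server launcher (start_granite4_vision_server.py) that register the model through vLLM's out-of-tree model integration, so vLLM does not need to be built from source.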

pip install vllm
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Download granite4_vision.py and start_granite4_vision_server.py from this repository.

hf download ibm-granite/granite-4.0-3b-vision granite4_vision.py --local-dir .
hf download ibm-granite/granite-4.0-3b-vision start_granite4_vision_server.py --local-dir .

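Then, from the directory containing the downloaded files, you can start the server.

vLLM Serving Modes

Full merge (adapter merged at load time)

All LoRA deltas are merged into the base weights at load time. This gives the fastest inference, because every request uses the merged model.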

python start_granite4_vision_server.py \
    --model ibm-granite/granite-4.0-3b-vision \
    --trust_remote_code --host 0.0.0.0 --port 8000 \
    --hf-overrides '{"adapter_path": "ibm-granite/granite-4.0-3b-vision"}'
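
The idea behind merging LoRA deltas into the base weights can be shown with toy matrices. This is a generic sketch of the standard LoRA identity, not the server's actual code: the matrices, the `scaling` value (which stands in for alpha divided by rank), and the helper functions are all made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy shapes: W is 2x3 (out x in), B is 2x1, A is 1x3, so the rank is 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 2.0, 0.0]]
scaling = 2.0  # illustrative stand-in for alpha / rank
x = [[1.0], [2.0], [3.0]]  # one input column vector

# Unmerged: base path plus low-rank path, both evaluated for every request.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the delta into the weights once, then use a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output; the merged form simply avoids the extra low-rank multiplications on each request, at the cost of no longer having the untouched base weights available.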

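Native LoRA runtime

vLLM applies the LM LoRA dynamically per request. Text-only prompts use the plain base model, while image prompts apply the LoRA adapter at inference time.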

python start_granite4_vision_server.py \
    --model ibm-granite/granite-4.0-3b-vision \
    --trust_remote_code --host 0.0.0.0 --port 8000 \
    --enable-lora --max-lora-rank 256 \
    --default-mm-loras '{"image": "ibm-granite/granite-4.0-3b-vision"}'

Client Example

Query the running server through the OpenAI-compatible API:

import base64
from openai import OpenAI
from huggingface_hub import hf_hub_download
from PIL import Image

model_id = "ibm-granite/granite-4.0-3b-vision"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def run_inference(client, model_id, image_path, tag):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    messages = [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": tag},
        ]}
    ]
    response = client.chat.completions.create(
        model=model_id, messages=messages, max_tokens=4096, temperature=0,
    )
    return response.choices[0].message.content

chart_path = hf_hub_download(repo_id=model_id, filename="chart.jpg")
table_path = hf_hub_download(repo_id=model_id, filename="table.png")

# Chart tasks
for tag in ["<chart2csv>", "<chart2summary>", "<chart2code>"]:
    result = run_inference(client, model_id, chart_path, tag)
    print(f"{tag}:\n{result}\n")

# Table tasks
for tag in ["<tables_json>", "<tables_html>", "<tables_otsl>"]:
    result = run_inference(client, model_id, table_path, tag)
    print(f"{tag}:\n{result}\n")
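
The client above labels every image as image/png, including chart.jpg. Whether the server is sensitive to that is not stated in this card, but a data URL whose MIME type matches the file is easy to build with the standard library. `to_data_url` is a hypothetical helper name; it could replace the hard-coded URL string inside `run_inference`.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode an image file as a data URL whose MIME type matches its extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {image_path!r}")
    with open(image_path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```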

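Training Data

The model was fine-tuned on a carefully curated mixture of datasets centered on information extraction, covering chart understanding, complex table parsing, and document key-value pair (KVP) extraction, complemented by general Granite Vision instruction-following datasets for broad visual understanding.

The chart understanding data was generated with a novel code-guided augmentation method that produces diverse, semantically aligned chart samples, each comprising rendering code, a chart image, the underlying data CSV, and a natural-language summary. Using this pipeline, we also release ChartNet, a comprehensive million-scale multimodal dataset that additionally includes real-world, human-annotated, safety, and grounding subsets. For details on the dataset and its construction, see the paper ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding.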

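Model Architecture

SigLIP2 vision encoder: google/siglip2-so400m-patch16-384. Input images are split into 384×384 tiles (a base downsampled view is always included), and each tile is encoded independently.
Window Q-Former projector: visual features are compressed 4x by a windowed Q-Former projector. Each 4×4 patch window is reduced to 2×2 tokens through cross-attention, with the queries initialized from a downsampled version of the window features. This reduces the number of visual tokens fed to the LLM.
Feature injection: a variant of Deepstack that additively injects visual features into the hidden states of multiple LLM layers through two complementary mechanisms:
- LayerDeepstack: features from 4 vision encoder depths are each projected and injected into different LLM layers. The Q-Former queries are initialized from downsampled features. The mapping is reversed: the deepest (most semantically rich) visual features feed the earliest LLM layers, providing strong semantic grounding from the start.
- SpatialDeepstack: the deepest visual features at the highest resolution are split into 4 complementary spatial groups. Each group's Q-Former queries are initialized from the corresponding spatial subset and injected into different later LLM layers, providing fine-grained spatial detail.
In total, 8 vision-to-LLM injection points distribute visual information throughout the network for stronger visual grounding.
Language model: Granite 4.0 Micro (3B), with LoRA (rank 256) applied to all self-attention projections and MLP layers.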
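
The tile and window sizes above imply a simple token budget, worked through in the sketch below. The per-tile arithmetic follows directly from the stated numbers (384-pixel tiles, 16-pixel patches from the encoder name, 4×4 windows reduced to 2×2 tokens). The tile-count rule in `estimate_visual_tokens` is an assumption made for illustration: the card does not describe how the tiling grid is chosen, and any extra separator tokens are ignored.

```python
import math

TILE = 384    # tile side in pixels, from the card
PATCH = 16    # patch side, implied by "patch16" in the encoder name
WINDOW = 4    # each 4x4 patch window ...
QUERIES = 2   # ... is reduced to 2x2 tokens


def tokens_per_tile() -> int:
    """Visual tokens produced for one 384x384 tile after the Q-Former projector."""
    patches_per_side = TILE // PATCH               # 24 patches per side, 576 in total
    windows_per_side = patches_per_side // WINDOW  # 6 windows per side
    return (windows_per_side * QUERIES) ** 2       # 12 * 12 tokens


def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough token count, ASSUMING a ceil-grid of tiles plus one base view."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE) + 1
    return tiles * tokens_per_tile()


print(tokens_per_tile())
print(estimate_visual_tokens(768, 384))
```

The drop from 576 patches to 144 tokens per tile is the 4x compression mentioned above.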

Supported inputs: English instructions and images (PNG, JPEG).

Infrastructure

Granite 4.0 Vision was trained on IBM's Blue Vela supercomputing cluster, which is equipped with NVIDIA H100 GPUs. Training used 32 GPUs and took approximately 200 hours.

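Ethical Considerations and Limitations

Before deploying a vision-language model, consider the risks its use involves:

Task scope: the model is designed for structured extraction tasks and may not generalize well to open-ended vision-language tasks.
Hallucination: as with all generative models, outputs should be validated before use in automated pipelines, especially in high-stakes document processing scenarios.
Language: the model was trained on English instructions only, and output quality may degrade for documents in other languages.

To improve safety in enterprise deployments, we recommend using Granite 4.0 Vision together with Granite Guardian, a model designed to detect and flag risks in inputs and outputs across the key dimensions outlined in the IBM AI Risk Atlas.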

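Resources

⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
💡 Granite learning resources: https://ibm.biz/granite-learning-resources

Citation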

@misc{granite-4.0-3b-vision,
  title={Granite 4.0 Vision},
  author={IBM Granite Vision Team},
  year={2026},
  url={https://huggingface.co/ibm-granite/granite-4.0-3b-vision}
}

@article{kondic2026chartnet,
  title={ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding},
  author={Kondic, Jovana and Li, Pengyuan and Joshi, Dhiraj and Sanchez, Isaac and Wiesel, Ben and Abedin, Shafiq and Alfassy, Amit and Schwartz, Eli and Caraballo, Daniel and Cinar, Yagmur Gizem and Scheidegger, Florian and Ross, Steven I. and Weidele, Daniel Karl I. and Hua, Hang and Arutyunova, Ekaterina and Herzig, Roei and He, Zexue and Wang, Zihan and Yu, Xinyue and Zhao, Yunfei and Jiang, Sicong and Liu, Minghao and Lin, Qunshu and Staar, Peter and Lastras, Luis and Oliva, Aude and Feris, Rogerio},
  journal={arXiv preprint arXiv:2603.27064},
  year={2026}
}