nanonets/Nanonets-OCR2-1.5B-exp

Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging

🖥️ Live Demo | 📢 Blog | ⌨️ GitHub

Nanonets-OCR2 by Nanonets is a family of powerful, state-of-the-art image-to-markdown OCR models that go far beyond traditional text extraction. They transform documents into structured markdown with intelligent content recognition and semantic tagging, making them ideal for downstream processing by Large Language Models (LLMs).

Nanonets-OCR2 ships with several features that make complex documents easy to handle:

  • LaTeX Equation Recognition: Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. It distinguishes between inline ($...$) and display ($$...$$) equations.
  • Intelligent Image Description: Describes images within documents using structured <img> tags, making them digestible for LLM processing. It can describe various image types, including logos, charts, and graphs, detailing their content, style, and context.
  • Signature Detection & Isolation: Identifies and isolates signatures from surrounding text, outputting them within a <signature> tag. This is critical for processing legal and business documents.
  • Watermark Extraction: Detects and extracts watermark text from documents, placing it within a <watermark> tag.
  • Smart Checkbox Handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
  • Complex Table Extraction: Accurately extracts complex tables and converts them into both markdown and HTML table formats.
  • Flowcharts & Organizational Charts: Extracts flowcharts and organizational charts as mermaid code.
  • Handwritten Documents: The model is trained on handwritten documents across multiple languages.
  • Multilingual Support: The model is trained on documents in many languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, and Arabic.
  • Visual Question Answering (VQA): The model is designed to answer a question directly when the answer is present in the document, and to reply "Not mentioned" otherwise.
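The semantic tags listed above can be consumed programmatically after OCR. A minimal post-processing sketch in Python (the sample output string and helper name are illustrative, not part of the model's API):

```python
import re

def extract_tagged(markdown: str) -> dict:
    """Collect the semantic tags Nanonets-OCR2 emits in its markdown output."""
    return {
        "watermarks": re.findall(r"<watermark>(.*?)</watermark>", markdown, re.S),
        "signatures": re.findall(r"<signature>(.*?)</signature>", markdown, re.S),
        "page_numbers": re.findall(r"<page_number>(.*?)</page_number>", markdown, re.S),
        "image_descriptions": re.findall(r"<img>(.*?)</img>", markdown, re.S),
        # Checkbox states are plain Unicode characters, so counting suffices.
        "checked": markdown.count("☑") + markdown.count("☒"),
        "unchecked": markdown.count("☐"),
    }

sample = (
    "<watermark>OFFICIAL COPY</watermark>\n"
    "Agreed: ☑  Declined: ☐\n"
    "<signature>J. Doe</signature>\n"
    "<page_number>9/22</page_number>"
)
tags = extract_tagged(sample)
print(tags["watermarks"], tags["page_numbers"])
```

Because the tags are plain text rather than rendered markup, this kind of extraction works on the raw model output before any markdown rendering.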

The Nanonets-OCR2 Family

| Model | Link |
|---|---|
| Nanonets-OCR2-Plus | Docstrange link |
| Nanonets-OCR2-3B | 🤗 link |
| Nanonets-OCR2-1.5B-exp | 🤗 link |

Usage

With transformers

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR2-3B"

model = AutoModelForImageTextToText.from_pretrained(
    model_path, 
    torch_dtype="auto", 
    device_map="auto", 
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
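For multi-page documents, the per-page markdown returned above can be stitched into one document. A minimal sketch that drops the <page_number> tags and joins pages (the helper name and the blank-line separator are illustrative choices, not part of the model's output contract):

```python
import re

def stitch_pages(page_outputs: list[str]) -> str:
    """Join per-page OCR markdown into one document, dropping page-number tags."""
    cleaned = []
    for page in page_outputs:
        # Remove the <page_number>...</page_number> tags the model emits.
        page = re.sub(r"<page_number>.*?</page_number>", "", page)
        cleaned.append(page.strip())
    return "\n\n".join(c for c in cleaned if c)

pages = [
    "# Contract\nFirst page text.\n<page_number>1</page_number>",
    "Second page text.\n<page_number>2</page_number>",
]
print(stitch_pages(pages))
```

Each element of `pages` would come from one call to `ocr_page_with_nanonets_s` on the corresponding page image.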

With vLLM

  1. Start the vLLM server.
vllm serve nanonets/Nanonets-OCR2-3B
  2. Use the model for prediction.
from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR2-3B"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))

With Docstrange

import requests

url = "https://extraction-api.nanonets.com/extract"
headers = {"Authorization": "<API KEY>"}  # replace with your API key

files = {"file": open("/path/to/your/file", "rb")}
data = {"output_type": "markdown"}
data["model"] = "nanonets"

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())

See Docstrange for more details.

Evaluations

Markdown Evaluation

Nanonets OCR2 Plus

| Model | Win rate vs Nanonets OCR2 Plus (%) | Loss rate vs Nanonets OCR2 Plus (%) | Both correct (%) |
|---|---|---|---|
| Gemini 2.5 flash (No Thinking) | 34.35 | 57.60 | 8.06 |
| Nanonets OCR2 3B | 29.37 | 54.58 | 16.04 |
| Nanonets-OCR-s | 24.86 | 66.12 | 9.02 |
| Nanonets OCR2 1.5B exp | 13.00 | 81.20 | 5.79 |
| GPT-5 (Thinking: low) | 23.53 | 74.86 | 1.60 |

Nanonets OCR2 3B

| Model | Win rate vs Nanonets OCR2 3B (%) | Loss rate vs Nanonets OCR2 3B (%) | Both correct (%) |
|---|---|---|---|
| Gemini 2.5 flash (No Thinking) | 39.98 | 52.43 | 7.58 |
| Nanonets-OCR-s | 30.61 | 58.28 | 11.12 |
| Nanonets OCR2 1.5B exp | 14.78 | 79.18 | 6.04 |
| GPT-5 | 25.00 | 72.87 | 2.13 |

Visual Question Answering (VQA) Evaluation

| Dataset | Nanonets OCR2 Plus | Nanonets OCR2 3B | Qwen2.5-VL-72B-Instruct | Gemini 2.5 Flash |
|---|---|---|---|---|
| ChartQA (IDP-Leaderboard) | 79.20 | 78.56 | 76.20 | 84.82 |
| DocVQA (IDP-Leaderboard) | 85.15 | 89.43 | 84.00 | 85.51 |
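To use the model for VQA rather than full-page OCR, the text portion of the user message becomes the question, with the same message structure as in the transformers example. A minimal sketch (the helper name, file path, and question text are illustrative):

```python
def build_vqa_messages(image_path: str, question: str) -> list[dict]:
    """Build a chat message list asking a question about a document image.

    The model answers from the document when the answer is present,
    and replies "Not mentioned" otherwise.
    """
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": question},
        ]},
    ]

messages = build_vqa_messages("/path/to/invoice.jpg", "What is the invoice total?")
```

The resulting `messages` list is passed to `processor.apply_chat_template` exactly as in the OCR example.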

Tips to Improve Accuracy

  1. Increasing image resolution can improve model performance.
  2. For complex tables (e.g., in financial documents), setting repetition_penalty=1 gives better results. You can also try the following prompt, which generally works better for financial documents.
user_prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
  3. This is already implemented in Docstrange; when processing table-heavy financial documents, use the Markdown (Financial Docs) option.
import requests

url = "https://extraction-api.nanonets.com/extract"
headers = {"Authorization": "<API KEY>"}  # replace with your API key

files = {"file": open("/path/to/your/file", "rb")}
data = {"output_type": "markdown-financial-docs"}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
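Tip 1 above can be applied mechanically before inference by upscaling small images. A minimal sketch that computes a resize target preserving aspect ratio (the 1536 px threshold is an illustrative choice, not a documented model requirement):

```python
def upscale_size(width: int, height: int, min_long_side: int = 1536) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at least min_long_side.

    Images that already meet the threshold are returned unchanged.
    """
    long_side = max(width, height)
    if long_side >= min_long_side:
        return width, height
    scale = min_long_side / long_side
    return round(width * scale), round(height * scale)

# Apply with PIL before passing the image to the model, e.g.:
# image = image.resize(upscale_size(*image.size), Image.LANCZOS)
print(upscale_size(800, 600))
```

Upscaling cannot add detail that a low-resolution scan lacks, but it avoids the processor downsampling small pages further.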

Citation

@misc{Nanonets-OCR2,
  title={Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Siddhant Thakuria and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}