Nanonets-OCR2, released by Nanonets, is a family of powerful, state-of-the-art image-to-markdown OCR models that go far beyond traditional text extraction. They convert documents into structured markdown with intelligent content recognition and semantic tagging, making the output well suited for downstream processing by large language models (LLMs).
Nanonets-OCR2 has several built-in features for handling complex documents with ease:

- **LaTeX equation recognition**: converts mathematical equations into LaTeX, distinguishing inline formulas (`$...$`) from display formulas (`$$...$$`).
- **Intelligent image description**: describes images in the document inside `<img>` tags so an LLM can process them; it handles many image types, including logos, charts, and graphs, detailing their content, style, and context.
- **Signature detection**: isolates signatures inside `<signature>` tags, which is critical for processing legal and business documents.
- **Watermark extraction**: places watermark text inside `<watermark>` tags.
- **Smart checkbox handling**: converts checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.

| Model | Link |
|---|---|
| Nanonets-OCR2-Plus | Docstrange link |
| Nanonets-OCR2-3B | 🤗 link |
| Nanonets-OCR2-1.5B-exp | 🤗 link |
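Because the semantic tags listed above appear as plain text in the model's markdown output, downstream code can recover them with ordinary string processing. A minimal sketch, assuming a hypothetical output snippet (the tag names match the feature list; the sample document text is made up):

```python
import re

# Hypothetical Nanonets-OCR2 output for a one-page document (made-up sample).
ocr_markdown = """<watermark>OFFICIAL COPY</watermark>
Services Agreement

Deliverables: ☑ design  ☐ implementation

<signature>J. Doe</signature>
<page_number>1/3</page_number>"""


def extract_tag(markdown: str, tag: str) -> list[str]:
    """Return the contents of every <tag>...</tag> span in the OCR output."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", markdown, flags=re.DOTALL)


watermarks = extract_tag(ocr_markdown, "watermark")   # ["OFFICIAL COPY"]
signatures = extract_tag(ocr_markdown, "signature")   # ["J. Doe"]
pages = extract_tag(ocr_markdown, "page_number")      # ["1/3"]
checked_boxes = ocr_markdown.count("☑")               # 1
```

The same pattern works for `<img>` descriptions; because the model standardizes checkbox glyphs, a simple character count is enough to tally checked items.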
Use the model with transformers:

```python
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR2-3B"

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]


image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
```

Serve the model with vLLM:

```shell
vllm serve nanonets/Nanonets-OCR2-3B
```

Then query the server through the OpenAI-compatible API:

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR2-3B"


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content


test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
```

You can also call the hosted extraction API:

```python
import requests

url = "https://extraction-api.nanonets.com/extract"
headers = {"Authorization": "<API KEY>"}
files = {"file": open("/path/to/your/file", "rb")}
data = {"output_type": "markdown"}
data["model"] = "nanonets"

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
```

See Docstrange for more details.
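Whichever backend you choose, the per-page call can be wrapped in a small batch driver for multi-page documents. A sketch under assumptions: the pages are already exported as image files, and `ocr_fn` is any callable that maps an image path to markdown (for example, one of the `ocr_page_with_nanonets_s` variants above):

```python
from pathlib import Path


def ocr_directory(image_dir: str, out_dir: str, ocr_fn) -> int:
    """Run ocr_fn on every image under image_dir, writing one .md file per page.

    ocr_fn: callable taking an image path (str) and returning markdown text.
    Returns the number of pages processed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for img in sorted(Path(image_dir).glob("*")):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue  # skip non-image files
        markdown = ocr_fn(str(img))
        (out / f"{img.stem}.md").write_text(markdown, encoding="utf-8")
        count += 1
    return count
```

Usage would look like `ocr_directory("pages/", "ocr_out/", lambda p: ocr_page_with_nanonets_s(p, model, processor))`; sorting the paths keeps page order stable.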
| Model | Win rate vs. Nanonets OCR2 Plus (%) | Loss rate vs. Nanonets OCR2 Plus (%) | Both correct (%) |
|---|---|---|---|
| Gemini 2.5 flash (No Thinking) | 34.35 | 57.60 | 8.06 |
| Nanonets OCR2 3B | 29.37 | 54.58 | 16.04 |
| Nanonets-OCR-s | 24.86 | 66.12 | 9.02 |
| Nanonets OCR2 1.5B exp | 13.00 | 81.20 | 5.79 |
| GPT-5 (Thinking: low) | 23.53 | 74.86 | 1.60 |
| Model | Win rate vs. Nanonets OCR2 3B (%) | Loss rate vs. Nanonets OCR2 3B (%) | Both correct (%) |
|---|---|---|---|
| Gemini 2.5 flash (No Thinking) | 39.98 | 52.43 | 7.58 |
| Nanonets-OCR-s | 30.61 | 58.28 | 11.12 |
| Nanonets OCR2 1.5B exp | 14.78 | 79.18 | 6.04 |
| GPT-5 | 25.00 | 72.87 | 2.13 |
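The three columns of each pairwise table appear to partition the evaluation set (the challenger wins, the reference model wins, or both are judged correct), since every row sums to roughly 100% up to rounding. A quick sanity check over the rows above:

```python
# (win %, loss %, both-correct %) rows copied from the two pairwise tables above.
rows = [
    (34.35, 57.60, 8.06),   # Gemini 2.5 Flash vs. OCR2 Plus
    (29.37, 54.58, 16.04),  # OCR2 3B vs. OCR2 Plus
    (24.86, 66.12, 9.02),   # Nanonets-OCR-s vs. OCR2 Plus
    (13.00, 81.20, 5.79),   # OCR2 1.5B exp vs. OCR2 Plus
    (23.53, 74.86, 1.60),   # GPT-5 vs. OCR2 Plus
    (39.98, 52.43, 7.58),   # Gemini 2.5 Flash vs. OCR2 3B
    (30.61, 58.28, 11.12),  # Nanonets-OCR-s vs. OCR2 3B
    (14.78, 79.18, 6.04),   # OCR2 1.5B exp vs. OCR2 3B
    (25.00, 72.87, 2.13),   # GPT-5 vs. OCR2 3B
]

# Each row should total ~100% if the three outcomes are exhaustive and disjoint.
assert all(abs(sum(row) - 100.0) < 0.1 for row in rows)
```

Reading the tables this way, a win rate below the matching loss rate means the challenger loses the head-to-head comparison against the reference model.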
| Dataset | Nanonets OCR2 Plus | Nanonets OCR2 3B | Qwen2.5-VL-72B-Instruct | Gemini 2.5 Flash |
|---|---|---|---|---|
| ChartQA (IDP-Leaderboard) | 79.20 | 78.56 | 76.20 | 84.82 |
| DocVQA (IDP-Leaderboard) | 85.15 | 89.43 | 84.00 | 85.51 |
Setting `repetition_penalty=1` can give better results. You can also try the following prompt, which often works better for financial documents:

```python
user_prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
```

In Docstrange, use the Markdown (Financial Docs) option. Through the extraction API, request it with `output_type="markdown-financial-docs"`:

```python
import requests

url = "https://extraction-api.nanonets.com/extract"
headers = {"Authorization": "<API KEY>"}
files = {"file": open("/path/to/your/file", "rb")}
data = {"output_type": "markdown-financial-docs"}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
```

```bibtex
@misc{Nanonets-OCR2,
  title={Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Siddhant Thakuria and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}
```