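The OmniDocBench evaluation environment has been packaged as a Docker image; the archive is named omnidocbench_zy_1.0.0.tar. If you use this image, you can skip the environment setup part and start directly from Section 3.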
docker run --name omnidocbench_test -itd \
-u root \
--privileged=true \
--net=host \
--shm-size=1000g \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
--device=/dev/dvpp_cmdlist \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /var/log/npu/:/usr/slog \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
-v /data_alpha/:/data_alpha/ \
-v /home/jh:/home/jh \
m.daocloud.io/quay.io/ascend/vllm-ascend:v0.11.0rc0 \
/bin/bash

Enter the container:
docker exec -it -u root omnidocbench_test bash

## Download the OmniDocBench code and install the dependencies
git clone https://github.com/opendatalab/OmniDocBench.git
cd OmniDocBench
pip install -r requirements.txt -i https://mirrors.huaweicloud.com/repository/pypi/simple --trusted-host mirrors.huaweicloud.com
## Install other dependencies
apt-get update
apt-get install libxml2-dev libxslt-dev libgl1 libglib2.0-0 latexml
pip install transformers==4.57.1
pip install xxhash-3.5.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl # xxhash must be downloaded as a whl and installed separately, because pip cannot find this package
pip install statsmodels==0.13.5 scikit-learn==1.1.3 # the versions of these two packages pinned in requirements have no cp311 whl
pip install -r /root/codes/OmniDocBench-main/requirements.txt # first remove the packages from the previous line from requirements.txt, and remove huggingface-hub as well
pip install qwen_vl_utils
pip install accelerate
pip install scikit-image==0.23.2

wget https://registry.npmmirror.com/-/binary/node/latest-v16.x/node-v16.13.1-linux-arm64.tar.gz
tar -xvf node-v16.13.1-linux-arm64.tar.gz
mkdir -p /usr/local/nodejs # make sure the target directory exists before moving files into it
mv node-v16.13.1-linux-arm64/* /usr/local/nodejs/
ln -s /usr/local/nodejs/bin/node /usr/local/bin
ln -s /usr/local/nodejs/bin/npm /usr/local/bin
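node -v

apt-get update
# Without pkg-config, ImageMagick cannot find the png package
# Without ghostscript, "gs not found" is reported and ImageMagick also reports "unknown device"
apt-get install libpng-dev pkg-config ghostscript
# Download the 7.1.1-47 source package from the releases of https://github.com/ImageMagick/ImageMagick.git; 7.1.2 does not work
## Extract the archive and enter the ImageMagick directory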
./configure
make
make install
ldconfig /usr/local/lib
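convert -version

Two installation methods for TeX Live are provided; method 2 is recommended.
# Method 1: the packaged version may be old, so the mathcolor command may not be recognized (not recommended)
apt-get update
apt-get install texlive-full
# Method 2 (recommended): install a newer version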
wget https://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz
tar xzf install-tl-unx.tar.gz
cd install-tl-*
./install-tl
fmtutil-sys --all # fixes missing-format errors such as: I can't find the format file `xelatex.fmt'!

## Download UniMERNet:
git clone https://github.com/opendatalab/UniMERNet.git
cd UniMERNet/cdm/
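pip install -r requirements.txt

1. When using CDM, the error I can't find the format file `xelatex.fmt'! appears. Run the following command to create the missing fmt files:
fmtutil-sys --all

2. KeyError: 'latex' is raised during evaluation.
The cause is that the model emitted a table in LaTeX format. The prompt asks for HTML output, but the model does not always comply. The fix is to add one import to OmniDocBench/utils/extract.py and modify part of the code, as follows: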
from utils.data_preprocess import normalized_latex_table
……
# extract latex table
latex_table_array, table_positions = extract_tex_table(content)
for latex_table, position in zip(latex_table_array, table_positions):
    position = [position[0], position[0]+len(latex_table)]   # !!!
    pred_all.append({
        'category_type': 'html_table',  # changed from latex_table to html_table
        'position': position,
        'content': normalized_latex_table(latex_table)  # use normalized_latex_table to convert the LaTeX table to HTML
    })
    content = content[:position[0]] + ' '*(position[1]-position[0]) + content[position[1]:]  # replace latex table with space
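……

On the OmniDocBench dataset, a multimodal large model must first run inference to produce Markdown results, and those results are then evaluated. There are three ways to deploy the model for inference: directly with transformers, offline with vllm, and through a vllm server API.

Here qwen2.5vl-7b with transformers is used as an example (adapted from the tools/model_infer/Qwen2VL_img2md.py script in the OmniDocBench code):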
import json
import os
import base64
import random

import numpy as np
from tqdm import tqdm
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu  # redirects torch.cuda calls to the NPU


def set_seed(seed):
    # Seed every random number generator used during generation
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed(seed)


# Local directory where the model is stored
model_dir = "/root/codes/temp_models/Qwen2.5-VL-7B-Instruct/"
# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto",
    # attn_implementation = "flash_attention_2"
)
# Load the processor
processor = AutoProcessor.from_pretrained(model_dir)
# Base directories for input and output
input_dir = '/root/modelscope/datasets/evalscope/OmniDocBench/images'
output_dir = "/root/outputs/DocParseEval/demo_data/end2end_full_qwen2.5vl7b"
os.makedirs(output_dir, exist_ok=True)
# complex prompt
prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:
1. Text Processing:
- Accurately recognize all text content in the PDF image without guessing or inferring.
- Convert the recognized text into Markdown format.
- Maintain the original document structure, including headings, paragraphs, lists, etc.
2. Mathematical Formula Processing:
- Convert all mathematical formulas to LaTeX format.
- Enclose inline formulas with $ $. For example: This is an inline formula $ E = mc^2 $
- Enclose block formulas with $$ $$. For example: $$ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $$
3. Table Processing:
- Convert tables to HTML format.
- Wrap the entire table with <table> and </table>.
4. Figure Handling:
- Ignore figures content in the PDF image. Do not attempt to describe or convert images.
5. Output Format:
- Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
- For complex layouts, try to maintain the original document's structure and format as closely as possible.
Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
'''
image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')
# Walk the directory tree and keep only image files
for root, _, files in os.walk(input_dir):
    for name in files:
        if any(name.lower().endswith(ext) for ext in image_extensions):
            # Build the full file path
            image_path = os.path.join(root, name)
            # File name without its extension
            basename = os.path.splitext(name)[0]
            # Full path of the Markdown output file
            markdown_file = os.path.join(output_dir, f"{basename}.md")
            # Skip images that already have a result, so the run can be resumed
            if os.path.exists(markdown_file):
                print(f"File already exists, skipping: {markdown_file}", flush=True)
                continue
            # Build the request message
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": image_path,
                            # "max_pixels":2048*2048
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ]
            text = processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = processor(
                text=[text],
                images=image_inputs,
                videos=video_inputs,
                padding=True,
                return_tensors="pt",
            )
            inputs = inputs.to("cuda")  # mapped to the NPU by transfer_to_npu
            set_seed(0)
            # do_sample=False means greedy decoding, so temperature is ignored
            generated_ids = model.generate(**inputs, max_new_tokens=32000, temperature=0.01, do_sample=False)
            # Drop the prompt tokens and keep only the newly generated ones
            generated_ids_trimmed = [
                out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
            ]
            output_text = processor.batch_decode(
                generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
            )
            # Write the response to the Markdown file
            with open(markdown_file, 'w', encoding='utf-8') as file:
                file.write(output_text[0])
            print(f"Saved: {markdown_file}", flush=True)

Here qwen2.5vl-7b is used as an example of direct (offline) inference with vllm:
import json
import os
import base64
import random

import numpy as np
from tqdm import tqdm
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu  # redirects torch.cuda calls to the NPU


def set_seed(seed):
    # Seed every random number generator used during generation
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed(seed)


# Image preprocessing helper: encode a file as a base64 data URL
def process_image_for_vl(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    return f"data:image/png;base64,{image_data}"


# Local directory where the model is stored
model_dir = "/root/models/Qwen/Qwen2.5-VL-7B-Instruct/"
# Load the model
llm = LLM(
    model=model_dir,
    tensor_parallel_size=1,  # 1 for a single device; increase for multiple devices
    enable_expert_parallel=False,
    # gpu_memory_utilization=0.9,
    max_model_len=32768,
    trust_remote_code=True,
    dtype="bfloat16"
)
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=32000
)
print("***********\n", sampling_params, flush=True)
# Load the processor
processor = AutoProcessor.from_pretrained(model_dir)
# Base directories for input and output
input_dir = '/root/modelscope/datasets/evalscope/OmniDocBench/images'
output_dir = "/root/outputs/DocParseEval/demo_data/test"
os.makedirs(output_dir, exist_ok=True)
# complex prompt
prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:
1. Text Processing:
- Accurately recognize all text content in the PDF image without guessing or inferring.
- Convert the recognized text into Markdown format.
- Maintain the original document structure, including headings, paragraphs, lists, etc.
2. Mathematical Formula Processing:
- Convert all mathematical formulas to LaTeX format.
- Enclose inline formulas with $ $. For example: This is an inline formula $ E = mc^2 $
- Enclose block formulas with $$ $$. For example: $$ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $$
3. Table Processing:
- Convert tables to HTML format.
- Wrap the entire table with <table> and </table>.
4. Figure Handling:
- Ignore figures content in the PDF image. Do not attempt to describe or convert images.
5. Output Format:
- Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
- For complex layouts, try to maintain the original document's structure and format as closely as possible.
Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
'''
image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')
# Walk the directory tree and keep only image files
for root, _, files in os.walk(input_dir):
    for name in files:
        if any(name.lower().endswith(ext) for ext in image_extensions):
            # Build the full file path
            image_path = os.path.join(root, name)
            # File name without its extension
            basename = os.path.splitext(name)[0]
            # Full path of the Markdown output file
            markdown_file = os.path.join(output_dir, f"{basename}.md")
            # Skip images that already have a result, so the run can be resumed
            if os.path.exists(markdown_file):
                print(f"File already exists, skipping: {markdown_file}", flush=True)
                continue
            # Build the request message
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": image_path,
                            # "max_pixels":2048*2048
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ]
            text = processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = {
                "prompt": text,
                "multi_modal_data": {
                    "image": image_inputs
                }
            }
            set_seed(0)
            outputs = llm.generate([inputs], sampling_params)
            generated_text = outputs[0].outputs[0].text
            # Write the response to the Markdown file
            with open(markdown_file, 'w', encoding='utf-8') as file:
                file.write(generated_text)
            print(f"Saved: {markdown_file}", flush=True)

# Key points:
# max-model-len: the maximum number of tokens in a single request; if the input images may be large, this must not be too small
# max_num_batched_tokens: the maximum number of tokens processed in a single step; set it larger if resources allow. If unset, it defaults to max-model-len*max-num-seqs
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 nohup vllm serve /root/models/Qwen/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--max-num-seqs 2 \
--max_num_batched_tokens 20000 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--dtype=bfloat16 \
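--served-model-name qwen2.5-vl-7b > /root/outputs/qwen2.5-vl-7b.log 2>&1 &

1. Align the temperature, top_p, top_k and repetition_penalty parameters.
2. Images must be processed with process_vision_info before being sent (the server does not call it automatically). Note: do not save the processed image to a file and then read it back as base64, because this may introduce errors.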
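As a rough illustration of the first point, the sampling parameters can be written straight into the JSON body of a chat completion request. This is a minimal sketch, not the script used in this guide: `build_chat_payload` is a hypothetical helper, the prompt text is shortened, and the image bytes are a placeholder that stands in for the in-memory PNG encoding of the image returned by process_vision_info. The parameter values are the ones the client script in this section uses.

```python
import base64
import json


def build_chat_payload(model: str, prompt: str, png_bytes: bytes) -> dict:
    """Build a chat completion request body with greedy sampling parameters."""
    # Encode the image bytes in memory; nothing is written to disk
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0,
        "top_p": 1,
        "top_k": 1,               # not part of the OpenAI schema; accepted by vLLM
        "repetition_penalty": 1,  # not part of the OpenAI schema; accepted by vLLM
    }


# Placeholder bytes for illustration only; a real call would pass PNG data
payload = build_chat_payload("qwen2.5-vl-7b", "Convert this page to Markdown.", b"placeholder")
print(json.dumps(payload)[:100])
```

With the OpenAI Python client, the last two fields cannot be passed as keyword arguments and go into `extra_body` instead, which merges them into the same request body.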
from openai import OpenAI
import httpx
import os
import json
import yaml
from argparse import ArgumentParser
from qwen_vl_utils import process_vision_info
import base64
from io import BytesIO
from PIL import Image


def parser():
    # Read the YAML config given by --cfg (defaults to Qwen2.5VL.yaml next to this script)
    argparser = ArgumentParser(usage="python %(prog)s [--cfg <config file>]")
    argparser.add_argument('--cfg', type=str,
                           default=os.path.join(os.path.dirname(__file__), 'Qwen2.5VL.yaml'),
                           help='config file')
    arg = argparser.parse_args()
    with open(arg.cfg, 'r') as f:
        config = yaml.safe_load(f)
    return config


def PIL_to_base64(image, fmt="png"):
    # Encode a PIL image in memory, without writing it to disk
    output_buffer = BytesIO()
    image.save(output_buffer, format=fmt)
    byte_data = output_buffer.getvalue()
    return base64.b64encode(byte_data).decode('utf-8')


def image_to_base64(file_path):
    with open(file_path, "rb") as f:
        image_byte = f.read()
    return base64.b64encode(image_byte).decode('utf-8')


# File extension -> MIME subtype used in the data URL
IMAGE_FORMAT = {
    "png": "png",
    "apng": "apng",
    "avif": "avif",
    "gif": "gif",
    "jpeg": "jpeg",
    "jpg": "jpeg",
    "jfif": "jpeg",
    "pjpeg": "jpeg",
    "pjp": "jpeg",
    "svg": "svg+xml",
    "webp": "webp",
}


def transform_image_format(image_ext):
    image_ext = image_ext.lower()
    assert image_ext in IMAGE_FORMAT, "unknown image extension: {}".format(image_ext)
    return IMAGE_FORMAT[image_ext]


def check_content_type(content, force_base64=True):
    # Classify one user content item as text or image and return (type, payload)
    c_type = "text"
    if isinstance(content, Image.Image):
        c_type = "image_url"
        content = "data:image/{};base64,{}".format(
            IMAGE_FORMAT["png"], PIL_to_base64(content))
    elif ";base64," in content:
        c_type = "image_url"
    else:
        image_ext = content.split(".")[-1].lower()
        if image_ext in IMAGE_FORMAT:
            c_type = "image_url"
            if force_base64:
                content = "data:image/{};base64,{}".format(
                    IMAGE_FORMAT[image_ext], image_to_base64(content))
    return c_type, content


class MLLM:
    def __init__(self, config):
        self._client = OpenAI(
            api_key=config["api_key"],
            base_url=config["base_url"],
            timeout=httpx.Timeout(7200.0, read=7200.0, write=7200.0, connect=7200.0)
        )
        self._model = config["model"]
        self._system_prompt = config.get("system_prompt", "")
        self._force_base64 = config.get("force_base64", True)

    def __call__(self, system_prompt, user_contents, **kwargs):
        verbose_mode = kwargs.get("verbose_mode", False)
        if verbose_mode:
            print("input system_prompt:\n", system_prompt, flush=True)
            print("input user_contents:\n", user_contents, flush=True)
        messages = []
        # Fall back to the system prompt from the config file
        if system_prompt is None or system_prompt == "":
            system_prompt = self._system_prompt
        if system_prompt is not None and system_prompt != "":
            messages.append({
                "role": "system",
                "content": system_prompt
            })
        user_message = None
        for u_content in user_contents:
            if u_content is None or u_content == "":
                continue
            if user_message is None:
                user_message = {
                    "role": "user",
                    "content": []
                }
            c_type, u_content = check_content_type(u_content, self._force_base64)
            user_message["content"].append(
                {
                    "type": "text",
                    "text": u_content
                } if c_type == "text" else
                {
                    "type": "image_url",
                    "image_url": {
                        "url": u_content
                    }
                }
            )
        if user_message is not None:
            messages.append(user_message)
        if verbose_mode:
            print("message:\n", messages, flush=True)
        completion = self._client.chat.completions.create(
            model=self._model,
            messages=messages,
            temperature=0,
            top_p=1,
            extra_body={  # parameters outside the OpenAI schema go here
                "repetition_penalty": 1,
                "top_k": 1
            }
        )
        if verbose_mode:
            print("completion:\n", completion, flush=True)
        return completion.choices[0].message.content


if __name__ == '__main__':
    config = parser()
    model = MLLM(config)
    # Base directories for input and output
    input_dir = '/root/modelscope/datasets/evalscope/OmniDocBench/images'
    output_dir = "/root/outputs/DocParseEval/demo_data/end2end_full_vllm_api_qwen2.5vl7b_align"
    os.makedirs(output_dir, exist_ok=True)
    # complex prompt
    prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:
1. Text Processing:
- Accurately recognize all text content in the PDF image without guessing or inferring.
- Convert the recognized text into Markdown format.
- Maintain the original document structure, including headings, paragraphs, lists, etc.
2. Mathematical Formula Processing:
- Convert all mathematical formulas to LaTeX format.
- Enclose inline formulas with $ $. For example: This is an inline formula $ E = mc^2 $
- Enclose block formulas with $$ $$. For example: $$ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $$
3. Table Processing:
- Convert tables to HTML format.
- Wrap the entire table with <table> and </table>.
4. Figure Handling:
- Ignore figures content in the PDF image. Do not attempt to describe or convert images.
5. Output Format:
- Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
- For complex layouts, try to maintain the original document's structure and format as closely as possible.
Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
'''
    image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')
    # Walk the directory tree and keep only image files
    for root, _, files in os.walk(input_dir):
        for name in files:
            if any(name.lower().endswith(ext) for ext in image_extensions):
                # Build the full file path
                image_path = os.path.join(root, name)
                # File name without its extension
                basename = os.path.splitext(name)[0]
                # Full path of the Markdown output file
                markdown_file = os.path.join(output_dir, f"{basename}.md")
                # Skip images that already have a result, so the run can be resumed
                if os.path.exists(markdown_file):
                    print(f"File already exists, skipping: {markdown_file}", flush=True)
                    continue
                # Build the message only to run process_vision_info on the image
                messages = [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "image": image_path,
                                # "max_pixels":2048*2048
                            },
                            {"type": "text", "text": prompt},
                        ],
                    }
                ]
                image_inputs, video_inputs = process_vision_info(messages)
                try:
                    print(f"Processing: {image_path}", flush=True)
                    # Send the processed PIL image directly; it is encoded in memory
                    output = model(None, [image_inputs[0], prompt])
                except Exception as e:
                    print(e, flush=True)
                    print("file failed, image_path: ", image_path, flush=True)
                    raise SystemExit(1)  # stop on the first failure with a non-zero exit code
                # Write the response to the Markdown file
                with open(markdown_file, 'w', encoding='utf-8') as file:
                    file.write(output)
                print(f"Saved: {markdown_file}", flush=True)

# Qwen2.5VL.yaml
base_url: http://71.10.29.134:8000/v1
model: qwen2.5-vl-7b
api_key: test
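force_base64: true

In configs/end2end.yaml, set the data_path under ground_truth to the path of OmniDocBench.json in the OmniDocBench dataset, set the data_path under prediction to the directory of Markdown results predicted by the multimodal large model in Section 2.2, and change CDM_plain to CDM.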
end2end_eval:
  metrics:
    text_block:
      metric:
        - Edit_dist
        # - BLEU
        # - METEOR
    display_formula:
      metric:
        - Edit_dist
        - CDM
        # - CDM_plain  # CDM can be calculated directly by calling CDM in config file if you have CDM environment
    table:
      metric:
        - TEDS
        - Edit_dist
    reading_order:
      metric:
        - Edit_dist
  dataset:
    dataset_name: end2end_dataset
    ground_truth:
      data_path: /root/modelscope/datasets/evalscope/OmniDocBench/OmniDocBench.json
    prediction:
      data_path: /root/outputs/DocParseEval/demo_data/end2end_full_vllm_local_qwen2.5vl7b
    match_method: quick_match
    # filter:
    #   language: english

After the configuration is complete, run the following command for a one-click evaluation:
python pdf_validation.py --config ./configs/end2end.yaml

The evaluation results are saved in the result directory.
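Because the inference scripts in this guide skip images whose Markdown file already exists and can be interrupted partway, the prediction directory may be incomplete when the evaluation runs. The following is a minimal sketch of a coverage check, assuming (as in those scripts) that each image `<basename>.<ext>` yields `<basename>.md` directly inside the prediction directory. The helper name `missing_predictions` and the demo file names are hypothetical.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')


def missing_predictions(image_dir: str, prediction_dir: str) -> list:
    """Return the sorted basenames of images that have no matching .md file."""
    expected = set()
    for _root, _dirs, files in os.walk(image_dir):
        for name in files:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                expected.add(os.path.splitext(name)[0])
    produced = {
        os.path.splitext(name)[0]
        for name in os.listdir(prediction_dir)
        if name.endswith(".md")
    }
    return sorted(expected - produced)


# Small demo on temporary directories with made-up file names
with tempfile.TemporaryDirectory() as images, tempfile.TemporaryDirectory() as preds:
    for name in ("page_1.png", "page_2.jpg"):
        open(os.path.join(images, name), "wb").close()
    open(os.path.join(preds, "page_1.md"), "w").close()
    print(missing_predictions(images, preds))  # ['page_2']
```

An empty list means every image has a prediction; any names returned can be regenerated by rerunning the inference script, which only processes the missing files.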
Run the tools/generate_result_tables.ipynb notebook to visualize the results:
import os
import pandas as pd
import numpy as np
import json

ocr_types_dict = {
    'end2end': 'end2end'
}
result_folder = '../result'
match_name = 'quick_match'
# overall result: not distinguishing between Chinese and English, page-level average
dict_list = []
for ocr_type in ocr_types_dict.values():
    result_path = os.path.join(result_folder, f'{ocr_type}_{match_name}_metric_result.json')
    with open(result_path, 'r') as f:
        result = json.load(f)
    save_dict = {}
    for category_type, metric in [("text_block", "Edit_dist"), ("display_formula", "CDM"), ("table", "TEDS"), ("table", "TEDS_structure_only"), ("reading_order", "Edit_dist")]:
        if metric == 'CDM' or metric == "TEDS" or metric == "TEDS_structure_only":
            if result[category_type]["page"].get(metric):
                save_dict[category_type+'_'+metric] = result[category_type]["page"][metric]["ALL"] * 100  # page-level average
            else:
                save_dict[category_type+'_'+metric] = 0
        else:
            save_dict[category_type+'_'+metric] = result[category_type]["all"][metric].get("ALL_page_avg", np.nan)
    dict_list.append(save_dict)
df = pd.DataFrame(dict_list, index=ocr_types_dict.keys()).round(3)
df['overall'] = ((1-df['text_block_Edit_dist'])*100 + df['display_formula_CDM'] + df['table_TEDS'])/3
df.to_csv('./overall.csv')

The visualized results are as follows:
| | text_block_Edit_dist | display_formula_CDM | table_TEDS | table_TEDS_structure_only | reading_order_Edit_dist | overall |
|---|---|---|---|---|---|---|
| end2end | 0.116 | 87.658 | 75.761 | 81.027 | 0.13 | 83.94 |
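Note: entries in bold are results from the official leaderboard; entries marked (A) are our own test results on Ascend. The color marks mean: aligned | not aligned | no official result, but broadly fine.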
| Model_Type | Model | size | overall | text_ED | formula_CDM | table_TEDS | table_TEDS-S | reading_order_ED |
|---|---|---|---|---|---|---|---|---|
| Specialized VLMs | HunyuanOCR | 1B | 94.10 | 0.042 | 94.73 | 91.81 | - | - |
| | HunyuanOCR(A) | 1B | 94.47 | 0.039 | 89.17 | 98.128 | 98.478 | 0.039 |
| | PaddleOCR-VL | 0.9B | 91.93 | 0.039 | 88.67 | 91.01 | 94.85 | 0.048 |
| | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | MonkeyOCR-pro-3B | 3B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | OCRVerse | 4B | 88.56 | 0.058 | 86.91 | 84.55 | 88.45 | 0.071 |
| | dots.ocr | 1.7B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | dots.ocr(A) | 1.7B | 88.2 | 0.047 | 82.79 | 86.51 | 90.4 | 0.051 |
| | MonkeyOCR-3B | 3B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | Deepseek-OCR | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.8 | 0.086 |
| | Deepseek-OCR(A,vllm-ascend) | 3B | 80.14 | 0.126 | 85.972 | 67.053 | 71.683 | 0.13 |
| General VLMs | Qwen3-VL-235B-A22B-Instruct | 235B | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| | Qwen3-VL(A,vllm-ascend) | 30B-A3B | 85.32 | 0.063 | 84.207 | 78.045 | 82.444 | 0.078 |
| | Qwen3-VL(A,vllm-ascend,api) | 30B-A3B | 85.52 | 0.061 | 84.48 | 78.169 | 82.427 | 0.076 |
| | Qwen3-VL(A,vllm-ascend) | 8B | 88.04 | 0.048 | 86.183 | 82.723 | 87.31 | 0.065 |
| | Qwen3-VL(A,vllm-ascend,api) | 8B | 88.12 | 0.048 | 86.254 | 82.904 | 87.608 | 0.067 |
| | Gemini-2.5 Pro | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| | Qwen2.5-VL | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Qwen2.5-VL(A,vllm-ascend) | 72B | 87.364 | 0.092 | 88.43 | 82.862 | 87.255 | 0.097 |
| | Qwen2.5-VL(A,transformers4.57.1) | 7B | 83.94 | 0.116 | 87.658 | 75.761 | 81.027 | 0.13 |
| | Qwen2.5-VL(A,vllm-ascend) | 7B | 80.53 | 0.169 | 85.904 | 72.584 | 77.067 | 0.171 |
| | Qwen2.5-VL(A,vllm-ascend,api) | 7B | 82.74 | 0.128 | 86.898 | 74.126 | 78.746 | 0.133 |
| | InternVL3.5 | 241B | 82.67 | 0.142 | 87.23 | 75 | 81.28 | 0.125 |
| | Pangu_mllm_v3(A) | 7B | 79.18 | 0.133 | 82.366 | 68.463 | 75.534 | 0.166 |
| | GPT-4o | - | 75.02 | 0.217 | 79.7 | 67.07 | 76.09 | 0.148 |
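The overall column follows the formula used in the notebook code: the text edit distance (0 to 1, lower is better) is converted to a 0 to 100 score and averaged with formula_CDM and table_TEDS. A small sketch that recomputes it for three of the Ascend rows in the table; `overall_score` is a hypothetical helper name, and the inputs are copied from those rows.

```python
def overall_score(text_edit_dist: float, formula_cdm: float, table_teds: float) -> float:
    """Average three 0-100 scores after flipping the edit distance into a score."""
    return ((1 - text_edit_dist) * 100 + formula_cdm + table_teds) / 3


# (text_ED, formula_CDM, table_TEDS) taken from the comparison table
rows = {
    "Qwen2.5-VL(A,transformers4.57.1) 7B": (0.116, 87.658, 75.761),
    "Qwen2.5-VL(A,vllm-ascend) 7B": (0.169, 85.904, 72.584),
    "dots.ocr(A) 1.7B": (0.047, 82.79, 86.51),
}
for name, (text_ed, cdm, teds) in rows.items():
    print(f"{name}: {overall_score(text_ed, cdm, teds):.2f}")
# Qwen2.5-VL(A,transformers4.57.1) 7B: 83.94
# Qwen2.5-VL(A,vllm-ascend) 7B: 80.53
# dots.ocr(A) 1.7B: 88.20
```

The reading-order edit distance and table_TEDS-S are reported in the table but do not enter the overall score.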
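This case builds an evaluation environment for the OmniDocBench dataset to assess the document parsing ability of multimodal large models. It provides a ready-to-use environment and method for later evaluations of this ability, which makes such evaluations considerably more efficient.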