Ascend-SACT/Qwen2.5VL-7B-Instruct

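Evaluating and Deploying Qwen2.5VL-7B-Instruct on the OmniDocBench Dataset

1. Overview and Scenario

OmniDocBench (https://github.com/opendatalab/OmniDocBench) is a benchmark for parsing diverse documents in real-world scenarios. It covers many document types, provides rich, high-quality annotations, and ships with evaluation code. It supports evaluation along several dimensions, including text OCR, table recognition, and formula recognition, which makes it a representative benchmark for the document parsing ability of multimodal large models. Using Qwen2.5VL-7B-Instruct as the example, this case study covers inference deployment of a multimodal large model on Ascend, accuracy alignment, and evaluation on the OmniDocBench dataset.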

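2. OmniDocBench Evaluation Environment Setup

The OmniDocBench evaluation environment has been packaged as a Docker image; the image archive is named omnidocbench_zy_1.0.0.tar. If you use this image, you can skip the environment setup and start directly from Section 3.

2.1 Start a container from the image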

docker run --name omnidocbench_test -itd \
-u root \
--privileged=true \
--net=host \
--shm-size=1000g \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
--device=/dev/dvpp_cmdlist \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /var/log/npu/:/usr/slog \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
-v /data_alpha/:/data_alpha/ \
-v /home/jh:/home/jh \
m.daocloud.io/quay.io/ascend/vllm-ascend:v0.11.0rc0 \
/bin/bash

Enter the container:

docker exec -it -u root omnidocbench_test bash

2.2 Install the dependencies

## Download the OmniDocBench code and install its dependencies
git clone https://github.com/opendatalab/OmniDocBench.git
cd OmniDocBench
pip install -r requirements.txt -i https://mirrors.huaweicloud.com/repository/pypi/simple --trusted-host mirrors.huaweicloud.com
## Install the remaining dependencies
apt-get update
apt-get install libxml2-dev libxslt-dev libgl1 libglib2.0-0 latexml
pip install transformers==4.57.1
pip install xxhash-3.5.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl # download the xxhash wheel separately and install it from the file; pip could not find this package
pip install statsmodels==0.13.5 scikit-learn==1.1.3 # the versions pinned in requirements.txt have no cp311 wheels
pip install -r /root/codes/OmniDocBench-main/requirements.txt # first remove the packages installed on the previous line, and huggingface-hub, from requirements.txt
pip install qwen_vl_utils
pip install accelerate
pip install scikit-image==0.23.2
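
Because several packages are pinned by hand above, it can help to confirm what actually got installed before running an evaluation. The following is a minimal sketch using only the standard library; the pin list is copied from the commands in this section, and the helper names `installed_versions` and `mismatches` are illustrative, not part of OmniDocBench.

```python
from importlib import metadata

# Pins taken from the install commands in this section.
EXPECTED = {
    "transformers": "4.57.1",
    "statsmodels": "0.13.5",
    "scikit-learn": "1.1.3",
    "scikit-image": "0.23.2",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected, found):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    for name, (want, got) in mismatches(EXPECTED, installed_versions(EXPECTED)).items():
        print(f"{name}: expected {want}, found {got}")
```

The script prints nothing when every pin is satisfied.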

2.3 Configure the CDM environment

Step 1: Install Node.js

wget https://registry.npmmirror.com/-/binary/node/latest-v16.x/node-v16.13.1-linux-arm64.tar.gz
tar -xvf node-v16.13.1-linux-arm64.tar.gz
mkdir -p /usr/local/nodejs
mv node-v16.13.1-linux-arm64/* /usr/local/nodejs/
ln -s /usr/local/nodejs/bin/node /usr/local/bin
ln -s /usr/local/nodejs/bin/npm /usr/local/bin
node -v

Step 2: Install ImageMagick

apt-get update
# Without pkg-config, ImageMagick cannot find the PNG library
# Without ghostscript you get "gs not found", and ImageMagick also reports "unknown device"
apt-get install libpng-dev pkg-config ghostscript
# Download the 7.1.1-47 source package from the releases of https://github.com/ImageMagick/ImageMagick.git (7.1.2 does not work)
## Extract the archive and enter the ImageMagick directory
./configure
make
make install
ldconfig /usr/local/lib
convert --version

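Step 3: Install TeX Live

Two installation methods are given; method 2 is recommended.

# Method 1 (not recommended): the packaged version may be old and fail to recognize the mathcolor command
apt-get update
apt-get install texlive-full

# Method 2 (recommended): install a current release
wget https://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz
tar xzf install-tl-unx.tar.gz
cd install-tl-*
./install-tl
fmtutil-sys --all # fixes missing-format errors such as: I can't find the format file `xelatex.fmt'!

Step 4: Install the Python dependencies

## Download UniMERNet: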
git clone https://github.com/opendatalab/UniMERNet.git
cd UniMERNet/cdm/
pip install -r requirements.txt
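
Most CDM failures described in this section come down to a command-line tool that is missing or not on PATH. The sketch below checks for the executables that the steps above install or mention (`node`, `convert`, `gs`, `xelatex`, `fmtutil-sys`); the list is an assumption drawn from this section rather than an official CDM requirement list, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables installed or mentioned in the CDM steps above; adjust to your setup.
REQUIRED_TOOLS = ["node", "convert", "gs", "xelatex", "fmtutil-sys"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All CDM command-line tools were found on PATH.")
```

If `xelatex` is reported missing after method 2, check PATH first: install-tl places its binaries under a versioned TeX Live directory that is not on PATH by default.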

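2.4 Common problems

1. CDM fails with I can't find the format file `xelatex.fmt'!: run the following command to create the missing fmt files

fmtutil-sys --all

2. The evaluation fails with KeyError: 'latex'

The cause is that the model outputs tables in LaTeX format. Although the prompt asks for HTML output, the model does not always comply. The fix is to add one import to OmniDocBench/utils/extract.py and modify part of the code, as follows: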

from utils.data_preprocess import normalized_latex_table
...
    # extract latex table
    latex_table_array, table_positions = extract_tex_table(content)
    for latex_table, position in zip(latex_table_array, table_positions):
        position = [position[0], position[0]+len(latex_table)] # !!!
        pred_all.append({
            'category_type': 'html_table', # changed from latex_table to html_table
            'position': position,
            'content': normalized_latex_table(latex_table) # convert the LaTeX table to HTML with normalized_latex_table
        })
        content = content[:position[0]] + ' '*(position[1]-position[0]) + content[position[1]:] # replace latex table with space
...
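
The last line of the patch overwrites each extracted table with spaces of the same length, so the character offsets recorded for other elements stay valid. The standalone sketch below illustrates only that masking idea. The regular expression and the function name `mask_latex_tables` are illustrative assumptions; OmniDocBench's own `extract_tex_table` has its own matching logic.

```python
import re

# Illustrative pattern only; not the matching logic used by OmniDocBench.
TABULAR = re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.DOTALL)


def mask_latex_tables(content):
    """Return (tables, masked_content) with each table replaced by spaces."""
    tables = []
    # finditer scans the original string; masking keeps the length unchanged,
    # so the spans it reports remain valid for the masked copy as well.
    for match in TABULAR.finditer(content):
        start, end = match.span()
        tables.append({"position": [start, end], "content": match.group(0)})
        content = content[:start] + " " * (end - start) + content[end:]
    return tables, content
```

Because the masked text has the same length as the input, a position such as `[start, end]` can be used against either string.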

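3 Multimodal model inference deployment and accuracy alignment

On the OmniDocBench dataset, the multimodal model first generates Markdown results by inference, and those results are then evaluated. There are three ways to deploy the model for inference:

3.1 Direct inference with transformers (slower)

The example below uses qwen2.5vl-7b (adapted from the tools/model_infer/Qwen2VL_img2md.py script in the OmniDocBench repository):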

import os

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
import torch
import torch_npu  # Ascend NPU backend for PyTorch
from torch_npu.contrib import transfer_to_npu  # redirects torch.cuda calls and "cuda" devices to the NPU

def set_seed(seed):
    import random
    import numpy as np
    import torch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed(seed)


# Local directory containing the model weights
model_dir =  "/root/codes/temp_models/Qwen2.5-VL-7B-Instruct/"

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto",
    # attn_implementation = "flash_attention_2"
)
# Load the processor
processor = AutoProcessor.from_pretrained(model_dir)

# Input (page images) and output (Markdown) directories
input_dir = '/root/modelscope/datasets/evalscope/OmniDocBench/images'
output_dir = "/root/outputs/DocParseEval/demo_data/end2end_full_qwen2.5vl7b"


os.makedirs(output_dir, exist_ok=True)

# complex prompt
prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:

        1. Text Processing:
        - Accurately recognize all text content in the PDF image without guessing or inferring.
        - Convert the recognized text into Markdown format.
        - Maintain the original document structure, including headings, paragraphs, lists, etc.

        2. Mathematical Formula Processing:
        - Convert all mathematical formulas to LaTeX format.
        - Enclose inline formulas with $ $. For example: This is an inline formula $ E = mc^2 $
        - Enclose block formulas with \\[ \\]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

        3. Table Processing:
        - Convert tables to HTML format.
        - Wrap the entire table with <table> and </table>.

        4. Figure Handling:
        - Ignore figures content in the PDF image. Do not attempt to describe or convert images.

        5. Output Format:
        - Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
        - For complex layouts, try to maintain the original document's structure and format as closely as possible.

        Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
        '''

image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')

# Walk the input directory tree and keep only image files
for root, _, files in os.walk(input_dir):
    for name in files:
        if any(name.lower().endswith(ext) for ext in image_extensions):

            # Full path of the image
            image_path = os.path.join(root, name)

            # File name without its extension
            basename = os.path.splitext(name)[0]
            # Full path of the Markdown output file
            markdown_file = os.path.join(output_dir, f"{basename}.md")

            # Skip pages that already have an output file
            if os.path.exists(markdown_file):
                print(f"File already exists, skipping: {markdown_file}", flush=True)
                continue

            # Build the request message
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": image_path,
                            # "max_pixels":2048*2048
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ]

            text = processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = processor(
                text=[text],
                images=image_inputs,
                videos=video_inputs,
                padding=True,
                return_tensors="pt",
            )
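            # With transfer_to_npu imported, "cuda" is redirected to the Ascend NPU
            inputs = inputs.to("cuda")

            set_seed(0)
            # do_sample=False selects greedy decoding, so temperature has no effect here
            generated_ids = model.generate(**inputs, max_new_tokens=32000, temperature=0.01, do_sample=False)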
            generated_ids_trimmed = [
                out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
            ]
            output_text = processor.batch_decode(
                generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
            )

            # Write the response to the Markdown file (markdown_file was built above)
            with open(markdown_file, 'w', encoding='utf-8') as file:
                file.write(output_text[0])
                print(f"Saved: {markdown_file}", flush=True)

3.2 Direct (offline) inference with vLLM

The example below uses qwen2.5vl-7b and runs inference directly through vLLM:

import base64
import os

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import torch_npu  # Ascend NPU backend for PyTorch
from torch_npu.contrib import transfer_to_npu  # redirects torch.cuda calls and "cuda" devices to the NPU

def set_seed(seed):
    import random
    import numpy as np
    import torch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed(seed)

# Image helper: encode an image file as a base64 data URL (not called in this script)
def process_image_for_vl(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    return f"data:image/png;base64,{image_data}"

# Local directory containing the model weights
model_dir =  "/root/models/Qwen/Qwen2.5-VL-7B-Instruct/"

# Load the model
llm = LLM(
    model=model_dir,
    tensor_parallel_size=1,  # 1 for a single card; increase for multiple cards
    enable_expert_parallel=False,
    # gpu_memory_utilization=0.9,
    max_model_len=32768,
    trust_remote_code=True,
    dtype="bfloat16"
)
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=32000
)
print("***********\n",sampling_params,flush=True)
# Load the processor
processor = AutoProcessor.from_pretrained(model_dir)

# Input (page images) and output (Markdown) directories
input_dir = '/root/modelscope/datasets/evalscope/OmniDocBench/images'
output_dir = "/root/outputs/DocParseEval/demo_data/test"

os.makedirs(output_dir, exist_ok=True)

# complex prompt
prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:

        1. Text Processing:
        - Accurately recognize all text content in the PDF image without guessing or inferring.
        - Convert the recognized text into Markdown format.
        - Maintain the original document structure, including headings, paragraphs, lists, etc.

        2. Mathematical Formula Processing:
        - Convert all mathematical formulas to LaTeX format.
        - Enclose inline formulas with $ $. For example: This is an inline formula $ E = mc^2 $
        - Enclose block formulas with \\[ \\]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

        3. Table Processing:
        - Convert tables to HTML format.
        - Wrap the entire table with <table> and </table>.

        4. Figure Handling:
        - Ignore figures content in the PDF image. Do not attempt to describe or convert images.

        5. Output Format:
        - Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
        - For complex layouts, try to maintain the original document's structure and format as closely as possible.

        Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
        '''

image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')

# Walk the input directory tree and keep only image files
for root, _, files in os.walk(input_dir):
    for name in files:
        if any(name.lower().endswith(ext) for ext in image_extensions):

            # Full path of the image
            image_path = os.path.join(root, name)

            # File name without its extension
            basename = os.path.splitext(name)[0]
            # Full path of the Markdown output file
            markdown_file = os.path.join(output_dir, f"{basename}.md")

            # Skip pages that already have an output file
            if os.path.exists(markdown_file):
                print(f"File already exists, skipping: {markdown_file}", flush=True)
                continue

            # Build the request message
            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": image_path,
                            # "max_pixels":2048*2048
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ]

            text = processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = {
                "prompt": text,
                "multi_modal_data": {
                    "image": image_inputs
                }
            }

            set_seed(0)
            outputs = llm.generate([inputs], sampling_params)
            generated_text = outputs[0].outputs[0].text

            # Write the response to the Markdown file (markdown_file was built above)
            with open(markdown_file, 'w', encoding='utf-8') as file:
                file.write(generated_text)
                print(f"Saved: {markdown_file}", flush=True)

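3.3 Served inference with vLLM and accuracy alignment

1. Start the service

# Key points:
# max-model-len: the maximum number of tokens for a single request; it must not be too small if request images may be large
# max_num_batched_tokens: the maximum number of tokens processed in a single batch; set it larger if resources allow. If unset, it defaults to max-model-len*max-num-seqs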
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 nohup vllm serve /root/models/Qwen/Qwen2.5-VL-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-num-seqs 2 \
    --max_num_batched_tokens 20000 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --dtype=bfloat16 \
    --served-model-name qwen2.5-vl-7b > /root/outputs/qwen2.5-vl-7b.log 2>&1 &
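
Since the service is started in the background with nohup, it is worth confirming that it has finished loading before sending evaluation requests. The sketch below queries the OpenAI-compatible `/v1/models` endpoint and returns the served model ids. The helper names are hypothetical, and the base URL you pass in is an assumption that must match the `--host` and `--port` used above.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Extract model ids from the JSON body of an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url, timeout=10.0):
    """Query <base_url>/models and return the served model ids."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

For the command above, `fetch_model_ids("http://127.0.0.1:8000/v1")` should include `qwen2.5-vl-7b` once the model is loaded; a connection error usually means the server is still starting, and the log file given in the command shows its progress.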

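2. API client script for accuracy alignment

Two points matter for alignment: (1) use the same temperature, top_p, top_k, and repetition_penalty values as the offline run; (2) pass the image through process_vision_info before sending it, because the service does not call it internally. Do not save the processed image to a file and then re-read it as base64, since that may introduce deviations.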

from openai import OpenAI
import httpx
import os
import json
import yaml
from argparse import ArgumentParser
from qwen_vl_utils import process_vision_info
import base64
from io import BytesIO
from PIL import Image

def parser():
    argparser = ArgumentParser(usage="python %(prog)s [--cfg <config file>]")
    argparser.add_argument('--cfg', type=str,
                            default=os.path.join(os.path.dirname(__file__),'Qwen2.5VL.yaml'),
                            help='config file')
    arg = argparser.parse_args()
    with open(arg.cfg,'r') as f:
        # safe_load is sufficient for this plain key/value config
        config = yaml.safe_load(f)
    return config

def PIL_to_base64(image,fmt="png"):
    output_buffer = BytesIO()
    image.save(output_buffer,format=fmt)
    byte_data = output_buffer.getvalue()
    return base64.b64encode(byte_data).decode('utf-8')
def image_to_base64(file_path):
    with open(file_path, "rb") as f:
        image_byte = f.read()
    return base64.b64encode(image_byte).decode('utf-8')

IMAGE_FORMAT = {
    "png": "png",
    "apng": "apng",
    "avif": "avif",
    "gif": "gif",
    "jpeg": "jpeg",
    "jpg": "jpeg",
    "jfif": "jpeg",
    "pjpeg": "jpeg",
    "pjp": "jpeg",
    "svg": "svg+xml",
    "webp": "webp",
}

def transform_image_format(image_ext):
    image_ext = image_ext.lower()
    assert image_ext in IMAGE_FORMAT, "unknown image extension: {}".format(image_ext)
    return IMAGE_FORMAT[image_ext]

def check_content_type(content, force_base64=True):
    c_type = "text"
    if isinstance(content, Image.Image):
        c_type = "image_url"
        content = "data:image/{};base64,{}".format(
                IMAGE_FORMAT["png"],PIL_to_base64(content))
    elif ";base64," in content:
        c_type = "image_url"
    else:
        image_ext = content.split(".")[-1].lower()
        if image_ext in IMAGE_FORMAT:
            c_type = "image_url"
            if force_base64:
                content = "data:image/{};base64,{}".format(
                        IMAGE_FORMAT[image_ext],image_to_base64(content))
    return c_type, content

class MLLM:
    def __init__(self, config):
        self._client = OpenAI(
            api_key=config["api_key"],
            base_url=config["base_url"],
            timeout=httpx.Timeout(7200.0, read=7200.0, write=7200.0, connect=7200.0)
        )
        self._model = config["model"]
        self._system_prompt = config.get("system_prompt","")
        self._force_base64 = config.get("force_base64",True)

    def __call__(self, system_prompt, user_contents, **kwargs):
        verbose_mode = kwargs.get("verbose_mode", False)
        if verbose_mode:
            print("input system_prompt:\n",system_prompt,flush=True)
            print("input user_contents:\n",user_contents,flush=True)
        messages = []
        if system_prompt is None or system_prompt=="":
            system_prompt = self._system_prompt
        if system_prompt is not None and system_prompt!="":
            messages.append({
                "role": "system",
                "content": system_prompt
            })
        user_message = None
        for u_content in user_contents:
            if u_content is None or u_content=="":
                continue
            if user_message is None:
                user_message = {
                    "role": "user",
                    "content": []
                }
            c_type, u_content = check_content_type(u_content,self._force_base64)
            user_message["content"].append(
                {
                    "type": "text",
                    "text": u_content
                } if c_type=="text" else \
                {
                    "type": "image_url",
                    "image_url": {
                        "url": u_content
                    }
                }
            )
        if user_message is not None:
            messages.append(user_message)
        if verbose_mode:
            print("message:\n",messages,flush=True)
        completion = self._client.chat.completions.create(
            model=self._model,
            messages=messages,
            temperature=0,
            top_p=1,
            extra_body={  # parameters that the OpenAI client does not accept directly go here
                "repetition_penalty": 1,
                "top_k": 1
            }
        )
        if verbose_mode:
            print("completion:\n",completion,flush=True)
        return completion.choices[0].message.content

if __name__=='__main__':
    config = parser()
    model = MLLM(config)
    # Input (page images) and output (Markdown) directories
    input_dir = '/root/modelscope/datasets/evalscope/OmniDocBench/images'
    output_dir = "/root/outputs/DocParseEval/demo_data/end2end_full_vllm_api_qwen2.5vl7b_align"

    os.makedirs(output_dir, exist_ok=True)

    # complex prompt
    prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:

            1. Text Processing:
            - Accurately recognize all text content in the PDF image without guessing or inferring.
            - Convert the recognized text into Markdown format.
            - Maintain the original document structure, including headings, paragraphs, lists, etc.

            2. Mathematical Formula Processing:
            - Convert all mathematical formulas to LaTeX format.
            - Enclose inline formulas with $ $. For example: This is an inline formula $ E = mc^2 $
            - Enclose block formulas with \\[ \\]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

            3. Table Processing:
            - Convert tables to HTML format.
            - Wrap the entire table with <table> and </table>.

            4. Figure Handling:
            - Ignore figures content in the PDF image. Do not attempt to describe or convert images.

            5. Output Format:
            - Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
            - For complex layouts, try to maintain the original document's structure and format as closely as possible.

            Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
            '''

    image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp')

    # Walk the input directory tree and keep only image files
    for root, _, files in os.walk(input_dir):
        for name in files:
            if any(name.lower().endswith(ext) for ext in image_extensions):

                # Full path of the image
                image_path = os.path.join(root, name)

                # File name without its extension
                basename = os.path.splitext(name)[0]
                # Full path of the Markdown output file
                markdown_file = os.path.join(output_dir, f"{basename}.md")

                # Skip pages that already have an output file
                if os.path.exists(markdown_file):
                    print(f"File already exists, skipping: {markdown_file}", flush=True)
                    continue

                # Build the request message
                messages = [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "image": image_path,
                                # "max_pixels":2048*2048
                            },
                            {"type": "text", "text": prompt},
                        ],
                    }
                ]

                image_inputs, video_inputs = process_vision_info(messages)
                try:
                    print(f"Processing: {image_path}", flush=True)
                    output = model(None,[image_inputs[0],prompt])
                except Exception as e:
                    print(e,flush=True)
                    print("file failed, image_path: ", image_path, flush=True)
                    import sys
                    sys.exit()
                
                # Write the response to the Markdown file
                with open(markdown_file, 'w', encoding='utf-8') as file:
                    file.write(output)
                    print(f"Saved: {markdown_file}", flush=True)

Service configuration:

# Qwen2.5VL.yaml
base_url: http://71.10.29.134:8000/v1
model: qwen2.5-vl-7b
api_key: test
force_base64: true
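
One way to judge whether the served run is aligned with the offline run is to compare the Markdown files the two runs wrote for the same pages. The following is a minimal sketch under that assumption, using only the standard library; `compare_runs` is a hypothetical helper and the directory arguments are whichever `output_dir` values the two scripts used.

```python
import difflib
import os


def compare_runs(dir_a, dir_b):
    """Compare same-named .md files in two output directories.

    Returns (identical, differing, only_in_a), where `differing` maps a file
    name to a difflib similarity ratio in [0, 1].
    """
    identical, differing, only_in_a = [], {}, []
    for name in sorted(os.listdir(dir_a)):
        if not name.endswith(".md"):
            continue
        path_b = os.path.join(dir_b, name)
        if not os.path.exists(path_b):
            only_in_a.append(name)
            continue
        with open(os.path.join(dir_a, name), encoding="utf-8") as fa:
            text_a = fa.read()
        with open(path_b, encoding="utf-8") as fb:
            text_b = fb.read()
        if text_a == text_b:
            identical.append(name)
        else:
            matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
            differing[name] = matcher.ratio()
    return identical, differing, only_in_a
```

Pages with a low ratio are the first ones to inspect. `SequenceMatcher` can be slow on very long pages, so the exact-match check runs first and the ratio is computed only for files that differ.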

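4 End-to-end evaluation on the OmniDocBench dataset

4.1 Configure configs/end2end.yaml

Set the ground_truth data_path to the path of OmniDocBench.json in the OmniDocBench dataset, set the prediction data_path to the directory holding the Markdown results predicted by the multimodal model in Section 3, and change CDM_plain to CDM.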

end2end_eval:
  metrics:
    text_block:
      metric:
        - Edit_dist
        # - BLEU
        # - METEOR
    display_formula:
      metric:
        - Edit_dist
        - CDM
        # - CDM_plain  # CDM can be calculated directly by calling CDM in config file if you have CDM environment
    table:
      metric:
        - TEDS
        - Edit_dist
    reading_order:
      metric:
        - Edit_dist
  dataset:
    dataset_name: end2end_dataset
    ground_truth:
      data_path: /root/modelscope/datasets/evalscope/OmniDocBench/OmniDocBench.json
    prediction:
      data_path: /root/outputs/DocParseEval/demo_data/end2end_full_vllm_local_qwen2.5vl7b
    match_method: quick_match
    # filter: 
    #   language: english
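
Before starting a long evaluation, it can be useful to confirm that the prediction directory actually covers the ground-truth pages. The sketch below assumes that each entry in OmniDocBench.json carries `page_info.image_path` and that a prediction is expected to be named after the image with an `.md` extension, as the inference scripts in Section 3 write them. Both points are assumptions to verify against your copy of the dataset and evaluation code, and `pages_without_prediction` is a hypothetical helper.

```python
import json
import os


def pages_without_prediction(gt_json_path, pred_dir):
    """List ground-truth page images that have no matching .md prediction."""
    with open(gt_json_path, encoding="utf-8") as f:
        pages = json.load(f)
    missing = []
    for page in pages:
        # Assumed layout: every entry has page_info.image_path.
        image_name = os.path.basename(page["page_info"]["image_path"])
        stem = os.path.splitext(image_name)[0]
        if not os.path.exists(os.path.join(pred_dir, stem + ".md")):
            missing.append(image_name)
    return missing
```

An empty list means every ground-truth page has a prediction file; otherwise the returned image names are the pages to re-run.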

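4.2 Run the evaluation

After the configuration is complete, run the following command to start the evaluation:

python pdf_validation.py --config ./configs/end2end.yaml

The evaluation results are saved in the result directory.

4.3 Visualize the evaluation results

Run the tools/generate_result_tables.ipynb notebook to visualize the results: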

import os
import pandas as pd
import numpy as np
import json

ocr_types_dict = {
    'end2end': 'end2end'
}

result_folder = '../result'

match_name = 'quick_match'
# overall result: not distinguishing between Chinese and English, page-level average
dict_list = []

for ocr_type in ocr_types_dict.values():
    result_path = os.path.join(result_folder, f'{ocr_type}_{match_name}_metric_result.json')

    with open(result_path, 'r') as f:
        result = json.load(f)    
    save_dict = {}
    for category_type, metric in [("text_block", "Edit_dist"), ("display_formula", "CDM"), ("table", "TEDS"), ("table", "TEDS_structure_only"), ("reading_order", "Edit_dist")]:
        if metric == 'CDM' or metric == "TEDS" or metric == "TEDS_structure_only":
            if result[category_type]["page"].get(metric):
                save_dict[category_type+'_'+metric] = result[category_type]["page"][metric]["ALL"] * 100   # page-level average
            else:
                save_dict[category_type+'_'+metric] = 0
        else:
            save_dict[category_type+'_'+metric] = result[category_type]["all"][metric].get("ALL_page_avg", np.nan)
    dict_list.append(save_dict)
    
df = pd.DataFrame(dict_list, index=ocr_types_dict.keys()).round(3)
df['overall'] = ((1-df['text_block_Edit_dist'])*100 + df['display_formula_CDM'] + df['table_TEDS'])/3
df.to_csv('./overall.csv')

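The resulting table is shown below:

| | text_block_Edit_dist | display_formula_CDM | table_TEDS | table_TEDS_structure_only | reading_order_Edit_dist | overall |
|---|---|---|---|---|---|---|
| end2end | 0.116 | 87.658 | 75.761 | 81.027 | 0.13 | 83.94 |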
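
The `overall` column combines three metrics that point in different directions: edit distance is better when lower, while CDM and TEDS are better when higher. The short example below restates the formula from the notebook code and applies it to the Qwen2.5-VL 7B transformers row of the summary table in Section 5.1; the function name `overall_score` is illustrative.

```python
def overall_score(text_edit_dist, formula_cdm, table_teds):
    """Mean of (1 - text edit distance) * 100, formula CDM, and table TEDS."""
    return ((1 - text_edit_dist) * 100 + formula_cdm + table_teds) / 3


# Values from the Qwen2.5-VL 7B (transformers 4.57.1) run reported in this document.
print(round(overall_score(0.116, 87.658, 75.761), 2))  # prints 83.94
```

The edit distance is converted to a 0-100 accuracy-style score before averaging, so a text edit distance of 0.116 contributes 88.4 points.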

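5 Results and value

5.1 Summary of multimodal model results on OmniDocBench

Note: rows in bold are results from the official leaderboard; rows marked (A) are our results measured on Ascend. The color coding means: aligned | not aligned | no official result, but broadly as expected.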

| Model_Type | Model | size | overall | text_ED | formula_CDM | table_TEDS | table_TEDS-S | reading_order_ED |
|---|---|---|---|---|---|---|---|---|
| Specialized VLMs | HunyuanOCR | 1B | 94.10 | 0.042 | 94.73 | 91.81 | - | - |
| | HunyuanOCR(A) | 1B | 94.47 | 0.039 | 89.17 | 98.128 | 98.478 | 0.039 |
| | PaddleOCR-VL | 0.9B | 91.93 | 0.039 | 88.67 | 91.01 | 94.85 | 0.048 |
| | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | MonkeyOCR-pro-3B | 3B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | OCRVerse | 4B | 88.56 | 0.058 | 86.91 | 84.55 | 88.45 | 0.071 |
| | dots.ocr | 1.7B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | dots.ocr(A) | 1.7B | 88.2 | 0.047 | 82.79 | 86.51 | 90.4 | 0.051 |
| | MonkeyOCR-3B | 3B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | Deepseek-OCR | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.8 | 0.086 |
| | Deepseek-OCR(A,vllm-ascend) | 3B | 80.14 | 0.126 | 85.972 | 67.053 | 71.683 | 0.13 |
| General VLMs | Qwen3-VL-235B-A22B-Instruct | 235B | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| | Qwen3-VL(A,vllm-ascend) | 30B-A3B | 85.32 | 0.063 | 84.207 | 78.045 | 82.444 | 0.078 |
| | Qwen3-VL(A,vllm-ascend,api) | 30B-A3B | 85.52 | 0.061 | 84.48 | 78.169 | 82.427 | 0.076 |
| | Qwen3-VL(A,vllm-ascend) | 8B | 88.04 | 0.048 | 86.183 | 82.723 | 87.31 | 0.065 |
| | Qwen3-VL(A,vllm-ascend,api) | 8B | 88.12 | 0.048 | 86.254 | 82.904 | 87.608 | 0.067 |
| | Gemini-2.5 Pro | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| | Qwen2.5-VL | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Qwen2.5-VL(A,vllm-ascend) | 72B | 87.364 | 0.092 | 88.43 | 82.862 | 87.255 | 0.097 |
| | Qwen2.5-VL(A,transformers4.57.1) | 7B | 83.94 | 0.116 | 87.658 | 75.761 | 81.027 | 0.13 |
| | Qwen2.5-VL(A,vllm-ascend) | 7B | 80.53 | 0.169 | 85.904 | 72.584 | 77.067 | 0.171 |
| | Qwen2.5-VL(A,vllm-ascend,api) | 7B | 82.74 | 0.128 | 86.898 | 74.126 | 78.746 | 0.133 |
| | InternVL3.5 | 241B | 82.67 | 0.142 | 87.23 | 75 | 81.28 | 0.125 |
| | Pangu_mllm_v3(A) | 7B | 79.18 | 0.133 | 82.366 | 68.463 | 75.534 | 0.166 |
| | GPT-4o | - | 75.02 | 0.217 | 79.7 | 67.07 | 76.09 | 0.148 |

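5.2 Value

This case study builds an OmniDocBench evaluation environment for assessing the document parsing ability of multimodal large models. It gives later evaluations a ready-to-use environment and procedure, which makes evaluating the document parsing ability of such models more efficient.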