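MinerU-Diffusion reframes document OCR as an inverse rendering problem and replaces slow, error-prone autoregressive decoding with parallel diffusion decoding.
By introducing block-wise diffusion and uncertainty-driven curriculum learning, the method achieves up to 3.2x faster decoding while improving robustness and reducing reliance on language priors.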
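Key advantage: MinerU-Diffusion strikes a strong balance between accuracy and efficiency, and threshold control gives a flexible accuracy-throughput trade-off.
Compared with MinerU2.5, TPS improves by up to 3.26x, with practical operating points such as a 2.12x speedup at 99.9% relative accuracy and a 3.01x speedup at 98.8% relative accuracy.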
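To make the threshold-controlled trade-off concrete, the following toy sketch fills a sequence block by block and, within each block, commits every masked position whose confidence clears a threshold in the same step. It is only an illustration under stated assumptions, not the MinerU-Diffusion implementation: `toy_predict` is a hypothetical stand-in that returns random confidences instead of calling a model, and the names `decode_block` and `decode` are invented for this example.

```python
import random

MASK = None  # marker for a position that has not been decoded yet


def toy_predict(position, rng):
    """Hypothetical stand-in for a model call: returns (token, confidence)."""
    return f"tok{position}", rng.random()  # confidence is uniform in [0, 1)


def decode_block(start, size, threshold, rng):
    """Fill one block of `size` masked positions; return (tokens, steps)."""
    block = [MASK] * size
    steps = 0
    while any(tok is MASK for tok in block):
        steps += 1
        # Predict all still-masked positions of the block in one step.
        preds = {i: toy_predict(start + i, rng)
                 for i, tok in enumerate(block) if tok is MASK}
        # Commit every prediction whose confidence clears the threshold.
        accepted = [i for i, (_, conf) in preds.items() if conf >= threshold]
        # Always commit the most confident one so each step makes progress.
        if not accepted:
            accepted = [max(preds, key=lambda i: preds[i][1])]
        for i in accepted:
            block[i] = preds[i][0]
    return block, steps


def decode(gen_length, block_length, threshold, seed=0):
    """Decode blocks left to right; return (tokens, total decoding steps)."""
    rng = random.Random(seed)
    tokens, total_steps = [], 0
    for start in range(0, gen_length, block_length):
        size = min(block_length, gen_length - start)
        block, steps = decode_block(start, size, threshold, rng)
        tokens.extend(block)
        total_steps += steps
    return tokens, total_steps


if __name__ == "__main__":
    for threshold in (0.0, 0.5, 0.95, 1.1):
        tokens, steps = decode(gen_length=128, block_length=32, threshold=threshold)
        print(f"threshold={threshold}: {len(tokens)} tokens in {steps} steps")
```

In this toy, a threshold of 0.0 commits a whole block per step (4 steps for 128 tokens in blocks of 32), while a threshold above 1.0 can never be cleared and degenerates to one token per step (128 steps). Values in between trade decoding steps for confidence, which mirrors the accuracy-throughput knob described above; real speedups depend on the model's actual confidence distribution.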
Use Python 3.12.12 with the following dependency versions:
- Python 3.12.12
- torch 2.8.0+cu128
- torchvision 0.23.0+cu128
- torchaudio 2.8.0+cu128
- transformers >= 4.52.1
- triton 3.4.0
- flash-attn 2.8.3
- liger-kernel 0.6.4

Installation commands:
pip install --upgrade pip
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=4.52.1"
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install triton==3.4.0 liger-kernel==0.6.4

import torch
from transformers import AutoModel, AutoProcessor, AutoTokenizer
model_id = "Niujunbo2002/MinerU-Diffusion-V1-0320-2.5B"
image_path = "path/to/page.png"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=False,
)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval().to("cuda")
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "\nText Recognition:"}]},
]
prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# If a tuple comes back, keep only its first element as the prompt text.
if isinstance(prompt_text, tuple):
    prompt_text = prompt_text[0]
inputs = processor(
    images=[image_path],
    text=prompt_text,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
# Move the inputs to the GPU, casting token IDs to int64 and pixels to bfloat16.
input_ids = inputs["input_ids"].to(torch.long).to("cuda")
pixel_values = inputs["pixel_values"].to(torch.bfloat16).to("cuda")
image_grid_thw = inputs.get("image_grid_thw")
if image_grid_thw is not None:
    image_grid_thw = image_grid_thw.to(torch.long).to("cuda")
# Run decoding without tracking gradients.
with torch.no_grad():
    generate_outputs = model.generate(
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        input_ids=input_ids,
        mask_token_id=tokenizer.convert_tokens_to_ids("<|MASK|>"),
        denoising_steps=32,
        gen_length=1024,
        block_length=32,
        temperature=1.0,
        remasking_strategy="low_confidence_dynamic",
        dynamic_threshold=0.95,
        tokenizer=tokenizer,
        stopping_criteria=["<|endoftext|>", "<|im_end|>"],
    )
# generate() may return a tuple; take its first element as the token IDs.
if isinstance(generate_outputs, tuple):
    output_ids = generate_outputs[0]
else:
    output_ids = generate_outputs
text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
# Truncate the decoded text at the first stop token.
for stop in ("<|endoftext|>", "<|im_end|>"):
    text = text.split(stop, 1)[0]
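print(text.strip())

This work builds heavily on the following open-source models:
MinerU, Qwen2-VL, SDAR, and LLaDA;
these acceleration engines:
SGLang, Nano-vLLM (the upstream basis for our nano_dvlm adaptation), and jetengine;
and these theoretical foundations:
MDLM, DiffuLLaMA, and Block Diffusion.
For the training code, we also referred to dLLM-RL.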
If you find our paper and code useful for your research, please consider starring the repository and citing our work:
@article{dong2026minerudiffusion,
title={MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding},
author={Dong, Hejun and Niu, Junbo and Wang, Bin and Zeng, Weijun and Zhang, Wentao and He, Conghui},
journal={arXiv preprint arXiv:2603.22458},
year={2026}
}
@article{niu2025mineru2,
title={MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing},
author={Niu, Junbo and Liu, Zheng and Gu, Zhuangcheng and Wang, Bin and Ouyang, Linke and Zhao, Zhiyuan and Chu, Tao and He, Tianyao and Wu, Fan and Zhang, Qintong and others},
journal={arXiv preprint arXiv:2509.22186},
year={2025}
}
@article{wang2024mineru,
title={MinerU: An open-source solution for precise document content extraction},
author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
journal={arXiv preprint arXiv:2409.18839},
year={2024}
}
@article{he2024opendatalab,
title={OpenDataLab: Empowering general artificial intelligence with open datasets},
author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
journal={arXiv preprint arXiv:2407.13773},
year={2024}
}

This project is licensed under the MIT License. See the LICENSE file for details.