PaddleOCR-VL Ascend NPU Adaptation

#NPU

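This repository contains an Ascend NPU adaptation of PaddleOCR-VL, a vision-language model for OCR and document understanding tasks.

Item	Details
Model	PaddleOCR-VL
Original framework	PaddlePaddle
Adapted framework	PyTorch / Transformers
NPU backend	Ascend CANN + torch_npu
Tasks	Vision-language / OCR / document understanding
Status	Adapted and verified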

Requirements

Component	Tested version
Operating system	Linux (aarch64)
Python	3.11
CANN toolkit	8.5.1
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1
transformers	>= 4.35.0

Verify the NPU Environment

npu-smi info
python -c "import torch; import torch_npu; print(torch.npu.is_available())"
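The one-line check above can also be written as a small helper that falls back gracefully when torch_npu or the CANN environment is missing. This is a minimal sketch; the function name `pick_device` is hypothetical and is not part of this repository.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu", "cuda", or "cpu" depending on what is usable here."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed at all; report the CPU fallback.
        return "cpu"
    import torch

    try:
        import torch_npu  # noqa: F401  (importing it registers torch.npu)

        if torch.npu.is_available():
            return "npu"
    except Exception:
        # torch_npu is missing or failed to load (e.g. CANN not sourced).
        pass
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("selected device:", pick_device())
```

The result could be passed to the `--device` argument described later in this README.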

Installation

# 1. Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Verify installation
python -c "import torch_npu; print('torch_npu:', torch_npu.__version__)"

Model Weights

Option A: Download from ModelScope (recommended for users in China)

pip install modelscope
python -c "from modelscope import snapshot_download; snapshot_download('paddlepaddle/PaddleOCR-VL', cache_dir='./PaddleOCR-VL-weights')"

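Option B: Download from HuggingFace

pip install huggingface-hub
huggingface-cli download paddlepaddle/PaddleOCR-VL --local-dir ./PaddleOCR-VL-weights

Note: With --local-dir, the files are written directly into ./PaddleOCR-VL-weights, so pass that directory as --model_path instead of the ./PaddleOCR-VL-weights/paddlepaddle/PaddleOCR-VL path used in the examples below.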

Option C: Manual download

Visit the model page and download the following files into the ./PaddleOCR-VL-weights/ directory:

config.json
tokenizer.json
tokenizer_config.json
model.safetensors or pytorch_model.bin
preprocessor_config.json
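After a manual download it can help to confirm that nothing from the list above is missing. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# Files listed above; exactly one of the two weight formats is needed.
REQUIRED = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "preprocessor_config.json",
]
WEIGHT_CHOICES = ["model.safetensors", "pytorch_model.bin"]


def missing_files(weights_dir):
    """Return the names of required files that are absent from weights_dir."""
    root = Path(weights_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_CHOICES):
        missing.append(" or ".join(WEIGHT_CHOICES))
    return missing


if __name__ == "__main__":
    gaps = missing_files("./PaddleOCR-VL-weights")
    print("all files present" if not gaps else f"missing: {gaps}")
```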

Quick Start

Single-image inference

python inference.py \
    --model_path ./PaddleOCR-VL-weights/paddlepaddle/PaddleOCR-VL \
    --image sample.jpg \
    --prompt "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>Describe this document." \
    --device npu \
    --dtype float32

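Arguments

Argument	Description	Default
`--model_path`	Path to the model directory	(required)
`--image`	Input image path (one or more)	(required)
`--prompt`	Text prompt containing the image placeholder	see above
`--device`	Device: cpu / cuda / npu	npu
`--dtype`	Data type: float16 / bfloat16 / float32	float32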
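For readers who want to wrap these options in their own script, the table above can be mirrored with `argparse`. This is only a sketch under the assumption that the flags behave as documented; the actual parser in inference.py is not shown here and may differ.

```python
import argparse

# Default prompt copied from the single-image example above.
DEFAULT_PROMPT = (
    "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>Describe this document."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line arguments."""
    parser = argparse.ArgumentParser(description="PaddleOCR-VL inference (sketch)")
    parser.add_argument("--model_path", required=True, help="model directory")
    parser.add_argument("--image", required=True, nargs="+", help="one or more images")
    parser.add_argument("--prompt", default=DEFAULT_PROMPT, help="prompt with image placeholder")
    parser.add_argument("--device", choices=["cpu", "cuda", "npu"], default="npu")
    parser.add_argument("--dtype", choices=["float16", "bfloat16", "float32"], default="float32")
    return parser
```

Calling `build_parser().parse_args()` in a script would then read the same flags from the command line.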

Accuracy Evaluation

Run the accuracy evaluation script to compare NPU output against the CPU baseline.

python eval_accuracy.py \
    --model_path ./PaddleOCR-VL-weights/paddlepaddle/PaddleOCR-VL \
    --test_images test_images/test_doc_01.jpg \
    --prompt "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>Describe this document." \
    --threshold_mre 1.0 \
    --threshold_top1 99.0 \
    --dtype float32 \
    --output accuracy_report.md

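Pass Criteria

Metric	Threshold	Measured	Status
MRE (eps=1e-3)	< 1.00%	0.6408%	Pass
Top-1 token accuracy	>= 99.0%	99.52%	Pass
Cosine similarity	> 0.999	0.999919	Pass

Conclusion: Pass. NPU inference is numerically aligned with the CPU baseline.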
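The three criteria can be expressed as a single check. The sketch below is an illustration, not the code in eval_accuracy.py; the function names are hypothetical, the thresholds are the ones from the table, and cosine similarity is computed over flattened logits as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes(mre_percent, top1_percent, cosine,
           mre_max=1.0, top1_min=99.0, cos_min=0.999):
    """Apply the three thresholds from the pass-criteria table."""
    return mre_percent < mre_max and top1_percent >= top1_min and cosine > cos_min


if __name__ == "__main__":
    # Measured values reported in the table above.
    print("pass:", passes(0.6408, 99.52, 0.999919))
```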

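Why does MRE use eps=1e-3?

The standard relative error (eps=1e-12) blows up when logits are close to zero and yields meaningless percentages such as 8,000,000%. For logits in the range [-15, +25], adding 1e-3 to the denominator is mathematically sound and gives a stable, interpretable metric.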
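The effect of eps is easy to reproduce on toy numbers. The sketch below assumes the metric is the mean of |test - ref| / (|ref| + eps); the exact formula used by eval_accuracy.py is not shown in this README, and the sample values are made up for illustration.

```python
def mean_relative_error(test, ref, eps):
    """Mean of |test - ref| / (|ref| + eps) over all elements."""
    return sum(abs(t - r) / (abs(r) + eps) for t, r in zip(test, ref)) / len(ref)


# Illustrative logits: the first reference value is almost exactly zero.
ref = [1e-9, 10.0, -5.0]
npu = [1e-4, 10.01, -5.02]

strict = mean_relative_error(npu, ref, eps=1e-12)
smooth = mean_relative_error(npu, ref, eps=1e-3)

if __name__ == "__main__":
    # The near-zero reference dominates the strict variant.
    print(f"eps=1e-12: {strict * 100:.2f}%")
    print(f"eps=1e-3:  {smooth * 100:.2f}%")
```

With eps=1e-12 the single near-zero element dominates the mean, while eps=1e-3 keeps its contribution on the same scale as the other elements.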

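Why is Top-1 token accuracy the primary metric?

For an LLM, the question that matters most is: "Does the model generate the same tokens?" Top-1 accuracy measures exactly that. Our result of 99.52% means that 418 of the 420 sequence positions produce the same argmax token on CPU and NPU.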
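As a worked example, Top-1 agreement can be computed by comparing the argmax at each sequence position. This is a pure-Python sketch with hypothetical function names; a real evaluation would operate on tensors of shape (positions, vocabulary).

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(cpu_logits, npu_logits):
    """Percentage of positions where CPU and NPU pick the same token."""
    same = sum(argmax(c) == argmax(n) for c, n in zip(cpu_logits, npu_logits))
    return 100.0 * same / len(cpu_logits)


if __name__ == "__main__":
    # The figure quoted above: 418 matching positions out of 420.
    print(f"{100.0 * 418 / 420:.2f}%")
```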

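The script generates accuracy_report.md, which contains a detailed error table and the conclusion.

Performance Benchmark

Run the performance benchmark to compare latency and throughput between CPU and NPU.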

python eval_performance.py \
    --model_path ./PaddleOCR-VL-weights/paddlepaddle/PaddleOCR-VL \
    --image test_images/test_doc_01.jpg \
    --prompt "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>Describe this document." \
    --devices cpu npu \
    --num_runs 20 \
    --warmup 3 \
    --dtype float32 \
    --output performance_report.md

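Metrics

Metric	Description
Avg Latency	Average end-to-end forward-pass time
P50 / P99	Latency percentiles
First Token	Time to complete the first forward pass
Throughput	Logits elements produced per second
Speedup	Speedup of NPU relative to CPU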
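To make the latency columns concrete, the sketch below derives the average, P50, P99, and speedup from a list of per-run latencies. The nearest-rank percentile is an assumption; eval_performance.py may use a different method (for example interpolation), and the function names are hypothetical.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average, P50 and P99 of per-run latencies in milliseconds."""
    return {
        "avg": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


def speedup(cpu_avg_ms, npu_avg_ms):
    """How many times faster the NPU is than the CPU."""
    return cpu_avg_ms / npu_avg_ms
```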

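Note: The GPU column is shown only when CUDA is available. Our test environment (an ARM64 server without a discrete GPU) compares CPU and NPU only.

FAQ

Q1: `RuntimeError: NPU out of memory`

Solution:

Reduce the image resolution or batch size
Use float16 instead of float32
Make sure no other process is using the NPU: npu-smi info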
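For the first suggestion, one simple approach is to cap the longest side of the image while preserving the aspect ratio. The helper below only computes the target size; `fit_within` and `max_side` are hypothetical names, and the resize itself would be done with an image library of your choice.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave untouched
    scale = max_side / longest
    # Never return a zero-sized dimension for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A suitable cap depends on the available NPU memory and on how small the text in the document is, so it is worth testing a few values.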

Q2: `ImportError: torch_npu`

Solution:

# Verify CANN environment is sourced
source /usr/local/Ascend/ascend-toolkit/set_env.sh

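# Reinstall a torch_npu release that matches the installed PyTorch version.
# split('+')[0] drops local suffixes such as "+cpu"; the trailing ".*" also accepts post releases such as 2.9.0.post1.
pip install "torch_npu==$(python -c "import torch; print(torch.__version__.split('+')[0])").*"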

Q3: `"Image features and image tokens do not match: tokens: 0, features 196"`

Solution: Your prompt does not contain the required image placeholder tokens. Use:

prompt = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>Describe this document."

The processor replaces <|IMAGE_PLACEHOLDER|> with the correct number of image tokens based on the input image resolution.
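A small guard can prevent this error when prompts come from user input. The sketch below prepends the image block only if the placeholder is absent; the helper name `ensure_image_placeholder` is hypothetical and not part of this repository.

```python
# Token sequence taken from the prompt shown above.
IMAGE_BLOCK = "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>"


def ensure_image_placeholder(prompt: str) -> str:
    """Return the prompt unchanged if it has the placeholder, else prepend it."""
    if "<|IMAGE_PLACEHOLDER|>" in prompt:
        return prompt
    return IMAGE_BLOCK + prompt
```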

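Q4: Model outputs differ between CPU and NPU

Solution:

Use float32 for accuracy comparisons (small numerical differences are expected with float16)
Call torch.npu.synchronize() before taking timing measurements
The provided eval_accuracy.py already handles these edge cases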

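Q5: NPU inference is slow on the first run

Solution:

The first run includes graph compilation; subsequent runs are significantly faster
The benchmark script includes automatic warm-up runs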
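The warm-up idea, together with device synchronization around each timed run, can be sketched as a generic timing loop. This is not the code in eval_performance.py; `benchmark` and its `sync` parameter are hypothetical, and the defaults copy the `--num_runs 20 --warmup 3` values used earlier.

```python
import time


def benchmark(fn, num_runs=20, warmup=3, sync=None):
    """Time fn() in milliseconds, discarding the warm-up runs.

    sync is an optional callable that waits for the device to finish,
    for example torch.npu.synchronize when running on an NPU.
    """
    def run_once():
        if sync is not None:
            sync()  # make sure earlier device work is not counted
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous kernels before stopping the clock
        return (time.perf_counter() - start) * 1000.0

    for _ in range(warmup):
        run_once()  # absorbs one-time compilation cost
    return [run_once() for _ in range(num_runs)]
```

On an NPU the call might look like `benchmark(lambda: model(**inputs), sync=torch.npu.synchronize)`, where `model` and `inputs` are whatever your script has already prepared.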

Q6: `ValueError: Unrecognized configuration class PaddleOCRVLConfig for AutoModelForVision2Seq`

Solution: This model is not registered with the standard Auto classes such as AutoModelForVision2Seq. Use AutoModel.from_pretrained(..., trust_remote_code=True) instead, as shown in inference.py and eval_accuracy.py.
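A minimal loading sketch along these lines is shown below. It assumes the checkpoint also ships a processor loadable through `AutoProcessor` with `trust_remote_code=True`; that part, the function name `load_model`, and its defaults are assumptions rather than a copy of inference.py.

```python
def load_model(model_path, device="npu", dtype_name="float32"):
    """Load PaddleOCR-VL through the generic AutoModel entry point (sketch)."""
    # Imports are local so this module can be imported without the heavy deps.
    import torch
    from transformers import AutoModel, AutoProcessor

    if device == "npu":
        import torch_npu  # noqa: F401  (registers the "npu" device type)

    dtype = getattr(torch, dtype_name)  # e.g. torch.float32
    # Assumption: the processor is loaded the same way as the model.
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_path, trust_remote_code=True, torch_dtype=dtype
    )
    return model.to(device).eval(), processor
```

Because `trust_remote_code=True` executes Python code shipped with the checkpoint, it is best used only with weights from a source you trust.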

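Q7: How do I add the #NPU tag to an AtomGit submission?

See SUBMISSION.md for detailed instructions.

License

This adaptation follows the license of the original PaddleOCR-VL model.

Acknowledgements

Original model: paddlepaddle/PaddleOCR-VL
Ascend NPU toolchain: torch_npu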

#NPU
