TahaDouaji/detr-doc-table-detection on Ascend NPU

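1. Introduction

This document records the migration, accuracy evaluation, and performance validation of the TahaDouaji/detr-doc-table-detection DETR table detection model on Ascend NPU (Ascend 910B3).

DETR (DEtection TRansformer) is an end-to-end object detection model proposed by Facebook AI Research that treats object detection as a set prediction problem. This model combines a ResNet-50 backbone with a Transformer encoder-decoder and is fine-tuned on a document table dataset to detect table regions in document images. The input is an RGB image of arbitrary size; the output is a set of bounding boxes (x1, y1, x2, y2) with confidence scores.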

2. Verification environment

Component	Version
`torch`	`2.8.0`
`torch_npu`	`2.8.0.post4`
`transformers`	`5.8.1`
`timm`	`1.0.27`
`CANN`	`8.5.1`

NPU: 8 × Ascend 910B3
Accuracy baseline: CPU (x86, PyTorch 2.8.0)
Extra dependency: DETR uses TimmBackbone, so the timm library must be installed
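
Before running the model, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical, and CANN is left out because it is not installed through pip.

```python
from importlib import metadata

# Versions taken from the table above. CANN is not a pip package, so it is not checked.
EXPECTED = {
    "torch": "2.8.0",
    "torch_npu": "2.8.0.post4",
    "transformers": "5.8.1",
    "timm": "1.0.27",
}


def check_versions(expected):
    """Return {package: (installed_version_or_None, expected_version, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        report[name] = (have, want, have == want)
    return report


for name, (have, want, ok) in check_versions(EXPECTED).items():
    print(f"{name:14s} installed={have} expected={want} {'OK' if ok else 'MISMATCH'}")
```

The comparison is an exact string match, so a build with a local suffix (for example a `+cpu` tag) is reported as a mismatch even if it is functionally equivalent.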

3. Deployment and usage

3.1 Environment setup

conda create -n TahaDouaji--detr-doc-table-detection python=3.11 -y
conda activate TahaDouaji--detr-doc-table-detection

pip install torch==2.8.0 torch_npu==2.8.0.post4 timm \
    -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers torchvision pillow numpy \
    -i https://pypi.tuna.tsinghua.edu.cn/simple

DETR loads its ResNet backbone through the timm library (TimmBackbone), so timm must be installed explicitly.

3.2 Using the inference script

python inference.py --image document.jpg --device npu
python inference.py --image_dir ./documents/ --device npu

Programmatic interface:

from inference import DetrDetector
detector = DetrDetector(
    model_path="./TahaDouaji--detr-doc-table-detection", device="npu"
)
results = detector.detect(["document.jpg"], threshold=0.5)
# results[0] → {'scores': [0.95], 'labels': ['table'], 'boxes': [[x1,y1,x2,y2]]}
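
To show how the returned structure might be consumed, here is a small sketch that turns one result dict into readable lines. It assumes only the dict layout shown in the comment above; the helper name `format_detections` and the example coordinates are hypothetical.

```python
def format_detections(result, min_score=0.5):
    """Turn one result dict into human-readable lines, highest score first."""
    rows = zip(result["scores"], result["labels"], result["boxes"])
    kept = sorted((r for r in rows if r[0] >= min_score), key=lambda r: r[0], reverse=True)
    lines = []
    for score, label, (x1, y1, x2, y2) in kept:
        lines.append(
            f"{label} {score:.2f} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}) "
            f"size={x2 - x1:.0f}x{y2 - y1:.0f}"
        )
    return lines


# Illustrative values only, not real model output.
example = {"scores": [0.95], "labels": ["table"], "boxes": [[40.0, 120.0, 600.0, 380.0]]}
for line in format_detections(example):
    print(line)
```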

4. Smoke test

python inference.py --image document.jpg --device npu

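Expected output: the number of detected tables, plus the coordinates and confidence score of each box. If nothing is detected, the output is an empty list and no runtime error is raised.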
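
The expectations above can also be checked programmatically. The sketch below validates one result dict of the form returned by the programmatic interface; `validate_result` is a hypothetical helper, and the specific checks (score range, non-degenerate boxes) are assumptions about what a sane result looks like rather than part of the original script.

```python
def validate_result(result):
    """Return a list of problems found in one result dict (an empty list means OK)."""
    problems = []
    scores, labels, boxes = result["scores"], result["labels"], result["boxes"]
    if not (len(scores) == len(labels) == len(boxes)):
        problems.append("scores, labels and boxes differ in length")
    for i, score in enumerate(scores):
        if not 0.0 <= score <= 1.0:
            problems.append(f"score {i} is outside [0, 1]: {score}")
    for i, box in enumerate(boxes):
        if len(box) != 4:
            problems.append(f"box {i} does not have 4 coordinates")
            continue
        x1, y1, x2, y2 = box
        if x2 <= x1 or y2 <= y1:
            problems.append(f"box {i} is degenerate: {box}")
    return problems
```

An image with no detections yields empty lists, which this check accepts without reporting any problem.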

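5. Performance reference

Test conditions: 4 synthetic 640×480 images (fixed random seed), batch_size=2, one NPU warm-up round.

Metric	Value
CPU throughput	`0.3` img/s
NPU throughput	`8.5` img/s
NPU speedup over CPU	`28.7` ×

DETR is a relatively large model (ResNet-50 + 100 object queries), so CPU inference is very slow (0.3 img/s). Native NPU acceleration of the ResNet convolutions and Transformer attention operators yields a 28.7× speedup, but absolute throughput remains low, so this setup is best suited to offline batch processing.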
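
The measurement procedure described above (batches of 2, one warm-up round, then a timed pass) can be sketched as a device-agnostic timing harness. The names `batched` and `measure_throughput` are hypothetical, `run_batch` stands in for whatever function performs inference on one batch, and the optional `sync` callback is an assumption for asynchronous devices (on Ascend this would presumably be something like `torch.npu.synchronize`).

```python
import time


def batched(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def measure_throughput(run_batch, items, batch_size=2, warmup_rounds=1, sync=None):
    """Return items per second for run_batch over items, after warm-up passes."""
    for _ in range(warmup_rounds):
        for batch in batched(items, batch_size):
            run_batch(batch)  # warm-up: triggers operator compilation and caching
    if sync is not None:
        sync()
    start = time.perf_counter()
    for batch in batched(items, batch_size):
        run_batch(batch)
    if sync is not None:
        sync()  # wait for queued device work before stopping the clock
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return len(items) / elapsed
```

Note that the rounded table values give 8.5 / 0.3 ≈ 28.3, so the reported 28.7× was presumably computed from unrounded throughput figures.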

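6. Accuracy evaluation

6.1 Method

Run inference on 4 synthetic images on both CPU and NPU, then compare the classification logits of DETR's 100 object queries (flattened before comparison).

6.2 Results

Metric	Value
Mean cosine similarity	`1.000000`
Accuracy error rate	`0.0000%`

Conclusion: the accuracy error rate is 0.0000% and the NPU and CPU logits agree to the reported precision (cosine similarity 1.000000), so the evaluation passes.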
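
The comparison of flattened logits can be illustrated with a short pure-Python sketch. The function names are hypothetical, and because the exact definition of the "accuracy error rate" is not stated here, the sketch reports the maximum absolute difference alongside cosine similarity instead of guessing at that formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_logits(cpu_logits, npu_logits):
    """Compare two flattened logit vectors from the CPU and NPU runs."""
    if len(cpu_logits) != len(npu_logits):
        raise ValueError("logit vectors must have the same length")
    return {
        "cosine": cosine_similarity(cpu_logits, npu_logits),
        "max_abs_diff": max(abs(x - y) for x, y in zip(cpu_logits, npu_logits)),
    }
```

Reporting both numbers is a deliberate choice: a cosine similarity that rounds to 1.000000 can coexist with small nonzero element-wise differences, so the maximum absolute difference is the more sensitive check.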

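7. Migration notes

7.1 Model structure

Backbone: ResNet-50 (loaded via timm's TimmBackbone), outputs multi-scale feature maps
Encoder: 6-layer Transformer encoder that models the position-encoded feature map globally
Decoder: 6-layer Transformer decoder with 100 learnable object queries predicted in parallel
Prediction heads: a classification head (100 × (num_classes + 1), where the extra slot is DETR's no-object class) and a regression head (100 × 4 bounding box values) that produce the detections
Parameters: about 41M (ResNet + Transformer)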

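7.2 Key adaptation points

Load the model with DetrForObjectDetection.from_pretrained(); the backbone requires the timm library
Move the model with model.to("npu:0"); ResNet convolutions and Transformer attention are accelerated natively on the NPU
AutoImageProcessor performs image preprocessing (resize + normalize) on the CPU, and the resulting tensors are then transferred to the NPU
Post-processing (processor.post_process_object_detection) runs on the CPU: filtering out low-confidence boxes and mapping coordinates
The model can also be exported to ONNX for further cross-platform optimization

7.3 Key code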
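
To make the post-processing step concrete, the sketch below reproduces its logic in plain Python under the usual DETR conventions: each query's logits end with a no-object class, and boxes are predicted as normalized (cx, cy, w, h). It is a simplified illustration, not the library's implementation, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2) pixels."""
    cx, cy, w, h = box
    return ((cx - w / 2) * width, (cy - h / 2) * height,
            (cx + w / 2) * width, (cy + h / 2) * height)


def post_process(query_logits, query_boxes, width, height, threshold=0.5):
    """Keep queries whose best real-class probability exceeds the threshold."""
    detections = []
    for logits, box in zip(query_logits, query_boxes):
        probs = softmax(logits)[:-1]  # drop the trailing no-object class
        score = max(probs)
        if score > threshold:
            detections.append({
                "score": score,
                "label": probs.index(score),
                "box": cxcywh_to_xyxy(box, width, height),
            })
    return detections
```

For a 640×480 image, a normalized box of (0.5, 0.5, 0.5, 0.5) maps to (160, 120, 480, 360) in pixel coordinates.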

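import torch
import torch_npu  # importing torch_npu registers the "npu" device with PyTorch
from PIL import Image
from transformers import AutoImageProcessor, DetrForObjectDetection

# Load the model onto the first NPU; the image processor runs on the CPU.
model = DetrForObjectDetection.from_pretrained("detr-doc-table-detection").to("npu:0")
processor = AutoImageProcessor.from_pretrained("detr-doc-table-detection")

# Preprocess on the CPU, then move the input tensors to the NPU.
image = Image.open("document.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to("npu:0") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    # PIL reports (width, height); post-processing expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    # Filter by confidence and map boxes back to the original image size.
    results = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=0.5
    )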

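8. Notes

timm dependency: DETR's ResNet backbone is loaded through timm's TimmBackbone. Without timm, loading fails with ImportError: TimmBackbone requires the timm library. Running pip install timm is mandatory.
Model size: DETR consists of ResNet-50 (25M) plus the Transformer encoder/decoder (16M), about 41M parameters in total. HBM usage during NPU inference is relatively high, so batch_size ≤ 2 is recommended.
Post-processing on CPU: threshold filtering and coordinate mapping run on the CPU and are outside the scope of NPU acceleration. DETR predicts a set of boxes directly, so no NMS step is involved.
Input size: DETR accepts inputs of arbitrary size (handled internally), but very large images (>2000px) lengthen the Transformer sequence and significantly increase inference time.
First-run NPU warm-up: DETR has many Transformer layers (6 + 6 = 12), so operator compilation takes a while (about 10-15 seconds). Warm up thoroughly in production.
Coordinate format: output boxes are in (x1, y1, x2, y2) format, already mapped back to the original image size.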
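
The effect of input size on sequence length can be estimated with simple arithmetic. The sketch below assumes the standard ResNet-50 output stride of 32 and that the image reaches the backbone at the stated resolution; in practice the image processor may resize inputs first, so actual token counts can differ. The helper name `encoder_tokens` is hypothetical.

```python
import math


def encoder_tokens(height, width, stride=32):
    """Approximate number of encoder tokens for an image, given the backbone stride."""
    return math.ceil(height / stride) * math.ceil(width / stride)


small = encoder_tokens(480, 640)    # 15 * 20 = 300 tokens
large = encoder_tokens(2000, 2000)  # 63 * 63 = 3969 tokens

# Self-attention cost grows with the square of the token count.
attention_cost_ratio = (large / small) ** 2
print(small, large, round(attention_cost_ratio))
```

Under these assumptions a 2000×2000 input produces roughly 13 times as many tokens as a 640×480 one, and since self-attention scales quadratically with sequence length, the encoder attention cost grows by a much larger factor.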