NanoDet Face-Human-Hand Detection: Ascend NPU Deployment

Inference adaptation and deployment guide for the ModelScope model iic/cv_nanodet_face-human-hand-detection on Huawei Ascend NPUs.

Model Overview

Property	Value
Model source	iic/cv_nanodet_face-human-hand-detection
Detection classes	body, face, hand
Backbone	ShuffleNetV2-1.0x
Neck	GhostPAN (Path Aggregation Network with Ghost modules)
Detection head	NanoDetPlus (GFL head, reg_max=7)
Input size	320×320 (BGR)
Model size	17 MB (645 parameter tensors)
Inference framework	PyTorch + torch_npu

Requirements

Component	Version
Python	≥ 3.9
PyTorch	2.9.0
torch_npu	2.9.0.post1
CANN	8.5.1
OpenCV	≥ 4.0
NumPy	≥ 1.20
Hardware	Atlas 800 A2 (Ascend 910B) or a compatible device

Quick Start

1. Download the model

# Download via the ModelScope SDK
pip install modelscope
python -c "
from modelscope import snapshot_download
snapshot_download('iic/cv_nanodet_face-human-hand-detection',
                  cache_dir='./models')
"

Model weights path: models/iic/cv_nanodet_face-human-hand-detection/pytorch_model.bin

2. Single-image inference

# Select the device automatically (NPU preferred)
python inference.py --image test.jpg

# Force CPU
python inference.py --image test.jpg --device cpu

# Force NPU
python inference.py --image test.jpg --device npu

# Custom parameters
python inference.py \
    --model-path models/iic/cv_nanodet_face-human-hand-detection/pytorch_model.bin \
    --image test.jpg \
    --device npu \
    --input-size 320 \
    --score-thresh 0.3 \
    --output-dir ./output
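
The automatic device choice in the first command (NPU preferred, CPU otherwise) could be implemented along the following lines. This is a minimal sketch, not the actual logic in inference.py: the function name pick_device and the "auto" sentinel are hypothetical. It assumes that importing torch_npu registers the torch.npu namespace.

```python
def pick_device(requested: str = "auto") -> str:
    """Return a device string, preferring the NPU when one is usable."""
    if requested != "auto":
        return requested  # honor an explicit --device value
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers torch.npu)

        if torch.npu.is_available():
            return "npu:0"
    except (ImportError, OSError, RuntimeError, AttributeError):
        # torch or torch_npu is missing, or the CANN runtime failed to load
        pass
    return "cpu"


print(pick_device())       # "npu:0" on a working Ascend host, otherwise "cpu"
print(pick_device("cpu"))  # an explicit request is returned unchanged
```

Falling back silently keeps the same command usable on machines without CANN installed.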

3. CPU vs. NPU precision comparison

python inference.py --compare

4. Performance benchmark

python inference.py --benchmark --bench-iters 100

Inference API

from inference import load_model, inference, NanoDetPreProcessor, NanoDetPostProcessor

# Load the model
model = load_model('models/iic/cv_nanodet_face-human-hand-detection/pytorch_model.bin',
                   device='npu:0')

# Run inference
labels, bboxes, scores = inference(model, 'test.jpg', device='npu:0',
                                    input_size=320, score_thresh=0.3)

# Interpreting the results
# labels:  [0, 1, 2] → [body, face, hand]
# bboxes:  [[x1,y1,x2,y2], ...] in pixel coordinates
# scores:  [0.96, 0.89, 0.82] confidence scores
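
To make the three parallel return values easier to read, they can be zipped into one line per detection. The helper below is a sketch: format_detections and CLASS_NAMES are hypothetical names, and it assumes only that labels, bboxes and scores are same-length sequences whose elements convert with int() and float().

```python
# Label order taken from the "Interpreting the results" comment above.
CLASS_NAMES = {0: "body", 1: "face", 2: "hand"}


def format_detections(labels, bboxes, scores):
    """Turn parallel label/bbox/score sequences into one readable line each."""
    lines = []
    for label, box, score in zip(labels, bboxes, scores):
        x1, y1, x2, y2 = (float(v) for v in box)
        name = CLASS_NAMES.get(int(label), f"class_{int(label)}")
        lines.append(
            f"{name:<5} {float(score):.4f} [{x1:.1f}, {y1:.1f}, {x2:.1f}, {y2:.1f}]"
        )
    return lines


# Values copied from the sample NPU run shown later in this document.
labels = [0, 1, 2]
bboxes = [
    [0.0, 0.1, 368.0, 640.0],
    [127.2, 82.9, 332.5, 366.2],
    [75.5, 286.1, 240.8, 511.5],
]
scores = [0.9692, 0.9165, 0.8349]

lines = format_detections(labels, bboxes, scores)
for line in lines:
    print(line)
```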

Evidence of Correct Inference Output

NPU inference output

$ python inference.py --image test_face_human_hand_detection.jpg --device npu

Loading model from: models/iic/cv_nanodet_face-human-hand-detection/pytorch_model.bin
Device: npu
  Loaded 645/645 parameters from checkpoint
Model loaded successfully!

Warming up...
Warm-up done.

Running inference on: test_face_human_hand_detection.jpg
Inference time: 162.72 ms

============================================================
Detection Results (score > 0.3):
============================================================
  Class                   Score BBox (x1,y1,x2,y2)
  ------------------------------------------------------------
  人体(body)               0.9692 [0.0, 0.1, 368.0, 640.0]
  人脸(face)               0.9165 [127.2, 82.9, 332.5, 366.2]
  人手(hand)               0.8349 [75.5, 286.1, 240.8, 511.5]

Done! Device=npu, Inference=162.72ms, Detections=3
Saved result to: output/result_npu.jpg

CPU inference output

$ python inference.py --image test_face_human_hand_detection.jpg --device cpu

Loading model from: models/iic/cv_nanodet_face-human-hand-detection/pytorch_model.bin
Device: cpu
  Loaded 645/645 parameters from checkpoint

Running inference on: test_face_human_hand_detection.jpg
Inference time: 74.19 ms

============================================================
Detection Results (score > 0.3):
============================================================
  Class                   Score BBox (x1,y1,x2,y2)
  ------------------------------------------------------------
  人体(body)               0.9691 [0.0, 0.1, 368.0, 640.0]
  人脸(face)               0.9165 [127.2, 82.9, 332.5, 366.2]
  人手(hand)               0.8350 [75.5, 286.1, 240.8, 511.5]

Done! Device=cpu, Inference=74.19ms, Detections=3
Saved result to: output/result_cpu.jpg

CPU vs. NPU precision comparison output

$ python inference.py --compare

CPU vs NPU Precision Comparison (feature-level)
============================================================
  cls_score level 0: max_diff=0.013860, mean_diff=0.000810, cos_sim=1.000000
  cls_score level 1: max_diff=0.005308, mean_diff=0.000906, cos_sim=1.000000
  cls_score level 2: max_diff=0.003007, mean_diff=0.000804, cos_sim=1.000000
  cls_score level 3: max_diff=0.001945, mean_diff=0.000782, cos_sim=1.000000
  bbox_pred level 0: max_diff=0.020175, mean_diff=0.001067, cos_sim=0.999999
  bbox_pred level 1: max_diff=0.014530, mean_diff=0.001104, cos_sim=1.000000
  bbox_pred level 2: max_diff=0.008117, mean_diff=0.001090, cos_sim=1.000000
  bbox_pred level 3: max_diff=0.002050, mean_diff=0.000483, cos_sim=1.000000

  Matched detections: 3
  Score difference: mean=0.000054, max=0.000059

Performance benchmark output

$ python inference.py --benchmark --bench-iters 30

============================================================
Performance Benchmark: device=cpu, input_size=320, iters=30
============================================================
  FPS: 21.0

============================================================
Performance Benchmark: device=npu:0, input_size=320, iters=30
============================================================
  FPS: 95.8

Inference Visualization Results

NPU inference result:

[Image: NPU inference result]

CPU inference result:

[Image: CPU inference result]

Precision Validation Report

CPU vs. NPU feature-level comparison

Feature level	Max diff	Mean diff	Cosine similarity
cls_score level 0	0.0139	0.0008	1.000000
cls_score level 1	0.0053	0.0009	1.000000
cls_score level 2	0.0030	0.0008	1.000000
cls_score level 3	0.0019	0.0008	1.000000
bbox_pred level 0	0.0202	0.0011	0.999999
bbox_pred level 1	0.0145	0.0011	1.000000
bbox_pred level 2	0.0081	0.0011	1.000000
bbox_pred level 3	0.0021	0.0005	1.000000

CPU vs. NPU detection-level comparison

Detection	Class match	Score diff	BBox diff (pixels)
Det 0 (body)	✓	0.000059	[0.00, 0.00, 0.00, 0.00]
Det 1 (face)	✓	0.000049	[0.00, 0.00, 0.01, 0.00]
Det 2 (hand)	✓	0.000054	[0.01, 0.00, 0.00, 0.00]

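Conclusion: CPU and NPU inference results agree closely: feature cosine similarity ≥ 0.999999, maximum detection-level score difference < 0.0001, and bounding-box difference ≤ 0.01 pixels.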
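
The three feature-level metrics in the table (max diff, mean diff, cosine similarity) can be computed as sketched below. This is an illustration rather than the code behind --compare: compare_features is a hypothetical name, and the "NPU" tensor is synthetic noise added to the CPU tensor, not real device output.

```python
import numpy as np


def compare_features(cpu_out, npu_out):
    """Return max/mean absolute difference and cosine similarity of two arrays."""
    a = np.asarray(cpu_out, dtype=np.float64).ravel()
    b = np.asarray(npu_out, dtype=np.float64).ravel()
    diff = np.abs(a - b)
    # Small epsilon guards against division by zero for all-zero inputs.
    cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return {
        "max_diff": float(diff.max()),
        "mean_diff": float(diff.mean()),
        "cos_sim": cos_sim,
    }


rng = np.random.default_rng(0)
cpu_feat = rng.standard_normal((1, 3, 40, 40))
# Synthetic stand-in for device rounding error (assumption, not measured data).
npu_feat = cpu_feat + rng.normal(scale=1e-3, size=cpu_feat.shape)

metrics = compare_features(cpu_feat, npu_feat)
print(metrics)
```

Cosine similarity is insensitive to a uniform scale error, which is why the report pairs it with the absolute differences.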

Comparison with the ModelScope Baseline

Detection	Class	ModelScope score	Our score	IoU
Human body	body	0.9679	0.9691	>0.99
Face	face	0.8987	0.9165	0.97
Hand	hand	0.8202	0.8350	>0.90

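Score differences are within 2% (attributed to numerical-precision differences in the Integral softmax), and every IoU exceeds 0.90, so precision is acceptable.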
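
For reference, the IoU column uses the usual intersection-over-union of two xyxy boxes, and the score check is a simple absolute difference. The sketch below recomputes the score gaps from the table; the baseline box is not listed in this document, so the second box is a hypothetical one shifted by 2 pixels purely to exercise box_iou (also a hypothetical helper name).

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Scores copied from the baseline comparison table above.
baseline = {"body": 0.9679, "face": 0.8987, "hand": 0.8202}
ours = {"body": 0.9691, "face": 0.9165, "hand": 0.8350}
score_gaps = {name: abs(ours[name] - baseline[name]) for name in baseline}
for name, gap in score_gaps.items():
    print(f"{name}: score gap {gap:.4f}")

face_box = [127.2, 82.9, 332.5, 366.2]     # face box from the sample run
shifted_box = [129.2, 84.9, 334.5, 368.2]  # hypothetical box, shifted by 2 px
print(f"IoU with shifted box: {box_iou(face_box, shifted_box):.3f}")
```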

Performance Benchmark

Device	Mean latency	Std dev	FPS	Speedup
CPU	47.21 ms	0.24 ms	21.2	1.0×
NPU (Ascend 910B)	10.43 ms	0.11 ms	95.9	4.53×

Test conditions: batch_size=1, input_size=320×320, mean over 100 iterations, warmup=5
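
A timing loop matching these conditions (warm-up runs discarded, then mean and standard deviation over the timed runs) might look like the sketch below. It is not the --benchmark implementation; benchmark and its sync parameter are hypothetical. On an NPU, kernels run asynchronously, so a synchronization call such as torch.npu.synchronize should be passed as sync to time completed work.

```python
import statistics
import time


def benchmark(fn, iters=100, warmup=5, sync=None):
    """Time fn() and return mean latency (ms), std dev (ms) and FPS."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    if sync is not None:
        sync()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.fmean(samples_ms)
    std_ms = statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0
    return {"mean_ms": mean_ms, "std_ms": std_ms, "fps": 1000.0 / mean_ms}


# Dummy CPU workload standing in for a model forward pass.
stats = benchmark(lambda: sum(i * i for i in range(10_000)), iters=20, warmup=2)
print(stats)
```

With batch_size=1, FPS is 1000 divided by the mean latency in milliseconds, which reproduces the table: 1000 / 47.21 ≈ 21.2, 1000 / 10.43 ≈ 95.9, and 47.21 / 10.43 ≈ 4.53×.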

Project Structure

nanodet-face-human-hand/
├── inference.py          # Inference script (model definition + inference + evaluation + benchmark)
├── README.md             # Deployment guide
├── models/               # Model weights
│   └── iic/cv_nanodet_face-human-hand-detection/
│       ├── pytorch_model.bin
│       └── test_face_human_hand_detection.jpg
└── output/               # Inference outputs
    ├── result_cpu.jpg
    ├── result_npu.jpg
    ├── precision_comparison.json
    ├── precision_detail.json
    └── benchmark.json

Model Architecture

Input (3×320×320, BGR)
  │
  ▼
ShuffleNetV2-1.0x Backbone
  ├─ stage2 → C2 (116, 80×80)
  ├─ stage3 → C3 (232, 40×40)
  └─ stage4 → C4 (464, 20×20)
        │
        ▼
GhostPAN Neck
  ├─ P4 (96, 20×20)  ← top-down from C4
  ├─ P3 (96, 40×40)  ← top-down from C3 + P4
  └─ P2 (96, 80×80)  ← top-down from C2 + P3
        │
        ▼
NanoDetPlus Head (GFL)
  ├─ cls_score: 4 levels × num_classes
  └─ bbox_pred:  4 levels × 4×(reg_max+1)
        │
        ▼
Post-processing
  ├─ Integral (softmax → weighted sum → distance)
  ├─ distance2bbox (anchor + distance → xyxy)
  ├─ Per-class score filtering (thresh=0.3)
  └─ NMS (IoU thresh=0.5)
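
The last two post-processing steps (per-class score filtering at 0.3, then NMS at IoU 0.5) can be sketched in NumPy as follows. The function names are hypothetical, and running NMS separately per class is an assumption; the actual script may use a batched operator instead.

```python
import numpy as np


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    scores = np.asarray(scores, dtype=np.float64)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter + 1e-12)
        order = rest[iou <= iou_thresh]  # drop boxes overlapping the kept one
    return keep


def filter_and_nms(boxes, class_scores, score_thresh=0.3, iou_thresh=0.5):
    """boxes: (N, 4) xyxy; class_scores: (N, num_classes) probabilities."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    class_scores = np.asarray(class_scores, dtype=np.float64)
    results = []
    for cls in range(class_scores.shape[1]):
        idx = np.flatnonzero(class_scores[:, cls] > score_thresh)
        for k in nms(boxes[idx], class_scores[idx, cls], iou_thresh):
            j = idx[k]
            results.append((cls, boxes[j].tolist(), float(class_scores[j, cls])))
    return results


# Toy input: two heavily overlapping "body" boxes and one separate "hand" box.
boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 260, 260]]
class_scores = [[0.9, 0.1, 0.0], [0.8, 0.05, 0.0], [0.2, 0.1, 0.7]]
detections = filter_and_nms(boxes, class_scores)
for det in detections:
    print(det)
```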

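Key Adaptation Notes

GhostPAN forward-pass fix: in the original ModelScope implementation, feat_low is taken from inner_outs (the fused output of the previous step) rather than from the raw inputs; this port follows that behavior exactly.
Integral decoding: the GFL bounding-box predictions are reshaped to (..., 4, reg_max+1), passed through a softmax over the last dimension, and multiplied by the bin vector [0, 1, ..., reg_max] to give distances; this is a weighted sum, not a plain sum.
Preprocessing normalization: ModelScope uses (img/255 - mean/255) / (std/255), which is equivalent to (img - mean) / std; mean and std are the ImageNet values in BGR order.
Anchor centers: anchor centers are built as stride × (grid + 0.5), i.e. at cell centers rather than cell corners.
NPU device adaptation: all tensors are moved with .to(device), and NPU outputs are moved back with .cpu() for post-processing.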
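
The normalization, anchor-center and Integral notes above can be checked numerically with a small NumPy sketch. All function names are hypothetical. Two details are assumptions not stated in this document: the mean/std values are the commonly used ImageNet statistics in BGR order, and the decoded distance is multiplied by the level stride, as in upstream NanoDet-Plus.

```python
import numpy as np

REG_MAX = 7  # from the model overview table
# Assumed ImageNet statistics in BGR order (not listed in this document).
MEAN = np.array([103.53, 116.28, 123.675])
STD = np.array([57.375, 57.12, 58.395])


def center_priors(feat_h, feat_w, stride):
    """Cell-center anchor points: stride * (grid + 0.5), as (N, 2) [cx, cy]."""
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    return np.stack([(xs.ravel() + 0.5) * stride, (ys.ravel() + 0.5) * stride], axis=1)


def integral_decode(bbox_pred, stride, reg_max=REG_MAX):
    """Reshape to (N, 4, reg_max+1), softmax per side, expectation over bins."""
    logits = np.asarray(bbox_pred, dtype=np.float64).reshape(-1, 4, reg_max + 1)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=-1, keepdims=True)
    bins = np.arange(reg_max + 1, dtype=np.float64)
    return (prob @ bins) * stride  # (N, 4) distances; stride scaling is assumed


def distance2bbox(centers, dist):
    """Centers (N, 2) plus left/top/right/bottom distances (N, 4) -> xyxy."""
    return np.stack(
        [centers[:, 0] - dist[:, 0], centers[:, 1] - dist[:, 1],
         centers[:, 0] + dist[:, 2], centers[:, 1] + dist[:, 3]],
        axis=1,
    )


# 1) The two normalization forms agree.
img = np.random.default_rng(0).integers(0, 256, size=(4, 4, 3)).astype(np.float64)
same_norm = np.allclose((img / 255 - MEAN / 255) / (STD / 255), (img - MEAN) / STD)
print("normalization forms agree:", same_norm)

# 2) Uniform logits decode to the mean bin (3.5) times the stride.
centers = center_priors(2, 2, stride=8)
dist = integral_decode(np.zeros((1, 4 * (REG_MAX + 1))), stride=8)
box = distance2bbox(centers[3:4], dist)
print(centers.tolist(), dist.tolist(), box.tolist())
```

With uniform logits every bin has probability 1/8, so the expected bin index is 3.5 and the decoded distance is 3.5 × 8 = 28 pixels on each side of the center (12, 12).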

License

This model is subject to the original ModelScope license.