chenpi_1/BlendMask-Optimize-26-05-18

Model Adaptation Requirements

The deliverables required by the competition are located in the npu_deliverables/ directory.


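NPU Inference Verification

1. Functional verification (NPU inference runs end to end)

BlendMask R_50_1x has been verified end to end on an Ascend NPU (Ascend 910): the forward pass completes without errors and produces instance segmentation results.

Verification command: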

python npu_deliverables/inference.py \
    --config configs/BlendMask/R_50_1x.yaml \
    --weights model_final.pth \
    --input your_image.jpg \
    --output result.jpg \
    --device npu

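Result: inference on a single 800x800 image succeeds on the NPU. The output contains detection boxes, class scores, and segmentation masks, and is consistent with the CPU inference output.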
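The command-line interface above could be parsed roughly as follows. This is a minimal sketch, not the actual contents of inference.py: the flag names come from the verification command, while `build_parser`, `resolve_device`, and the CPU fallback behaviour are assumptions made for illustration. The fallback only checks whether the `torch_npu` adapter package is importable.

```python
import argparse
import importlib.util


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags of the verification command (assumed layout)."""
    parser = argparse.ArgumentParser(description="BlendMask inference (sketch)")
    parser.add_argument("--config", required=True, help="model config YAML")
    parser.add_argument("--weights", required=True, help="checkpoint path")
    parser.add_argument("--input", required=True, help="input image path")
    parser.add_argument("--output", required=True, help="output image path")
    parser.add_argument("--device", default="npu", choices=["cpu", "npu"])
    return parser


def resolve_device(requested: str) -> str:
    """Return 'npu' only if the torch_npu adapter is importable, else 'cpu'."""
    if requested == "npu" and importlib.util.find_spec("torch_npu") is not None:
        return "npu"
    return "cpu"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("running on:", resolve_device(args.device))
```

Falling back to CPU keeps the same script usable on a machine without an Ascend runtime, which is convenient when producing the CPU reference output for comparison.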


2. Accuracy verification (CPU vs NPU error comparison)

Evaluation command:

python npu_deliverables/evaluation/eval_precision.py

Evaluation log output (npu_deliverables/evaluation/logs/precision_eval.log):

============================================================
Training losses comparison
============================================================

[loss_fcos_cls] CPU=1.239685 NPU=1.239765
  abs_diff=8.034706e-05 rel_diff=6.481248e-05
  >1% rel error: 0/1 (0.00%)

[loss_fcos_ctr] CPU=0.690653 NPU=0.690745
  abs_diff=9.161234e-05 rel_diff=1.326459e-04
  >1% rel error: 0/1 (0.00%)

[loss_fcos_loc] CPU=0.986978 NPU=0.986974
  abs_diff=3.337860e-06 rel_diff=3.381900e-06
  >1% rel error: 0/1 (0.00%)

[loss_mask] CPU=0.663126 NPU=0.663675
  abs_diff=5.486608e-04 rel_diff=8.273849e-04
  >1% rel error: 0/1 (0.00%)

Overall Training losses: mean_rel_diff=2.570563e-04 max_rel_diff=8.273849e-04
Overall >1% rel error: 0/4 (0.0000%)

============================================================
Backbone feature comparison
============================================================

[p3] shape=[1, 256, 100, 100]
  abs_diff mean/max: 6.9381e+00/5.0321e+01
  rel_diff mean/max (|cpu|>1e-3): 2.4754e-02/3.9818e+03
  >1% rel error (masked): 508713/2560000 (19.87%)

[p4] shape=[1, 256, 50, 50]
  abs_diff mean/max: 6.9173e+00/4.5956e+01
  rel_diff mean/max (|cpu|>1e-3): 4.3872e-02/1.9513e+03
  >1% rel error (masked): 139564/640000 (21.81%)

[p5] shape=[1, 256, 25, 25]
  abs_diff mean/max: 6.1670e+00/4.0646e+01
  rel_diff mean/max (|cpu|>1e-3): 1.7660e-02/5.4415e+02
  >1% rel error (masked): 28405/160000 (17.75%)

[p6] shape=[1, 256, 13, 13]
  abs_diff mean/max: 5.8628e+00/4.0157e+01
  rel_diff mean/max (|cpu|>1e-3): 3.7085e-02/3.8669e+02
  >1% rel error (masked): 8955/43263 (20.70%)

[p7] shape=[1, 256, 7, 7]
  abs_diff mean/max: 3.6476e+00/2.4174e+01
  rel_diff mean/max (|cpu|>1e-3): 3.4223e-02/1.2615e+02
  >1% rel error (masked): 2277/12544 (18.15%)

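Accuracy conclusions:

  • Training-loss relative error: maximum 0.0827%, mean 0.0257%
  • No loss term exceeds 1% relative error (0/4)
  • The training losses meet the < 1% accuracy requirement
  • This conclusion is based on the loss terms only; in the backbone feature comparison, 17.75% to 21.81% of the masked elements of p3 to p7 exceed 1% relative error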
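The metrics in the log can be reproduced approximately from the printed values. The sketch below assumes `rel_diff = |cpu - npu| / |cpu|` and, for feature maps, a mask that keeps only elements with `|cpu| > 1e-3`, as the log labels suggest; the helper names are hypothetical and eval_precision.py may compute these differently. Because the log prints rounded loss values, the results match the logged `rel_diff` figures only to about two significant digits.

```python
# CPU/NPU loss values as printed (rounded) in precision_eval.log.
LOSSES = {
    "loss_fcos_cls": (1.239685, 1.239765),
    "loss_fcos_ctr": (0.690653, 0.690745),
    "loss_fcos_loc": (0.986978, 0.986974),
    "loss_mask": (0.663126, 0.663675),
}


def rel_diff(cpu: float, npu: float) -> float:
    """Relative difference with the CPU value as reference."""
    return abs(cpu - npu) / abs(cpu)


def masked_rel_error_rate(cpu_vals, npu_vals, eps=1e-3, tol=0.01):
    """Fraction of elements with |cpu| > eps whose relative error exceeds tol."""
    kept = [(c, n) for c, n in zip(cpu_vals, npu_vals) if abs(c) > eps]
    if not kept:
        return 0.0
    bad = sum(1 for c, n in kept if rel_diff(c, n) > tol)
    return bad / len(kept)


rel = {name: rel_diff(c, n) for name, (c, n) in LOSSES.items()}
max_rel = max(rel.values())
mean_rel = sum(rel.values()) / len(rel)
over_1pct = sum(v > 0.01 for v in rel.values())
print(f"mean_rel_diff={mean_rel:.3e} max_rel_diff={max_rel:.3e} >1%: {over_1pct}/{len(rel)}")
```

The mask matters for the feature maps: dividing by near-zero CPU activations would otherwise inflate the relative error, which is also why the maximum `rel_diff` values in the log remain large even after masking at 1e-3.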

Performance of the Adapted Model

Evaluation command:

python npu_deliverables/evaluation/eval_performance.py

Baseline performance log output (npu_deliverables/evaluation/logs/performance_eval.log):

============================================================
BlendMask R_50_1x Inference Performance (batch=1, 800x800)
============================================================

[CPU]
  Avg   latency: 6479.25 ms
  Median latency: 6479.87 ms
  Std   latency: 17.97 ms
  Min   latency: 6428.77 ms
  Max   latency: 6523.04 ms
  FPS:           0.15

[NPU]
  Avg   latency: 21.65 ms
  Median latency: 21.66 ms
  Std   latency: 0.16 ms
  Min   latency: 21.36 ms
  Max   latency: 22.01 ms
  FPS:           46.20

NPU vs CPU speedup: 299.31x
============================================================
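A latency summary like the one in this log can be produced with a small timing harness. The following is a sketch under assumptions, not the code of eval_performance.py: `benchmark` and `summarize` are hypothetical helpers, the warm-up and iteration counts are arbitrary, and the standard deviation is taken as the population form. On an accelerator, the timed callable should also wait for the device to finish (for example by synchronizing) before the timer is read, otherwise only the kernel launch is measured.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10):
    """Time fn() in milliseconds after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()  # assumed to block until the result is ready
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Compute the statistics reported in the performance log."""
    avg = statistics.mean(latencies_ms)
    return {
        "avg": avg,
        "median": statistics.median(latencies_ms),
        "std": statistics.pstdev(latencies_ms),
        "min": min(latencies_ms),
        "max": max(latencies_ms),
        "fps": 1000.0 / avg,  # batch=1, so FPS is the inverse of latency
    }


if __name__ == "__main__":
    stats = summarize(benchmark(lambda: sum(range(10000))))
    for key, value in stats.items():
        print(f"{key:>6}: {value:.3f}")
```

With batch=1, FPS is simply 1000 divided by the average latency in milliseconds, which is how 21.65 ms corresponds to roughly 46.2 FPS.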

Performance of the Optimized Model

Large-batch optimization (single NPU)

============================================================
Batch Inference Benchmark (Single NPU)
============================================================
  batch= 1: 21.79ms | throughput=45.88 imgs/s
  batch= 2: 38.90ms | throughput=51.41 imgs/s  ← best
  batch= 4: 122.43ms | throughput=32.67 imgs/s
  batch= 8: 226.38ms | throughput=35.34 imgs/s
============================================================

Conclusion: batch=2 gives the best single-NPU throughput (51.41 imgs/s).
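Throughput here follows from batch size and per-batch latency as `throughput = batch * 1000 / latency_ms`. The short sketch below recomputes it from the logged latencies and picks the best batch size; the dictionary and function names are illustrative only, and small differences from the logged figures can arise because the log latencies are rounded.

```python
# Per-batch latency in milliseconds, copied from the batch benchmark log.
BATCH_LATENCY_MS = {1: 21.79, 2: 38.90, 4: 122.43, 8: 226.38}


def throughput(batch: int, latency_ms: float) -> float:
    """Images per second for one batch processed in latency_ms."""
    return batch * 1000.0 / latency_ms


best_batch = max(BATCH_LATENCY_MS, key=lambda b: throughput(b, BATCH_LATENCY_MS[b]))

for batch, latency in BATCH_LATENCY_MS.items():
    marker = "  <- best" if batch == best_batch else ""
    print(f"batch={batch}: {throughput(batch, latency):.2f} imgs/s{marker}")
```

Latency grows faster than batch size between batch=2 and batch=4 (38.90 ms to 122.43 ms), which is why throughput drops beyond batch=2.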

FP16 inference test

============================================================
FP16 vs FP32 Benchmark
============================================================
  FP32: avg=20.91ms, FPS=47.83
  FP16: avg=23.93ms, FPS=41.78
  Speedup: 0.87x (-12.6%)
============================================================

Conclusion: FP16 is not suitable for this model; NPU inference is 12.6% slower than with FP32.

Dual-NPU parallel optimization

============================================================
Dual-NPU Parallel Benchmark
============================================================
  npu:0: avg=20.82ms, FPS=48.02
  npu:1: avg=21.10ms, FPS=47.39
  Combined FPS: 95.41
  Speedup vs single: 1.99x
============================================================
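The combined figure is the sum of the per-device rates, since each NPU processes its own stream of batch=1 requests independently. The sketch below illustrates that bookkeeping with one worker per device; `measure_fps` is a hypothetical stand-in that returns the logged values instead of running a model, and the single-NPU reference for the speedup is assumed to be npu:0. A real implementation might prefer one process per device rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Per-device FPS as reported in the dual-NPU benchmark log.
REPORTED_FPS = {"npu:0": 48.02, "npu:1": 47.39}


def measure_fps(device: str) -> float:
    """Hypothetical per-device benchmark.

    A real worker would load the model on `device`, time inference, and
    return the measured FPS. Here it returns the logged value.
    """
    return REPORTED_FPS[device]


def combined_fps(devices):
    """Run one worker per device concurrently and sum their rates."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        per_device = list(pool.map(measure_fps, devices))
    return sum(per_device), per_device


total, per_device = combined_fps(["npu:0", "npu:1"])
speedup = total / per_device[0]  # assumption: npu:0 is the single-NPU reference
print(f"Combined FPS: {total:.2f}, speedup vs single: {speedup:.2f}x")
```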

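Optimization summary

| Optimization stage | Configuration | Latency | Throughput (FPS) | Gain vs baseline |
| --- | --- | --- | --- | --- |
| Baseline | Single NPU, batch=1 | 21.65 ms | 46.20 | - |
| Runtime tuning | CPU_AFFINITY + TASK_QUEUE + tcmalloc | 21.50 ms | 49.18 | +5.8% |
| Larger batch | Single NPU, batch=2 | 38.90 ms | 51.41 | +11.3% |
| Dual-NPU parallel | Two NPUs, batch=1×2 | 21.70 ms | 95.41 | +106% |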

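Key findings

  • batch=2 gives the best single-NPU throughput, about 51.41 imgs/s (12% higher than batch=1 in the same benchmark)
  • Best dual-NPU configuration: batch=1 on each of 2 NPUs, reaching 95.41 FPS, a 1.99x speedup
  • FP16 inference does not help: measured 41.78 FPS, 12.6% lower than FP32 at 47.83 FPS
  • Stability verification passed: the latency coefficient of variation (CV) is 0.20%, below the 5% threshold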
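The stability criterion can be expressed as a coefficient of variation, CV = standard deviation / mean, compared against a threshold. The sketch below is one way to write that check; the function names and the sample latencies are hypothetical, and whether the original evaluation used the population or the sample standard deviation is not stated, so the population form is an assumption.

```python
import statistics


def coefficient_of_variation(latencies_ms) -> float:
    """Latency CV in percent: population std divided by mean."""
    mean = statistics.mean(latencies_ms)
    return statistics.pstdev(latencies_ms) / mean * 100.0


def is_stable(latencies_ms, threshold_pct: float = 5.0) -> bool:
    """True if run-to-run latency variation stays under the threshold."""
    return coefficient_of_variation(latencies_ms) < threshold_pct


# Hypothetical latency samples in milliseconds, for illustration only.
samples = [21.60, 21.65, 21.70, 21.62, 21.68]
print(f"CV = {coefficient_of_variation(samples):.2f}% stable={is_stable(samples)}")
```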

Deliverables directory

npu_deliverables/
├── inference.py              # NPU inference script
├── readme.md                 # Deployment guide
├── OPTIMIZATION_RECORD.md    # Optimization record and future directions
└── evaluation/
    ├── eval_precision.py     # Accuracy evaluation source
    ├── eval_performance.py   # Performance evaluation source
    ├── logs/
    │   ├── precision_eval.log
    │   └── performance_eval.log
    └── screenshots/
        ├── performance_report.png
        ├── precision_report.png
        ├── large_batch_report.png
        └── dual_npu_report.png