比赛要求提交的交付物位于 npu_deliverables/ 目录下。
BlendMask R_50_1x 已在昇腾 NPU(Ascend 910)上完成端到端推理验证,前向传播无报错,可正常输出实例分割结果。
验证命令:
python npu_deliverables/inference.py \
--config configs/BlendMask/R_50_1x.yaml \
--weights model_final.pth \
--input your_image.jpg \
--output result.jpg \
--device npu验证结果: 单张 800x800 图像在 NPU 上推理成功,输出包含检测框、类别分数和分割掩码,与 CPU 推理结果一致。
评测命令:
python npu_deliverables/evaluation/eval_precision.py评测日志输出(npu_deliverables/evaluation/logs/precision_eval.log):
============================================================
Training losses comparison
============================================================
[loss_fcos_cls] CPU=1.239685 NPU=1.239765
abs_diff=8.034706e-05 rel_diff=6.481248e-05
>1% rel error: 0/1 (0.00%)
[loss_fcos_ctr] CPU=0.690653 NPU=0.690745
abs_diff=9.161234e-05 rel_diff=1.326459e-04
>1% rel error: 0/1 (0.00%)
[loss_fcos_loc] CPU=0.986978 NPU=0.986974
abs_diff=3.337860e-06 rel_diff=3.381900e-06
>1% rel error: 0/1 (0.00%)
[loss_mask] CPU=0.663126 NPU=0.663675
abs_diff=5.486608e-04 rel_diff=8.273849e-04
>1% rel error: 0/1 (0.00%)
Overall Training losses: mean_rel_diff=2.570563e-04 max_rel_diff=8.273849e-04
Overall >1% rel error: 0/4 (0.0000%)
============================================================
Backbone feature comparison
============================================================
[p3] shape=[1, 256, 100, 100]
abs_diff mean/max: 6.9381e+00/5.0321e+01
rel_diff mean/max (|cpu|>1e-3): 2.4754e-02/3.9818e+03
>1% rel error (masked): 508713/2560000 (19.87%)
[p4] shape=[1, 256, 50, 50]
abs_diff mean/max: 6.9173e+00/4.5956e+01
rel_diff mean/max (|cpu|>1e-3): 4.3872e-02/1.9513e+03
>1% rel error (masked): 139564/640000 (21.81%)
[p5] shape=[1, 256, 25, 25]
abs_diff mean/max: 6.1670e+00/4.0646e+01
rel_diff mean/max (|cpu|>1e-3): 1.7660e-02/5.4415e+02
>1% rel error (masked): 28405/160000 (17.75%)
[p6] shape=[1, 256, 13, 13]
abs_diff mean/max: 5.8628e+00/4.0157e+01
rel_diff mean/max (|cpu|>1e-3): 3.7085e-02/3.8669e+02
>1% rel error (masked): 8955/43263 (20.70%)
[p7] shape=[1, 256, 7, 7]
abs_diff mean/max: 3.6476e+00/2.4174e+01
rel_diff mean/max (|cpu|>1e-3): 3.4223e-02/1.2615e+02
>1% rel error (masked): 2277/12544 (18.15%)精度结论:
评测命令:
python npu_deliverables/evaluation/eval_performance.py基线性能日志输出(npu_deliverables/evaluation/logs/performance_eval.log):
============================================================
BlendMask R_50_1x Inference Performance (batch=1, 800x800)
============================================================
[CPU]
Avg latency: 6479.25 ms
Median latency: 6479.87 ms
Std latency: 17.97 ms
Min latency: 6428.77 ms
Max latency: 6523.04 ms
FPS: 0.15
[NPU]
Avg latency: 21.65 ms
Median latency: 21.66 ms
Std latency: 0.16 ms
Min latency: 21.36 ms
Max latency: 22.01 ms
FPS: 46.20
NPU vs CPU speedup: 299.31x
========================================================================================================================
Batch Inference Benchmark (Single NPU)
============================================================
batch= 1: 21.79ms | throughput=45.88 imgs/s
batch= 2: 38.90ms | throughput=51.41 imgs/s ← 最优
batch= 4: 122.43ms | throughput=32.67 imgs/s
batch= 8: 226.38ms | throughput=35.34 imgs/s
============================================================结论:batch=2 为单卡最优吞吐量 (51.41 imgs/s)
============================================================
FP16 vs FP32 Benchmark
============================================================
FP32: avg=20.91ms, FPS=47.83
FP16: avg=23.93ms, FPS=41.78
Speedup: 0.87x (-12.6%)
============================================================结论:FP16 不适用于此模型,NPU 推理反而更慢
============================================================
Dual-NPU Parallel Benchmark
============================================================
npu:0: avg=20.82ms, FPS=48.02
npu:1: avg=21.10ms, FPS=47.39
Combined FPS: 95.41
Speedup vs single: 1.99x
============================================================| 优化阶段 | 配置 | 延迟 | 帧率 | 相对基准提升 |
|---|---|---|---|---|
| 基准线 | 单卡 batch=1 | 21.65 毫秒 | 46.20 | — |
| 运行时优化 | CPU_AFFINITY + TASK_QUEUE + tcmalloc | 21.50 毫秒 | 49.18 | +5.8% |
| 增大批处理量 | 单卡 batch=2 | 38.90 毫秒 | 51.41 | +11.3% |
| 双卡并行 | 双卡 batch=1×2 | 21.70 毫秒 | 95.41 | +106% |
npu_deliverables/
├── inference.py # NPU 推理脚本
├── readme.md # 部署文档
├── OPTIMIZATION_RECORD.md # 优化记录与后续方向
└── evaluation/
├── eval_precision.py # 精度评测源码
├── eval_performance.py # 性能评测源码
├── logs/
│ ├── precision_eval.log
│ └── performance_eval.log
└── screenshots/
├── performance_report.png
├── precision_report.png
├── large_batch_report.png
└── dual_npu_report.png