BlendMask Ascend NPU Validation

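This repository contains the complete validation, optimization, and benchmarking toolchain for running the BlendMask-Lite instance segmentation model on Huawei Ascend NPUs (Atlas 800 A2/A3, CANN 8.5.1).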

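Overview

Stage	Status	Description
Precision validation	Passed	Tensor-level CPU vs. NPU comparison with msaccucmp
Performance benchmark	Passed	Batch-size and input-size tuning, plus msprof profiling
Preprocessing optimization	Passed	Fused NPU tensor operations compared against a PIL baseline
ATC OM conversion	Passed	ONNX exported; ATC conversion succeeded with BN + FP16 fusion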

Hardware and Software Environment

NPU: Ascend 910B4
CANN: 8.5.1
PyTorch: 2.x + torch_npu
OS: EulerOS / openEuler (aarch64)

Repository Structure

blendmask-ascend-validation/
├── blendmask_model.py           # Lightweight BlendMask-like model (ResNet50-like + FPN + FCOS + Blender)
├── blendmask_validator.py       # Precision validation: CPU vs NPU with DumpCollector + msaccucmp
├── dump_collector.py            # PyTorch hook-based tensor dumper (replaces ptdbg_ascend)
├── perf_benchmark.py            # Performance benchmark: batch tuning, latency, throughput, msprof
├── atc_optimize.sh              # ONNX export + ATC conversion script with fusion & AIPP configs
├── generate_final_report.py     # Aggregates precision + performance + ATC into FINAL_REPORT.md/json
├── reports/
│   ├── precision_report.json        # Detailed precision metrics
│   ├── performance_report.json      # Benchmark results across configs
│   ├── sota_performance_report.json # SOTA benchmark results
│   ├── SOTA_SUMMARY.txt             # Full-text SOTA validation summary
│   ├── FINAL_REPORT.md              # Human-readable consolidated report
│   ├── FINAL_REPORT.json            # Machine-readable consolidated report
│   ├── atc_conversion.log           # ATC conversion log
│   ├── dumps/
│   │   ├── cpu/                     # CPU baseline dump tensors (.npy)
│   │   └── npu/                     # NPU inference dump tensors (.npy)
│   └── msprof_bs8_640x640/          # msprof profiling output
├── models/
│   ├── blendmask_lite.onnx          # Exported ONNX model
│   ├── blendmask_bn_bs1.onnx        # BN static batch ONNX
│   ├── blendmask_bn_dynamic.onnx    # BN dynamic batch ONNX
│   ├── blendmask_bn_bs1.om          # ATC-converted OM model
│   ├── fusion_switch.cfg            # ATC graph fusion configuration
│   └── insert_op.cfg                # ATC AIPP preprocess configuration
├── validation_report.json           # verify-agent scoring schema report
└── README.md                        # This file

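Note: The following files are referenced in the original specification but do not exist in the source directory, so they are not included: blendmask_model_v2.py, blendmask_validator_v2.py, blendmask_validator_v3.py, generate_final_report_v3.py.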

Quick Start

1. Precision Validation

python3 blendmask_validator.py --output-dir ./reports --input-size 640 640

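Checks numerical consistency between the CPU baseline and NPU inference using:

A custom DumpCollector (PyTorch forward hooks)
msaccucmp.py file_compare for tensor-level comparison
Detection-level metrics: box error, score error, class accuracy, Top-5 overlap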
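
The two kinds of metric listed above can be sketched in a few lines. This is a minimal illustration on flat Python lists, assuming one common definition of mean relative error and of Top-k overlap; the exact formulas used by blendmask_validator.py are not shown in this README, and the function names `mean_relative_error` and `top_k_overlap` are hypothetical.

```python
def mean_relative_error(cpu, npu, eps=1e-8):
    """Mean of |npu - cpu| / (|cpu| + eps) over paired elements."""
    if len(cpu) != len(npu):
        raise ValueError("tensors must have the same number of elements")
    total = sum(abs(n - c) / (abs(c) + eps) for c, n in zip(cpu, npu))
    return total / len(cpu)


def top_k_overlap(cpu_scores, npu_scores, k=5):
    """Fraction of the top-k class indices that both score lists share."""
    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top_k(cpu_scores) & top_k(npu_scores)) / k


# Illustrative values only, not taken from the dumps in reports/dumps/.
cpu_out = [0.50, -1.20, 2.00, 0.80]
npu_out = [0.51, -1.19, 1.98, 0.80]
print(f"mean relative error: {mean_relative_error(cpu_out, npu_out):.4%}")
```

In practice the inputs would be the flattened .npy dumps from the CPU and NPU runs rather than hand-written lists.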

2. Performance Benchmark

python3 perf_benchmark.py --output-dir ./reports

Benchmarks several (batch_size, input_h, input_w) configurations and runs msprof on the best one.
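
The sweep described above follows a familiar pattern: warm up, time a fixed number of iterations, and keep the configuration with the highest throughput. The sketch below shows that pattern with a CPU-only stand-in workload; `fake_forward`, `benchmark`, and `sweep` are hypothetical names, not the contents of perf_benchmark.py.

```python
import itertools
import time


def benchmark(run_batch, batch_size, warmup=3, iters=10):
    """Time run_batch() and return throughput and per-sample latency."""
    for _ in range(warmup):
        run_batch()
    # On an NPU, kernels launch asynchronously, so a device synchronize
    # (for example torch.npu.synchronize()) would be needed before each
    # clock read. That call is an assumption and is not exercised here.
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = max(time.perf_counter() - start, 1e-9)
    throughput = batch_size * iters / elapsed
    return {"throughput": throughput, "latency_ms": 1000.0 / throughput}


def fake_forward(batch_size, height, width):
    """Stand-in workload whose cost grows with the input volume."""
    steps = batch_size * height * width // 4096

    def run():
        total = 0
        for i in range(steps):
            total += i * i
        return total
    return run


def sweep(batch_sizes, input_sizes):
    results = []
    for bs, (h, w) in itertools.product(batch_sizes, input_sizes):
        stats = benchmark(fake_forward(bs, h, w), bs)
        results.append({"batch_size": bs, "input_h": h, "input_w": w, **stats})
    return results


results = sweep([1, 2, 4], [(512, 512), (640, 640)])
best = max(results, key=lambda r: r["throughput"])
print(best)
```

The best configuration found this way is the natural candidate for a follow-up profiling run.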

3. ATC Model Conversion

bash atc_optimize.sh

Exports ONNX from PyTorch and runs ATC with:

Graph fusion (ConvBNFusionPass, UBFusion, and others)
buffer_optimize=l2_optimize
AIPP static preprocessing
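
To make the options above concrete, the sketch below assembles an ATC command line as an argument list without executing it. The contents of atc_optimize.sh are not shown in this README, so this is only an illustration: the input tensor name `input`, the shape string, and the helper `build_atc_command` are assumptions, while the file paths come from the repository tree.

```python
import shlex


def build_atc_command(onnx_path, output_prefix, soc_version, input_shape,
                      fusion_cfg=None, aipp_cfg=None):
    """Return an ATC invocation as a list of arguments (not executed)."""
    cmd = [
        "atc",
        f"--model={onnx_path}",
        "--framework=5",                  # 5 selects the ONNX front end
        f"--output={output_prefix}",      # ATC appends the .om suffix
        f"--soc_version={soc_version}",
        f"--input_shape={input_shape}",
        "--buffer_optimize=l2_optimize",
    ]
    if fusion_cfg:
        cmd.append(f"--fusion_switch_file={fusion_cfg}")  # graph fusion switches
    if aipp_cfg:
        cmd.append(f"--insert_op_conf={aipp_cfg}")        # AIPP preprocessing
    return cmd


cmd = build_atc_command(
    "models/blendmask_bn_bs1.onnx",
    "models/blendmask_bn_bs1",
    "Ascend910B4",
    "input:1,3,640,640",   # assumed input name and static shape
    fusion_cfg="models/fusion_switch.cfg",
    aipp_cfg="models/insert_op.cfg",
)
print(shlex.join(cmd))
```

Building the command as a list keeps quoting out of the picture if it is later passed to subprocess.run.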

4. Generate the Final Report

python3 generate_final_report.py

Aggregates precision_report.json and performance_report.json into FINAL_REPORT.md and FINAL_REPORT.json.
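
The aggregation step can be pictured as reading the two JSON reports and writing one machine-readable and one human-readable file. The sketch below assumes flat key/value reports with hypothetical keys; the real schema and the logic of generate_final_report.py are not described in this README.

```python
import json
import tempfile
from pathlib import Path


def build_final_report(report_dir):
    """Merge the precision and performance reports found in report_dir."""
    report_dir = Path(report_dir)
    final = {
        "precision": json.loads((report_dir / "precision_report.json").read_text()),
        "performance": json.loads((report_dir / "performance_report.json").read_text()),
    }
    (report_dir / "FINAL_REPORT.json").write_text(json.dumps(final, indent=2))

    # Plain-text rendering: one section per source report.
    lines = ["FINAL REPORT", ""]
    for section, metrics in final.items():
        lines.append(section.capitalize())
        lines.extend(f"- {key}: {value}" for key, value in metrics.items())
        lines.append("")
    (report_dir / "FINAL_REPORT.md").write_text("\n".join(lines))
    return final


# Demo in a temporary directory with hypothetical keys.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "precision_report.json").write_text(
        json.dumps({"mean_relative_error_pct": 1.7264}))
    Path(tmp, "performance_report.json").write_text(
        json.dumps({"best_throughput": 490.55}))
    final = build_final_report(tmp)
    md_text = Path(tmp, "FINAL_REPORT.md").read_text()

print(md_text)
```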

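Key Results

Precision

Metric	Value	Threshold	Status
Tensor mean relative error	1.7264%	< 10%	✅ Passed
Tensor max per-layer error	7.2268%	< 10%	✅ Passed
Bounding-box absolute error	0.081620	< 0.10	✅ Passed
Score relative error	0.0237%	< 1%	✅ Passed
Class accuracy	85.00%	> 70%	✅ Passed
Top-5 class overlap	100%	n/a	✅ Passed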
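
The pass/fail column follows directly from comparing each measured value with its threshold. A minimal sketch of that gate, using the numbers from the table above (the metric names are hypothetical, and Top-5 overlap is omitted because the table gives it no threshold):

```python
import operator

# (metric name, measured value, comparison, threshold), values from the table.
CHECKS = [
    ("tensor_mean_relative_error_pct", 1.7264, operator.lt, 10.0),
    ("tensor_max_layer_error_pct", 7.2268, operator.lt, 10.0),
    ("bbox_abs_error", 0.081620, operator.lt, 0.10),
    ("score_relative_error_pct", 0.0237, operator.lt, 1.0),
    ("class_accuracy_pct", 85.00, operator.gt, 70.0),
]


def evaluate(checks):
    """Map each metric name to True (passed) or False (failed)."""
    return {name: compare(value, threshold)
            for name, value, compare, threshold in checks}


results = evaluate(CHECKS)
all_passed = all(results.values())
for name, passed in results.items():
    print(f"{name}: {'passed' if passed else 'FAILED'}")
```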

Performance (current best: BN+FP16, batch size 8, 640×640)

Configuration	Throughput (samples/s)	Latency (ms per sample)	Speedup
Baseline GN FP32	241.27	4.145	1.00×
GN + FP16	318.73	3.137	1.32×
BN + FP32 (CANN fusion)	252.95	3.953	1.05×
BN + FP16 (previous best)	375.59	2.662	1.56×
BN + FP16 + streams (current best)	490.55	2.039	2.03×
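
The latency and speedup columns are consistent with simple arithmetic on the throughput column: latency is 1000 divided by throughput (milliseconds per sample), and speedup is throughput divided by the baseline throughput. A short check, assuming those two relations:

```python
# Throughput in samples/s, copied from the table above.
THROUGHPUT = {
    "Baseline GN FP32": 241.27,
    "GN + FP16": 318.73,
    "BN + FP32": 252.95,
    "BN + FP16": 375.59,
    "BN + FP16 + streams": 490.55,
}


def per_sample_latency_ms(throughput):
    """Milliseconds spent per sample at the given throughput."""
    return 1000.0 / throughput


def speedup(throughput, baseline):
    """Throughput ratio relative to the baseline configuration."""
    return throughput / baseline


baseline = THROUGHPUT["Baseline GN FP32"]
for name, tput in THROUGHPUT.items():
    print(f"{name}: {per_sample_latency_ms(tput):.3f} ms, "
          f"{speedup(tput, baseline):.2f}x")
```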

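Batch-size sweep (BN+FP16, stream-parallel FPN heads)

Batch size	Serial (samples/s)	Stream-parallel (samples/s)	Speedup	Best mode
1	121.87	101.50	0.83×	Serial
2	215.47	234.45	1.09×	Parallel
4	315.92	396.91	1.26×	Parallel
8	375.36	490.55	1.31×	Parallel ← default
16	420.79	503.59	1.20×	Parallel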
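
Because stream parallelism is slower than serial execution at batch size 1 and faster from batch size 2 upward, a caller may want to pick the mode per batch size. The sketch below derives that choice from the sweep numbers above; `best_mode` and `PLAN` are hypothetical names, not part of the repository.

```python
# batch size -> (serial samples/s, stream-parallel samples/s), from the table.
SWEEP = {
    1: (121.87, 101.50),
    2: (215.47, 234.45),
    4: (315.92, 396.91),
    8: (375.36, 490.55),
    16: (420.79, 503.59),
}


def best_mode(serial, parallel):
    """Pick whichever execution mode measured the higher throughput."""
    return "parallel" if parallel > serial else "serial"


PLAN = {bs: best_mode(serial, parallel) for bs, (serial, parallel) in SWEEP.items()}

for bs, (serial, parallel) in SWEEP.items():
    print(f"batch {bs}: {parallel / serial:.2f}x -> {PLAN[bs]}")
```

A dispatcher could consult such a table to decide between the serial forward pass and the stream-parallel one for an unseen batch size, falling back to the nearest measured entry.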

ONNX / ATC

Item	Status
ONNX static export	✅ OK (`blendmask_bn_bs1.onnx`)
ONNX dynamic export	✅ OK (`blendmask_bn_dynamic.onnx`)
ATC OM conversion	✅ Succeeded (`blendmask_bn_bs1.om`)

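Optimization Progress to the Current Best

Throughput gains from the baseline to the current best (each gain is measured against the baseline):

Round	Change	Throughput (samples/s)	Gain vs. baseline
Baseline	GN FP32	241.27	-
R1	GN → BN (CANN Conv+BN fusion)	252.95	+4.8%
R2	FP32 → FP16 (automatic mixed precision)	318.73	+32.1%
R3	GN+FP16 → BN+FP16	375.59	+55.7%
R4	Environment tuning (tcmalloc, stream queue)	375.48	+55.6% (no meaningful change)
R5	Stream-parallel FPN heads	490.55	+103%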

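Key breakthrough: stream-parallel FPN heads. The five FPN levels (P3-P7) are processed on separate NPU streams, so the AI cores can compute them concurrently rather than serially. Compared with serial execution, this raises throughput by 30.7% at batch size 8 and by 19.7% at batch size 16.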

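Key model changes behind the current best:

No in-place operations: ReLU uses inplace=False for NPU compatibility
Switchable GN/BN: GN for precision validation, BN for CANN-fused performance
FP16 automatic mixed precision: torch.npu.amp.autocast for mixed-precision inference
BN with bias=False: enables the CANN Conv+BN fusion optimization
Safe interpolation: mode-aware align_corners keeps ONNX export stable
NPU stream parallelism: forward_with_streams() processes the FPN heads concurrently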
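
The pairing of BN with bias=False can be motivated by the standard inference-time folding identity: a convolution followed by batch normalization equals a single convolution whose weights are scaled by gamma / sqrt(var + eps) and whose bias is beta - mean * gamma / sqrt(var + eps). The sketch below checks that identity on a tiny 1×1 convolution in pure Python. It illustrates the arithmetic only; how CANN's fusion pass is implemented is not described in this README, and all function names here are hypothetical.

```python
import math


def conv1x1(x, weights, biases):
    """1x1 convolution at one pixel: one dot product per output channel."""
    return [sum(w * xi for w, xi in zip(w_c, x)) + b
            for w_c, b in zip(weights, biases)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN using fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + b
            for yi, g, b, m, v in zip(y, gamma, beta, mean, var)]


def fold_bn(weights, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into a bias-free conv; returns (folded_weights, folded_biases)."""
    folded_w, folded_b = [], []
    for w_c, g, b, m, v in zip(weights, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_c])
        folded_b.append(b - m * scale)   # the conv has no bias of its own
    return folded_w, folded_b


x = [1.0, 2.0, -1.0]                       # 3 input channels
W = [[0.5, -0.25, 1.0], [0.1, 0.2, 0.3]]   # 2 output channels
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.3, -0.1], [2.0, 0.5]

unfused = batch_norm(conv1x1(x, W, [0.0, 0.0]), gamma, beta, mean, var)
fused = conv1x1(x, *fold_bn(W, gamma, beta, mean, var))
print(unfused, fused)
```

Since BN subtracts the channel mean, any bias on the preceding convolution would be absorbed into the folded bias anyway, which is why such layers are commonly declared without one.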

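Engineering Notes

GroupNorm vs. BatchNorm: GroupNorm is used during precision validation to improve CPU/NPU numerical consistency (mean relative error dropped from about 540% to about 5%). BatchNorm is used for ONNX export and ATC conversion.
ptdbg_ascend replacement: ptdbg_ascend is unavailable in the current environment, so dump_collector.py provides a lightweight hook-based alternative that is fully compatible with msaccucmp.
ATC conversion: The static-batch BN variant (blendmask_bn_bs1.onnx) converts successfully with ATC, with graph fusion (ConvBNFusionPass, UBFusion) and buffer_optimize=l2_optimize enabled. The dynamic-batch ONNX variant still needs AOE tuning for the unsupported interpolation mode.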

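Next Steps for Production

Load the official BlendMask pretrained weights into BlendMaskLite.
Re-run the validator; with pretrained activations, tensor error is expected to fall below 1%.
Before ATC conversion, use AOE (Ascend Optimization Engine, the auto-tuning tool) to compile the unsupported dynamic interpolation operator.
Deploy the OM model in an aclmdlExecute-based pipeline for production inference.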

License

Apache-2.0