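WebSSL-MAE 300M (ViT-L) - Ascend NPU Adaptation

Model Overview

Web-SSL MAE ViT-L (300M) is a 300-million-parameter Vision Transformer (ViT-Large). It is trained with self-supervision as a Masked Autoencoder (MAE) on 2 billion web images, with no language supervision at any stage. The model was introduced by Fan et al., 2025.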

Property	Value
Architecture	ViT-Large
Parameters	304M
Input resolution	224×224
Patch size	16×16
Layers	24
Hidden dimension	1024
Attention heads	16
Training data	MetaCLIP web data (2 billion samples)
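
The input resolution and patch size above fix the length of the token sequence the encoder processes. The following is a minimal sketch of that arithmetic, assuming a standard ViT layout with one [CLS] token prepended to the patch tokens; the function name `vit_sequence_length` is illustrative and not part of any library.

```python
def vit_sequence_length(image_size: int, patch_size: int, num_prefix_tokens: int = 1) -> int:
    """Number of tokens for a square image split into square, non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # Patch tokens plus any prefix tokens (assumed here: a single [CLS] token).
    return patches_per_side ** 2 + num_prefix_tokens


hidden_dim = 1024
seq_len = vit_sequence_length(image_size=224, patch_size=16)
# Expected shape of last_hidden_state for a single image.
output_shape = (1, seq_len, hidden_dim)
print(output_shape)
```

With 224×224 inputs and 16×16 patches this gives 14 × 14 = 196 patch tokens plus one [CLS] token, which matches the `[1, 197, 1024]` shape noted in the inference script.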

Ascend NPU Adaptation

Environment Requirements

Component	Version
Python	3.11
PyTorch	2.9.0+cpu
torch_npu	✅ installed
CANN	8.5.1
transformers	✅ installed
Ascend NPU	Atlas 800 (single card)

Quick Start

# Set the HuggingFace mirror (for networks in mainland China)
export HF_ENDPOINT=https://hf-mirror.com

# Single-image inference (NPU)
python3 inference.py --device npu

# Single-image inference (CPU)
python3 inference.py --device cpu

Inference Script

from transformers import AutoImageProcessor, ViTModel
from PIL import Image
import torch
import torch_npu  # registers the 'npu' device with PyTorch

# Load the image processor and the model
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae300m-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae300m-full2b-224')

# Move the model to the NPU and switch to eval mode
model = model.to('npu:0').eval()

# Prepare the image (convert to RGB so grayscale/RGBA files also work)
image = Image.open('path/to/image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to('npu:0') for k, v in inputs.items()}

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Extract features
features = outputs.last_hidden_state       # [1, 197, 1024]
cls_token = outputs.last_hidden_state[:, 0]  # [CLS] token, [1, 1024]

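Accuracy Evaluation Results

NPU vs. CPU accuracy comparison (8 synthetic test images):

Metric	Value
Max NMSE	0.0158%
Mean NMSE	0.0033%
Full hidden-state NMSE	0.0029%
CLS token NMSE	0.0038%
Required threshold	< 1%
Verdict	✅ PASS (well below the threshold)

Error source: small floating-point differences between the NPU and the CPU. Even in FP32, operator implementations differ slightly across hardware backends. The NMSE of every sample is far below the 1% threshold.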
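
The exact NMSE formula used by `accuracy_run.py` is not shown in this document. As a hedged sketch, the block below assumes the common definition, the sum of squared differences divided by the sum of squared reference values, with the CPU output as the reference. The function name `nmse` and the sample values are illustrative only.

```python
def nmse(reference, candidate):
    """Normalized mean squared error between two flat sequences of floats.

    Assumed definition: sum((c - r)^2) / sum(r^2), with `reference` as r.
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    numerator = sum((c - r) ** 2 for r, c in zip(reference, candidate))
    denominator = sum(r ** 2 for r in reference)
    if denominator == 0:
        raise ValueError("reference has zero energy; NMSE is undefined")
    return numerator / denominator


# Made-up values standing in for flattened CPU and NPU outputs.
cpu_out = [1.0, -2.0, 3.0, 0.5]
npu_out = [1.001, -1.999, 3.002, 0.5]

error_percent = nmse(cpu_out, npu_out) * 100  # expressed in percent, like the table
passed = error_percent < 1.0                  # the 1% threshold from the table
print(f"NMSE = {error_percent:.6f}%  ->  {'PASS' if passed else 'FAIL'}")
```

With real model outputs, the tensors would first be moved to the CPU and flattened (for example with `tensor.cpu().flatten().tolist()`) before being passed to a function like this.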

Performance Evaluation Results

Device	Batch	Latency (ms)	Throughput (img/s)
CPU	1	5681.77	-
NPU	1	22.86	43.74
NPU	4	21.77	183.70
NPU	8	26.64	300.26
NPU	16	46.75	342.26

Speedup (NPU vs. CPU, batch=1): 248.49x 🚀
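
The throughput and speedup figures follow from the latencies. The sketch below assumes the latency column is the time for one full batch, so throughput is batch size divided by latency; the helper names are illustrative and not taken from `accuracy_run_perf.py`.

```python
def throughput_img_per_s(batch_size: int, latency_ms: float) -> float:
    """Images per second, assuming latency_ms is the time for one full batch."""
    return batch_size / (latency_ms / 1000.0)


def speedup(baseline_latency_ms: float, latency_ms: float) -> float:
    """How many times faster the second measurement is than the baseline."""
    return baseline_latency_ms / latency_ms


# Latencies copied from the table above (NPU rows).
npu_latency_ms = {1: 22.86, 4: 21.77, 8: 26.64, 16: 46.75}
cpu_latency_ms = 5681.77

for batch, latency in npu_latency_ms.items():
    print(f"batch={batch:<2d} throughput={throughput_img_per_s(batch, latency):.2f} img/s")

print(f"speedup at batch=1: {speedup(cpu_latency_ms, npu_latency_ms[1]):.2f}x")
```

Values recomputed this way agree with the table to within a few hundredths; the small remaining gaps (for example roughly 248.5x versus the reported 248.49x) presumably come from the table showing rounded latencies.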

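Evaluation Scripts

Script	Function
`inference.py`	Single-image inference; supports `--device cpu/npu` and an `--image` argument
`accuracy_run.py`	CPU vs. NPU accuracy comparison; computes the NMSE
`accuracy_run_perf.py`	Performance benchmark: latency and throughput
`check_accuracy_run_perf.py`	Parses the log and reports PASS/FAIL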
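
`inference.py` itself is not reproduced in this document. As a rough sketch of how its `--device` and `--image` options could be parsed, the block below uses `argparse`. The defaults (`npu` for the device, no image path) and the helper `resolve_device` are assumptions made for illustration, not the script's confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Single-image feature extraction (sketch)")
    # Restrict the device to the two values listed in the table above.
    parser.add_argument("--device", choices=["cpu", "npu"], default="npu",
                        help="backend to run on (default is an assumption)")
    parser.add_argument("--image", default=None,
                        help="path to the input image (default is an assumption)")
    return parser


def resolve_device(name: str) -> str:
    """Map the CLI choice to a torch device string; 'npu:0' matches a single-card setup."""
    return "npu:0" if name == "npu" else "cpu"


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--device", "cpu", "--image", "path/to/image.jpg"])
print(resolve_device(args.device), args.image)
```

Limiting `--device` with `choices` makes a typo such as `--device gpu` fail at argument parsing rather than later, when the model is moved to the device.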

Run the full evaluation:

export HF_ENDPOINT=https://hf-mirror.com

# 1. Accuracy evaluation
python3 accuracy_run.py | tee accuracy_log.txt

# 2. Performance evaluation
python3 accuracy_run_perf.py | tee perf_log.txt

# 3. Merge the logs
cat accuracy_log.txt perf_log.txt > log.txt

# 4. Check the results
python3 check_accuracy_run_perf.py log.txt

File Structure

webssl-mae300m-full2b-224-ascend/
├── readme.md                  # This file: deployment and evaluation guide
├── inference.py               # Inference script (NPU/CPU)
├── accuracy_run.py            # Accuracy evaluation
├── accuracy_run_perf.py       # Performance benchmark
├── check_accuracy_run_perf.py # Result check
└── log.txt                    # Evaluation result log

Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning}, 
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}