m0_74196153/aimv2-large-patch14-224-distilled

Apple AIMv2 Large Patch14 224 Distilled: Ascend NPU Adaptation

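1. Introduction

This document records the inference adaptation and validation results for apple/aimv2-large-patch14-224-distilled on the Ascend 910B4 NPU.

AIMv2 is a family of vision encoders from Apple, pre-trained with a multimodal autoregressive objective. The 224-distilled variant is a distilled ViT model with about 1.15B parameters.

Paper: Multimodal Autoregressive Pre-training of Large Vision Encoders
Original weights: ModelScope / HuggingFace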

2. Environment

Component	Version
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1
Transformers	4.57.6
CANN	8.5.1
NPU	Ascend 910B4
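
Before running inference it can help to compare the installed Python packages against the table above. The following is a minimal sketch using only the standard library; the `EXPECTED` mapping and the helper names are illustrative assumptions, and CANN and the NPU driver are not Python packages, so they are not covered here.

```python
from importlib import metadata

# Versions copied from the environment table; used only for comparison.
EXPECTED = {
    "torch": "2.9.0+cpu",
    "torch_npu": "2.9.0.post1",
    "transformers": "4.57.6",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected=EXPECTED):
    """Build one human-readable line per package in `expected`."""
    lines = []
    for name, have in installed_versions(expected).items():
        want = expected[name]
        status = "ok" if have == want else "differs"
        lines.append(f"{name}: installed={have} documented={want} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A "differs" line does not necessarily mean the setup is broken; it only flags that the results in this document were measured with a different version.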

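3. Adaptation Notes

Weight key conversion: the original safetensors file uses Timm-style names (embeddings.patch_embed, encoder.layers.{i}.attention.q_proj/k_proj/v_proj). convert_weights.py maps them to the HuggingFace format automatically and merges q/k/v into a fused qkv.
No operator changes: Conv2d, SiLU, RMSNorm, ScaledDotProductAttention, and the other operators used by the model are natively supported on the NPU.
Direct loading: because the model_type field does not match, the model is loaded directly from the local modeling file instead of through AutoModel.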
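
The q/k/v merge described above can be sketched as follows. This is not the content of convert_weights.py; it is a minimal illustration that assumes the fused parameter is named `<prefix>.qkv.<weight|bias>` (a hypothetical target name) and that the model expects the rows in q, k, v order. It works on plain nested lists so it runs without PyTorch; with real tensors the `concat` argument would be something like `lambda parts: torch.cat(parts, dim=0)`.

```python
import re

# Matches keys such as "encoder.layers.3.attention.q_proj.weight".
QKV_KEY = re.compile(
    r"^(?P<prefix>.+\.attention)\.(?P<which>[qkv])_proj\.(?P<param>weight|bias)$"
)


def concat_rows(parts):
    """Concatenate along the first axis (the list analogue of torch.cat(..., dim=0))."""
    merged = []
    for part in parts:
        merged.extend(part)
    return merged


def merge_qkv(state, concat=concat_rows):
    """Return a new dict in which each q/k/v triple is replaced by one fused entry."""
    merged = {}
    pending = {}  # (prefix, param) -> {"q": ..., "k": ..., "v": ...}
    for key, value in state.items():
        match = QKV_KEY.match(key)
        if match is None:
            merged[key] = value  # unrelated keys pass through unchanged
            continue
        group = pending.setdefault((match["prefix"], match["param"]), {})
        group[match["which"]] = value
    for (prefix, param), group in pending.items():
        missing = [name for name in "qkv" if name not in group]
        if missing:
            raise KeyError(f"{prefix}: missing {missing} for {param}")
        # Hypothetical target name; the q, k, v order must match how the model splits qkv.
        merged[f"{prefix}.qkv.{param}"] = concat([group["q"], group["k"], group["v"]])
    return merged
```

Raising on an incomplete triple is a deliberate choice here: a silently missing projection would otherwise only show up later as a shape mismatch when the weights are loaded.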

4. Usage

# Install dependencies
pip install torch torch_npu transformers safetensors pillow

# Run inference (random input)
python inference.py

# Run inference (real image)
python inference.py --image path/to/image.jpg

# FP16 inference
python inference.py --dtype float16

By default, the inference script loads weights from /opt/atomgit/.cache/modelscope/hub/models/apple/aimv2-large-patch14-224-distilled. Set the MODEL_DIR environment variable to use a different path.
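
The option handling described above could look roughly like the sketch below. It is an illustration, not the actual inference.py: the function names are hypothetical, and the `float32` default for `--dtype` is an assumption based on the FP32 results reported in this document.

```python
import argparse
import os
from pathlib import Path

# Default location quoted in this README.
DEFAULT_MODEL_DIR = (
    "/opt/atomgit/.cache/modelscope/hub/models/"
    "apple/aimv2-large-patch14-224-distilled"
)


def resolve_model_dir(env=None):
    """Use MODEL_DIR when it is set and non-empty, else the default path."""
    env = os.environ if env is None else env
    return Path(env.get("MODEL_DIR") or DEFAULT_MODEL_DIR)


def parse_args(argv=None):
    """Parse the two flags shown in the usage commands."""
    parser = argparse.ArgumentParser(description="AIMv2 inference (sketch)")
    # Without --image the script is described as running on random input.
    parser.add_argument("--image", default=None, help="path to an input image")
    # Assumed default: float32.
    parser.add_argument("--dtype", choices=["float32", "float16"], default="float32")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"model_dir={resolve_model_dir()} image={args.image} dtype={args.dtype}")
```

For example, `MODEL_DIR=/data/aimv2 python inference.py --dtype float16` would then read weights from `/data/aimv2` (an example path) instead of the default cache directory.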

5. Accuracy Evaluation

Element-wise comparison of CPU and NPU outputs on identical inputs (FP32):

seed	max_diff	mean_diff	cos_sim	Result
42	5.22e-03	6.17e-05	1.000015	✅ PASS
123	3.11e-03	6.49e-05	1.000014	✅ PASS
256	3.77e-03	5.76e-05	1.000016	✅ PASS
789	3.73e-03	6.21e-05	1.000016	✅ PASS
1024	2.15e-03	5.31e-05	1.000016	✅ PASS

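Conclusion: the cosine similarity between CPU and NPU outputs is ≈ 1.0 for all five seeds (values slightly above 1 are floating-point rounding), and the maximum absolute difference does not exceed 5.22e-03, so accuracy meets the requirement (error < 1%).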
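
For reference, the three metrics in the table can be computed as in the sketch below. It assumes both outputs have been flattened to equal-length sequences of floats; the function name is illustrative, and the evaluation script used for this document may differ in detail.

```python
import math


def compare_outputs(cpu_out, npu_out):
    """Return max/mean absolute difference and cosine similarity of two flat outputs."""
    if len(cpu_out) != len(npu_out) or not cpu_out:
        raise ValueError("outputs must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(cpu_out, npu_out)]
    dot = sum(a * b for a, b in zip(cpu_out, npu_out))
    norm_cpu = math.sqrt(sum(a * a for a in cpu_out))
    norm_npu = math.sqrt(sum(b * b for b in npu_out))
    if norm_cpu == 0.0 or norm_npu == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero output")
    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "cos_sim": dot / (norm_cpu * norm_npu),
    }
```

Mathematically the cosine similarity cannot exceed 1. When the dot product and norms are accumulated in FP32, rounding can push the computed value slightly above 1, which is consistent with readings such as 1.000015 in the table.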

6. Performance Benchmark

batch_size	avg (ms)	p50 (ms)	p99 (ms)
1	22.9	22.6	26.0
4	23.1	22.9	26.2

Test conditions: Ascend 910B4, FP32, random 224×224 input, averaged over 20 runs.
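
The avg / p50 / p99 columns can be derived from per-run latencies as sketched below. This is an assumption-laden illustration rather than the benchmark script itself: the nearest-rank percentile definition, the warm-up count, and the function names are all choices made here, since the document only states that 20 runs were averaged.

```python
import math
import time


def percentile(sorted_samples, q):
    """Nearest-rank percentile of an ascending list, for q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def summarize(latencies_ms):
    """Return the average, p50 and p99 of a list of latencies in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
    }


def benchmark(run_once, runs=20, warmup=3):
    """Time `run_once` `runs` times after `warmup` untimed calls (warmup is an assumption)."""
    for _ in range(warmup):
        run_once()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()  # must block until the device has finished its work
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(latencies_ms)
```

With only 20 samples, the nearest-rank p99 equals the slowest run. On an asynchronous device such as an NPU, `run_once` needs to synchronize with the device before returning; otherwise the timer captures only the kernel launch, not the computation.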

7. Files

├── inference.py                  # Inference entry point
├── convert_weights.py            # Weight key conversion
├── modeling_aimv2_distilled.py   # Model definition
├── configuration_aimv2_distilled.py  # Model configuration
└── README.md                     # This document

8. Citation

@misc{fini2024multimodalautoregressivepretraininglarge,
  author={Fini, Enrico and Shukor, Mustafa and Li, Xiujun and others},
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year={2024},
  eprint={2411.14402},
  archivePrefix={arXiv},
}