Apple AIMv2 Large Patch14 224 Lit: Ascend NPU Adaptation

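1. Introduction

This document records the inference adaptation and validation results for apple/aimv2-large-patch14-224-lit on the Ascend 910B4 NPU.

AIMv2 Lit is a CLIP-like vision-language dual-encoder model consisting of a Vision Encoder and a Text Encoder, with approximately 1.63B parameters.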
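
As a rough illustration of what "CLIP-like" means here, the sketch below computes image-to-text logits from the two encoders' embeddings: each embedding is L2-normalized, and the scaled dot products form a `logits_per_image` matrix. The function names, the plain-list representation, and the `logit_scale` value are assumptions made for illustration; they are not taken from `modeling_aimv2_lit.py`.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def clip_style_logits(
    image_embeds: Sequence[Sequence[float]],
    text_embeds: Sequence[Sequence[float]],
    logit_scale: float = 1.0,  # assumed value; real models usually learn this scale
) -> List[List[float]]:
    """Return a [num_images][num_texts] matrix of scaled cosine similarities."""
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    return [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
```

Row `i` of the result scores image `i` against every text prompt, so a softmax over a row gives per-prompt probabilities for that image.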

Paper: Multimodal Autoregressive Pre-training of Large Vision Encoders
Original weights: ModelScope / HuggingFace

2. Environment

Component	Version
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1
Transformers	4.57.6
CANN	8.5.1
NPU	Ascend 910B4
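
Before running inference, it can help to confirm that the installed Python packages match the table above. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the distribution names match the pip names used in the install command. CANN and the NPU driver are not pip packages, so they are not covered.

```python
from importlib import metadata
from typing import Dict, Optional

# Python-package versions copied from the environment table.
EXPECTED = {
    "torch": "2.9.0+cpu",
    "torch_npu": "2.9.0.post1",
    "transformers": "4.57.6",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected: Dict[str, str] = EXPECTED) -> Dict[str, str]:
    """Compare installed versions against the expected ones."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"mismatch (found {found}, expected {wanted})"
    return report
```

Printing the items of `check_environment()` gives a one-line status per package.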

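3. Adaptation Notes

Dual-encoder structure: Vision Encoder (24-layer ViT) + Text Encoder (12-layer Transformer)
Weight key conversion: vision_model. → image_encoder., text_model. → text_encoder.; the separate q/k/v projections are merged into a single qkv
No operator changes: all operators are natively supported on the NPU
Projection layers: weight mapping for visual_projection / text_projection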
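
The key conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `convert_weights.py`: the prefix renames come from this README, but the `q_proj` / `k_proj` / `v_proj` source names, the `qkv` target name, and the q-then-k-then-v concatenation order are hypothetical. The `concat` argument keeps the sketch independent of any tensor library; with PyTorch one would typically pass `lambda ts: torch.cat(ts, dim=0)`.

```python
import re
from typing import Any, Callable, Dict, List

# Prefix renames stated in this README.
PREFIX_MAP = {
    "vision_model.": "image_encoder.",
    "text_model.": "text_encoder.",
}

# Hypothetical source naming: "<stem>.q_proj.weight", "<stem>.k_proj.bias", ...
QKV_PATTERN = re.compile(r"^(?P<stem>.*)\.(?P<which>[qkv])_proj\.(?P<param>weight|bias)$")


def rename_prefix(key: str) -> str:
    """Apply the first matching prefix rename; leave other keys untouched."""
    for old, new in PREFIX_MAP.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key


def convert_state_dict(
    state_dict: Dict[str, Any],
    concat: Callable[[List[Any]], Any],
) -> Dict[str, Any]:
    """Rename prefixes and merge separate q/k/v tensors into one qkv tensor."""
    converted: Dict[str, Any] = {}
    pending: Dict[tuple, Dict[str, Any]] = {}  # (stem, param) -> {"q": ..., "k": ..., "v": ...}

    for key, tensor in state_dict.items():
        new_key = rename_prefix(key)
        match = QKV_PATTERN.match(new_key)
        if match is None:
            converted[new_key] = tensor
        else:
            group = pending.setdefault((match["stem"], match["param"]), {})
            group[match["which"]] = tensor

    for (stem, param), parts in pending.items():
        missing = {"q", "k", "v"} - parts.keys()
        if missing:
            raise KeyError(f"{stem}: missing {sorted(missing)} for {param}")
        # Assumed order: q, then k, then v along the output dimension.
        converted[f"{stem}.qkv.{param}"] = concat([parts["q"], parts["k"], parts["v"]])

    return converted
```

Keys that match neither a prefix nor the q/k/v pattern, such as `visual_projection.weight`, pass through unchanged in this sketch.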

4. Usage

# Install dependencies
pip install torch torch_npu transformers safetensors pillow

# Run inference
python inference.py

# Specify a text prompt
python inference.py --text "a photo of a dog"

# Use a real image
python inference.py --image path/to/image.jpg --text "a cat sitting on a chair"

# FP16 inference
python inference.py --dtype float16
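
For readers adapting the script, the flags shown above could be parsed along these lines. This is a hypothetical sketch, not the actual argument handling in `inference.py`: only the flag names `--text`, `--image`, and `--dtype` come from this README, while the defaults and the allowed `--dtype` choices are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="AIMv2 Lit inference (illustrative CLI sketch)"
    )
    # Assumed default prompt; the README only shows this string as an example value.
    parser.add_argument("--text", default="a photo of a dog", help="text prompt")
    # None means no image was given; a caller could then fall back to a random input.
    parser.add_argument("--image", default=None, help="path to an image file")
    # Assumed choices and default; the README only demonstrates float16.
    parser.add_argument(
        "--dtype",
        choices=["float32", "float16"],
        default="float32",
        help="inference precision",
    )
    return parser
```

Calling `build_parser().parse_args()` at the top of a script would then expose `args.text`, `args.image`, and `args.dtype`.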

5. Accuracy Evaluation

Element-wise comparison of CPU and NPU outputs on identical inputs (FP32 logits_per_image):

seed	max_diff	mean_diff	cos_sim	Result
42	1.79e-04	1.79e-04	1.000000	✅ PASS
123	2.46e-04	2.46e-04	1.000000	✅ PASS
256	1.58e-04	1.58e-04	1.000000	✅ PASS
789	6.29e-05	6.29e-05	1.000000	✅ PASS
1024	5.91e-05	5.91e-05	1.000000	✅ PASS

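Conclusion: across all five seeds, the CPU and NPU outputs have a cosine similarity of 1.000000 and a max_diff of at most 2.46e-04, which meets the accuracy requirement.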
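
The three metrics in the table can be computed as in the sketch below, which treats each output as a flat list of floats. The function name and the list-based interface are assumptions for illustration; the actual evaluation script is not included in this README.

```python
import math
from typing import Dict, Sequence


def compare_outputs(ref: Sequence[float], test: Sequence[float]) -> Dict[str, float]:
    """Return max/mean absolute difference and cosine similarity of two outputs."""
    if len(ref) != len(test) or len(ref) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    diffs = [abs(a - b) for a, b in zip(ref, test)]
    dot = sum(a * b for a, b in zip(ref, test))
    norm_ref = math.sqrt(sum(a * a for a in ref))
    norm_test = math.sqrt(sum(b * b for b in test))
    if norm_ref == 0.0 or norm_test == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")

    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "cos_sim": dot / (norm_ref * norm_test),
    }
```

Note that when the compared tensor has a single element, `max_diff` and `mean_diff` are necessarily equal and the cosine similarity is 1 whenever the two values share a sign, so comparing a larger logits matrix or the raw embeddings gives a more informative check.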

6. Performance Benchmark

batch_size	avg (ms)	p50 (ms)	p99 (ms)
1	33.9	33.9	34.4
4	34.7	34.7	34.9

Test conditions: Ascend 910B4, FP32, random 224×224 input, averaged over 20 runs.
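
A latency table like the one above could be produced with a small timing harness such as the following sketch. The function names, the warmup count, and the nearest-rank percentile method are assumptions; the README states only that 20 runs were averaged. On an asynchronous device such as an NPU, `run_once` should block until the forward pass has finished (for example by synchronizing the device), otherwise only the launch overhead is measured.

```python
import math
import time
from typing import Callable, Dict, Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Reduce raw per-run latencies to avg / p50 / p99."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


def benchmark(run_once: Callable[[], None], iters: int = 20, warmup: int = 3) -> Dict[str, float]:
    """Time `run_once` for `iters` runs after `warmup` untimed runs."""
    for _ in range(warmup):  # assumed warmup to exclude one-time compilation cost
        run_once()
    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(latencies_ms)
```

With only 20 samples, the nearest-rank p99 is simply the slowest run, so a larger `iters` is needed for the tail percentile to be meaningful.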

7. Files

├── inference.py                  # Inference entry point
├── convert_weights.py            # Weight key conversion
├── modeling_aimv2_lit.py         # Model definition (dual encoder)
├── configuration_aimv2_lit.py    # Model configuration
└── README.md                     # This document

8. Citation

@misc{fini2024multimodalautoregressivepretraininglarge,
  author={Fini, Enrico and Shukor, Mustafa and Li, Xiujun and others},
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year={2024},
  eprint={2411.14402},
  archivePrefix={arXiv},
}
