MobileViT-Small on Ascend NPU

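1. Introduction

This document records the adaptation, deployment, and validation results of apple/mobilevit-small on the Huawei Ascend 910B NPU.

MobileViT (Mobile Vision Transformer) is a lightweight vision Transformer proposed by Apple. It combines efficient MobileNetV2-style convolutional layers with the global processing capability of Transformers. MobileViT needs no positional embeddings and can be dropped directly into a CNN, balancing efficient deployment on mobile devices with strong performance on vision tasks.

MobileViT-Small is pretrained on ImageNet-1k (resolution 256×256), has about 5.6M parameters, and reaches about 78.4% ImageNet Top-1 accuracy.

Adaptation highlights:

torch_npu is used to migrate the PyTorch model to the Ascend NPU
transfer_to_npu automatically maps CUDA APIs to their NPU equivalents
Numerical consistency between CPU and NPU has been verified (logits cosine similarity > 0.999)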
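
The device selection implied by the points above can be sketched as follows. This is a minimal sketch, not the code in inference.py: the helper name pick_device is hypothetical, and the fallback to CPU when torch or torch_npu is missing is an assumption made so the snippet runs on any machine.

```python
import importlib.util


def pick_device():
    """Return "npu" when torch_npu is usable, otherwise "cpu" (hypothetical helper)."""
    # Check for the packages without importing them, so this also runs on hosts without an NPU stack.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the "npu" backend with PyTorch)
        # Importing this module patches torch.cuda.* calls to their torch.npu.* counterparts.
        from torch_npu.contrib import transfer_to_npu  # noqa: F401
        return "npu" if torch.npu.is_available() else "cpu"
    except Exception:
        # Assumption: any failure to initialise the NPU stack means we should use the CPU.
        return "cpu"


device = pick_device()
print(f"selected device: {device}")
```

A model loaded with transformers could then be moved with model.to(device), where device is one of the two strings returned here.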

2. Validation Environment

Component	Version
`CANN`	`8.5.1`
`torch`	`2.5.1`
`torch-npu`	`2.5.1.dev20260320`
`transformers`	`4.47.1`
`Pillow`	`10.4.0`

NPU: Ascend 910B4 (1 card, 32 GB HBM)
Operating system: Linux 5.10.0 aarch64

3. Quick Start

3.1 Environment Setup

pip install torch transformers pillow requests
# Make sure CANN and torch_npu are installed correctly

3.2 Download the Model

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download apple/mobilevit-small \
  --local-dir ./model --local-dir-use-symlinks False

3.3 Run Inference

# Run inference on a single image
python inference.py \
  --model_path ./model \
  --image /path/to/image.jpg \
  --top_k 5

# Run the performance benchmark
python inference.py \
  --model_path ./model \
  --benchmark \
  --iterations 50 \
  --warmup 10

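Parameters:

Parameter	Description	Default
`--model_path`	Path to the model weights	`./model`
`--image`	Path to the input image	`None`
`--top_k`	Number of Top-K predictions to display	`5`
`--benchmark`	Enable benchmark mode	`False`
`--iterations`	Number of benchmark iterations	`10`
`--warmup`	Number of warmup iterations	`3`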
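
The parameter table above can be expressed as an argparse definition. The sketch below is only an illustration built from the table; the function name build_parser is hypothetical, and the parser in inference.py may be written differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults (illustrative only)."""
    parser = argparse.ArgumentParser(description="MobileViT-Small inference (sketch)")
    parser.add_argument("--model_path", default="./model", help="path to the model weights")
    parser.add_argument("--image", default=None, help="path to the input image")
    parser.add_argument("--top_k", type=int, default=5, help="number of predictions to display")
    # store_true gives the documented default of False when the flag is absent.
    parser.add_argument("--benchmark", action="store_true", help="enable benchmark mode")
    parser.add_argument("--iterations", type=int, default=10, help="benchmark iterations")
    parser.add_argument("--warmup", type=int, default=3, help="warmup iterations")
    return parser


# Same flags as the benchmark command in section 3.3.
args = build_parser().parse_args(["--benchmark", "--iterations", "50", "--warmup", "10"])
print(args)
```

The benchmark command in section 3.3 overrides the defaults of 10 iterations and 3 warmup runs with 50 and 10.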

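4. Validation Results

4.1 Accuracy Validation

Method: images from the COCO validation set are fed to the model, inference is run on both CPU and NPU, and the two outputs are compared by logits cosine similarity and Top-1 label agreement.

Command: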

python accuracy_run.py ./model accuracy_report.json

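Results:

Metric	Value
Logits cosine similarity	0.999997
Top-1 label agreement	Identical (class_281: tabby cat)
Maximum probability difference	0.000492
Mean relative error	0.018505
Overall	PASS

The cosine similarity between the NPU and CPU output logits is 0.999997, the Top-1 predictions are identical, and the maximum probability difference is only 0.000492 (about 0.05%), well below the 1% threshold. The larger relative error at individual logit positions is a numerical amplification effect that occurs when the CPU output value is close to zero; it does not affect the classification result.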
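
The metrics in the table can be computed as in the sketch below. It assumes the usual definitions (cosine similarity over raw logits, probabilities via softmax, relative error as |npu - cpu| / |cpu|); the exact formulas in accuracy_run.py are not given in this document. The logit values are made up to show how a near-zero CPU logit inflates the relative error.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_logits(cpu_logits, npu_logits, eps=1e-12):
    """Compare two logit vectors; eps guards against division by zero (assumed definition)."""
    cpu_probs, npu_probs = softmax(cpu_logits), softmax(npu_logits)
    rel = [abs(n - c) / (abs(c) + eps) for c, n in zip(cpu_logits, npu_logits)]
    return {
        "cosine": cosine_similarity(cpu_logits, npu_logits),
        "top1_match": cpu_logits.index(max(cpu_logits)) == npu_logits.index(max(npu_logits)),
        "max_prob_diff": max(abs(p - q) for p, q in zip(cpu_probs, npu_probs)),
        "mean_rel_error": sum(rel) / len(rel),
        "max_rel_error": max(rel),
    }


# Synthetic logits: the third CPU value is close to zero.
cpu = [8.0, 2.0, 0.001, -1.5]
npu = [8.001, 1.999, 0.002, -1.501]
report = compare_logits(cpu, npu)
print(report)
```

In this toy case every logit differs by only 0.001, yet the third position has a relative error of about 100% because its CPU value is 0.001. The cosine similarity and the Top-1 label are unaffected.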

4.2 Performance Validation

Command:

python accuracy_run_perf.py ./model 50 perf_report.json

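NPU performance (50 iterations, 10 warmup runs):

Metric	Value
Mean latency	17.62 ms
P50 latency	17.55 ms
P90 latency	18.16 ms
Minimum latency	16.93 ms
Maximum latency	18.62 ms
Throughput	56.77 images/s

A single inference of MobileViT-Small on the Ascend 910B NPU takes about 17.62 ms on average, for a throughput of 56.77 images/s, which is sufficient for real-time image classification.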
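
The latency statistics above can be derived from per-iteration timings as sketched below. The nearest-rank percentile and the throughput definition (1000 divided by the mean latency in milliseconds, batch size 1) are assumptions; accuracy_run_perf.py may compute them differently. The sample timings are synthetic.

```python
import math
import statistics


def summarize_latencies(latencies_ms):
    """Summarise per-iteration latencies (ms) collected after warmup."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    mean_ms = statistics.fmean(ordered)
    return {
        "mean_ms": mean_ms,
        "p50_ms": percentile(50),
        "p90_ms": percentile(90),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        # One image per iteration, so images per second is 1000 ms divided by the mean latency.
        "throughput": 1000.0 / mean_ms,
    }


stats = summarize_latencies([10.0, 12.0, 11.0, 13.0, 14.0])
print(stats)
```

Under this definition, a mean latency of 17.62 ms gives 1000 / 17.62 ≈ 56.75 images/s, close to the reported 56.77; the small gap is presumably due to rounding of the mean.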

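5. Model Information

Property	Value
Model architecture	MobileViTForImageClassification
Parameters	5.6M
Input size	256×256 (center crop), 288×288 (resize)
Pretraining dataset	ImageNet-1k
Top-1 accuracy	78.4%
Top-5 accuracy	94.1%
Hidden dimensions	[144, 192, 240]
Activation function	SiLU (Swish)
Output classes	1000 (ImageNet)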

6. Project Structure

.
├── model/                      # Model weights
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── preprocessor_config.json
│   └── ...
├── inference.py                # NPU inference script
├── accuracy_run.py             # Accuracy validation script
├── accuracy_run_perf.py        # Performance benchmark script
├── accuracy_report.json        # Accuracy validation report
├── perf_report.json            # Performance benchmark report
└── readme.md                   # This document

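7. Notes

Accuracy validation: MobileViT is a deterministic model (it has no random sampling components), so CPU and NPU outputs agree closely. Larger relative errors appear only at logit values close to zero and have no effect on the Top-1 classification.
NPU initialization: transfer_to_npu automatically replaces torch.cuda.* with torch.npu.*. A warning on first import is expected.
Image preprocessing: the model expects BGR pixel order (not RGB). MobileViTImageProcessor flips the channels automatically.
Input size: preprocessing resizes the input image to 288×288 and then center-crops it to 256×256.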
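
The geometry of the last two notes can be illustrated with plain Python. This is a sketch of the arithmetic only, assuming the image has already been resized to 288×288 as stated above; the function names are hypothetical and this is not how MobileViTImageProcessor is implemented.

```python
def center_crop_box(width, height, crop_w, crop_h):
    """Return (left, top, right, bottom) of a centered crop window."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def flip_channel_order(pixel):
    """Reverse the channel order of one pixel, e.g. RGB -> BGR."""
    return pixel[::-1]


# A 288x288 image center-cropped to 256x256 loses a 16-pixel border on every side.
box = center_crop_box(288, 288, 256, 256)
print(box)

# An RGB pixel (R=255, G=0, B=10) becomes (10, 0, 255) in BGR order.
bgr = flip_channel_order((255, 0, 10))
print(bgr)
```

When the image processor is used, both steps happen inside it, so callers pass ordinary RGB images and do not flip channels themselves.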

8. Citation

@inproceedings{vision-transformer,
  title = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
  author = {Sachin Mehta and Mohammad Rastegari},
  year = {2022},
  url = {https://arxiv.org/abs/2110.02178}
}

Adapted by: Ascend-SACT
Tags: #NPU #Ascend #MobileViT #ImageClassification #Vision #PyTorch

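Accuracy Conclusion

Based on the current evaluation data, the cosine similarity between CPU and NPU outputs deviates from 1 by 0.0003%, which is below the 1% accuracy requirement.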
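
The 0.0003% figure follows from the cosine similarity reported in section 4.1. The arithmetic below assumes the error is defined as one minus the cosine similarity, expressed as a percentage; the variable names are illustrative.

```python
cosine_similarity = 0.999997   # value reported in section 4.1
threshold_percent = 1.0        # accuracy requirement stated above

# Deviation from perfect agreement, as a percentage.
error_percent = (1.0 - cosine_similarity) * 100.0
passes = error_percent < threshold_percent

print(f"error: {error_percent:.4f}%  pass: {passes}")
```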

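Evidence of Successful Inference

This repository provides a complete inference script that supports both CPU and NPU:

# NPU inference
python3 inference.py --device npu

# CPU inference
python3 inference.py --device cpu

When inference finishes, the script prints the predictions and the elapsed time, which shows that the model ran successfully on the NPU.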
