MobileViT-X-Small on Ascend NPU

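1. Introduction

This document records the adaptation, deployment, and validation of apple/mobilevit-x-small on the Huawei Ascend 910B NPU.

MobileViT (Mobile Vision Transformer) is a lightweight vision Transformer proposed by Apple. It combines efficient MobileNetV2-style convolution layers with the global processing capability of Transformers. MobileViT blocks need no positional embeddings and can be placed directly inside a CNN, which balances efficient deployment on mobile devices with strong performance on vision tasks.

MobileViT-X-Small is the mid-sized variant of the MobileViT family. It is pre-trained on ImageNet-1k (resolution 256×256), has about 2.3M parameters, and reaches about 74.8% ImageNet Top-1 accuracy.

Adaptation highlights:

Use torch_npu to migrate the PyTorch model to the Ascend NPU
Use transfer_to_npu to map CUDA APIs to their NPU equivalents automatically
Verified numerical consistency between CPU and NPU (logits cosine similarity > 0.999)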
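
The migration steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code of this repository's scripts: the helper names `torch_npu_installed` and `select_device` are hypothetical, and the fallback to CPU is an assumption made so the snippet also runs on machines without an NPU.

```python
import importlib.util


def torch_npu_installed() -> bool:
    """Check whether the torch_npu package can be located, without importing it."""
    return importlib.util.find_spec("torch_npu") is not None


def select_device() -> str:
    """Return "npu" when torch_npu is installed and reports a device, else "cpu"."""
    if not torch_npu_installed():
        return "cpu"
    import torch
    import torch_npu  # noqa: F401  registers the "npu" backend with PyTorch
    # Importing transfer_to_npu redirects torch.cuda.* calls to torch.npu.*
    from torch_npu.contrib import transfer_to_npu  # noqa: F401
    return "npu" if torch.npu.is_available() else "cpu"
```

A model would then be moved with `model.to(select_device())`; importing `torch` before `torch_npu` is the usual order.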

2. Validation environment

Component	Version
`CANN`	`8.5.1`
`torch`	`2.5.1`
`torch-npu`	`2.5.1.dev20260320`
`transformers`	`4.47.1`
`Pillow`	`10.4.0`

NPU: Ascend 910B4 (1 card, 32 GB HBM)
Operating system: Linux 5.10.0 aarch64

3. Quick start

3.1 Environment setup

pip install torch transformers pillow requests
# Make sure CANN and torch_npu are installed correctly

3.2 Download the model

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download apple/mobilevit-x-small \
  --local-dir ./model --local-dir-use-symlinks False

3.3 Run inference

# Run inference on a single image
python inference.py \
  --model_path ./model \
  --image /path/to/image.jpg \
  --top_k 5

# Run the performance benchmark
python inference.py \
  --model_path ./model \
  --benchmark \
  --iterations 50 \
  --warmup 10

Arguments:

Argument	Description	Default
`--model_path`	Path to the model weights	`./model`
`--image`	Path to the input image	`None`
`--top_k`	Number of Top-K predictions to display	`5`
`--benchmark`	Enable benchmark mode	`False`
`--iterations`	Number of benchmark iterations	`10`
`--warmup`	Number of warm-up iterations	`3`
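
For readers who want to see how the argument table maps onto code, here is a sketch of an equivalent `argparse` parser. The parser inside `inference.py` is not shown in this document, so `build_parser` is a hypothetical reconstruction that only mirrors the names and defaults from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="MobileViT-X-Small inference (sketch)")
    parser.add_argument("--model_path", default="./model", help="path to the model weights")
    parser.add_argument("--image", default=None, help="path to the input image")
    parser.add_argument("--top_k", type=int, default=5, help="number of predictions to display")
    # store_true makes the flag default to False and become True when present
    parser.add_argument("--benchmark", action="store_true", help="enable benchmark mode")
    parser.add_argument("--iterations", type=int, default=10, help="benchmark iterations")
    parser.add_argument("--warmup", type=int, default=3, help="warm-up iterations")
    return parser
```

Calling `build_parser().parse_args(["--benchmark", "--iterations", "50", "--warmup", "10"])` reproduces the benchmark invocation shown above.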

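4. Validation results

4.1 Accuracy validation

Method: feed a validation image from the COCO dataset to the model, run inference on CPU and on NPU, and compare the cosine similarity of the output logits and the agreement of the Top-1 class label.

Command: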

python accuracy_run.py ./model accuracy_report.json

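Results:

Metric	Value
Logits cosine similarity	0.999998
Top-1 label agreement	Identical (class_281: tabby cat)
Maximum probability difference	0.000768
Mean relative error	0.010909
Overall	PASS

The cosine similarity between the NPU and CPU output logits is 0.999998, the Top-1 predictions are identical, and the maximum probability difference is only 0.000768 (about 0.08%), far below the 1% threshold. Inference accuracy of MobileViT-X-Small on the NPU closely matches the CPU.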
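
The metrics in the table can be computed along the following lines. This is a sketch under stated assumptions rather than the contents of `accuracy_run.py`: the function names are hypothetical, and the pass rule (matching Top-1 and a maximum probability difference below the 1% threshold) is inferred from the paragraph above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compare_logits(cpu_logits, npu_logits, threshold=0.01):
    """Compare two logit vectors; the threshold of 0.01 corresponds to 1%."""
    cpu_probs, npu_probs = softmax(cpu_logits), softmax(npu_logits)
    max_prob_diff = max(abs(p - q) for p, q in zip(cpu_probs, npu_probs))
    top1_cpu = max(range(len(cpu_logits)), key=cpu_logits.__getitem__)
    top1_npu = max(range(len(npu_logits)), key=npu_logits.__getitem__)
    top1_match = top1_cpu == top1_npu
    return {
        "cosine_similarity": cosine_similarity(cpu_logits, npu_logits),
        "top1_match": top1_match,
        "max_prob_diff": max_prob_diff,
        "passed": top1_match and max_prob_diff < threshold,  # assumed pass rule
    }
```

With real outputs, the inputs would be the 1000-element logit vectors from each device, converted to Python lists.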

4.2 Performance validation

Command:

python accuracy_run_perf.py ./model 50 perf_report.json

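NPU performance results (50 iterations, 10 warm-up iterations):

Metric	Value
Mean latency	17.59 ms
P50 latency	17.59 ms
P90 latency	17.74 ms
Minimum latency	17.30 ms
Maximum latency	18.60 ms
Throughput	56.84 images/s

A single inference of MobileViT-X-Small on the Ascend 910B NPU takes about 17.59 ms, for a throughput of 56.84 images/s. Compared with MobileViT-Small, the parameter count drops by roughly 60% (5.6M → 2.3M), yet inference latency is essentially unchanged; this is attributed to the compute bottleneck lying in the Transformer layers of the MobileViT blocks rather than in the convolution channel widths.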
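
The latency statistics above can be derived from a list of per-iteration timings as in the sketch below. It is not the code of `accuracy_run_perf.py`: the function names are hypothetical, the nearest-rank percentile definition is an assumption, and throughput is taken as 1000 divided by the mean latency in milliseconds for batch size 1.

```python
import math
import statistics
import time


def measure_latencies_ms(run_once, iterations=50, warmup=10):
    """Time run_once() after discarding warm-up runs; returns milliseconds."""
    for _ in range(warmup):
        run_once()
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        # On an NPU, run_once should end with torch.npu.synchronize() so the
        # timer does not stop before the asynchronous kernels have finished.
        run_once()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (assumed definition)."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Aggregate per-iteration latencies into the metrics reported in the table."""
    ordered = sorted(latencies_ms)
    mean = statistics.fmean(ordered)
    return {
        "mean_ms": mean,
        "p50_ms": percentile(ordered, 50),
        "p90_ms": percentile(ordered, 90),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "throughput_ips": 1000.0 / mean,  # images per second at batch size 1
    }
```

As a consistency check on the reported numbers, 1000 / 17.59 ms is roughly 56.85 images/s, which agrees with the table up to rounding.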

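5. Model information

Property	Value
Model architecture	MobileViTForImageClassification
Parameters	2.3M
Input size	256×256 (center crop), 288×288 (resize)
Pre-training dataset	ImageNet-1k
Top-1 accuracy	74.8%
Top-5 accuracy	92.3%
Hidden dimensions	[96, 120, 144]
Activation function	SiLU (Swish)
Output classes	1000 (ImageNet)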

6. Project structure

.
├── model/                      # Model weights
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── preprocessor_config.json
│   └── ...
├── inference.py                # NPU inference script
├── accuracy_run.py             # Accuracy validation script
├── accuracy_run_perf.py        # Performance benchmark script
├── accuracy_report.json        # Accuracy validation report
├── perf_report.json            # Performance test report
└── readme.md                   # This document

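7. Notes

Accuracy validation: MobileViT is a deterministic model (no random sampling components), so CPU and NPU outputs agree closely. Larger relative errors appear only for logit values close to zero and do not affect the Top-1 classification.
NPU initialization: transfer_to_npu automatically replaces torch.cuda.* with torch.npu.*. A warning on first import is expected.
Image preprocessing: the model expects BGR pixel order (not RGB); MobileViTImageProcessor flips the channels automatically.
Input size: preprocessing resizes the input image to 288×288 and then center-crops it to 256×256.

8. Citation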
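
To make the preprocessing notes concrete, the geometry can be sketched in plain Python. The helper names are hypothetical, and the sketch assumes the resize scales the shortest edge to 288 while preserving aspect ratio; in practice MobileViTImageProcessor performs these steps, so the code is illustrative only.

```python
def resize_shortest_edge(width, height, target=288):
    """Scale so the shorter side equals target, keeping the aspect ratio (assumed)."""
    shortest = min(width, height)
    return round(width * target / shortest), round(height * target / shortest)


def center_crop_box(width, height, crop=256):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def rgb_to_bgr(pixel):
    """Reverse the channel order of a single (R, G, B) pixel."""
    red, green, blue = pixel
    return blue, green, red
```

For a 640×480 input, the resize yields 384×288 and the crop window is (64, 16, 320, 272).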

@inproceedings{vision-transformer,
  title = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
  author = {Sachin Mehta and Mohammad Rastegari},
  year = {2022},
  url = {https://arxiv.org/abs/2110.02178}
}

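Adapted by: Ascend-SACT
Tags: #NPU #Ascend #MobileViT #ImageClassification #Vision #PyTorch

Accuracy conclusion

Based on the available evaluation data, the logits cosine similarity between CPU and NPU deviates from 1 by 0.0002% (1 − 0.999998), which is below the 1% accuracy requirement.

Evidence of successful inference

This repository provides a complete inference script that supports both CPU and NPU:

# NPU inference
python3 inference.py --device npu

# CPU inference
python3 inference.py --device cpu

When inference finishes, the script prints the predictions and the elapsed time, which indicates that the model ran successfully on the NPU.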
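
For orientation, a single-image classification on either device might look like the sketch below. It is not the content of `inference.py`: `classify` and `top_k_indices` are hypothetical helpers, and the heavy imports sit inside the function so that the module can be loaded without PyTorch installed.

```python
def top_k_indices(scores, k=5):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def classify(model_path, image_path, device_name="cpu", k=5):
    """Return the top-k (label, logit) pairs for one image on the given device."""
    import torch
    from PIL import Image
    from transformers import MobileViTForImageClassification, MobileViTImageProcessor

    if device_name == "npu":
        import torch_npu  # noqa: F401  registers the "npu" backend

    processor = MobileViTImageProcessor.from_pretrained(model_path)
    model = MobileViTForImageClassification.from_pretrained(model_path)
    model = model.eval().to(device_name)

    image = Image.open(image_path).convert("RGB")
    # The processor resizes, center-crops, and flips the channel order
    inputs = processor(images=image, return_tensors="pt").to(device_name)
    with torch.no_grad():
        logits = model(**inputs).logits[0].float().cpu().tolist()
    return [(model.config.id2label[i], logits[i]) for i in top_k_indices(logits, k)]
```

Running `classify("./model", "/path/to/image.jpg", "npu")` and then the same call with `"cpu"` gives two result lists that can be compared directly.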
