timm/vit_small_patch14_reg4_dinov2.lvd142m on Ascend NPU

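1. Introduction

This document records the inference adaptation and validation results for timm/vit_small_patch14_reg4_dinov2.lvd142m on Ascend NPU. The model is the DINOv2 ViT-S/14 variant included in TIMM (PyTorch Image Models). It is loaded and run directly through the timm library; a runtime monkey-patch transparently forwards torch.cuda calls to torch.npu, so the NPU adaptation requires no changes to the original model code.

Unlike the DINOv3 series, this model uses the DINOv2 architecture. It was trained on LVD-142M, takes 518x518 input, uses a patch size of 14, and includes 4 register tokens.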
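A minimal sketch of how a runtime torch.cuda -> torch.npu monkey-patch could work is given here. It is not the code from the validation scripts: the helper name forward_cuda_to_npu and the list of forwarded attributes are assumptions made for illustration, and the function takes the torch module as a parameter so the idea can be exercised on a stand-in object.

```python
from types import SimpleNamespace

# Hypothetical list of torch.cuda entry points to forward; the real scripts
# may patch a different set.
FORWARDED = ("is_available", "device_count", "current_device",
             "set_device", "synchronize", "empty_cache")


def forward_cuda_to_npu(torch_module, names=FORWARDED):
    """Rebind selected torch.cuda attributes to their torch.npu counterparts.

    Returns the names that were actually patched, so callers can log them.
    """
    cuda, npu = torch_module.cuda, torch_module.npu
    patched = []
    for name in names:
        target = getattr(npu, name, None)
        if target is not None:  # skip anything the NPU backend lacks
            setattr(cuda, name, target)
            patched.append(name)
    return patched


# Stand-in object so the sketch runs without torch or torch_npu installed.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False, device_count=lambda: 0),
    npu=SimpleNamespace(is_available=lambda: True, device_count=lambda: 2),
)
patched = forward_cuda_to_npu(fake_torch)
print(patched)                         # ['is_available', 'device_count']
print(fake_torch.cuda.is_available())  # True
```

With the real libraries, the equivalent call would be forward_cuda_to_npu(torch) after importing torch and torch_npu and before the model is created, so that any later torch.cuda lookup already resolves to the NPU backend.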

2. Downloading the Weights

The recommended way to download the model weights is:

AtomGit download

python3 -m atomgit download hf_mirrors/timm/vit_small_patch14_reg4_dinov2.lvd142m -d /opt/atomgit/weight/vit_small_patch14_reg4_dinov2.lvd142m

After the download completes, the directory should contain pytorch_model.bin (or model.safetensors), config.json, and related files.
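The downloaded directory can be checked and loaded along the following lines. This is a sketch, not the scripts' actual code: the helper names are hypothetical, preferring model.safetensors over pytorch_model.bin is an arbitrary choice, and it assumes the default-constructed timm model matches the checkpoint keys. The pretrained=False + load_state_dict() pattern itself is the one named in the Notes section.

```python
from pathlib import Path
import tempfile

# File names taken from the download step; safetensors is tried first here
# (an arbitrary preference for this sketch).
CANDIDATES = ("model.safetensors", "pytorch_model.bin")


def find_weight_file(weight_dir):
    """Return the first known weight file found in weight_dir, else None."""
    for name in CANDIDATES:
        path = Path(weight_dir) / name
        if path.is_file():
            return path
    return None


def load_local_model(weight_dir, model_name="vit_small_patch14_reg4_dinov2"):
    """Build the timm model without network access and load local weights."""
    import timm   # imported lazily so find_weight_file works without them
    import torch

    path = find_weight_file(weight_dir)
    if path is None:
        raise FileNotFoundError(f"no weight file in {weight_dir}")
    if path.suffix == ".safetensors":
        from safetensors.torch import load_file
        state = load_file(str(path))
    else:
        state = torch.load(str(path), map_location="cpu")
    model = timm.create_model(model_name, pretrained=False)
    model.load_state_dict(state)
    return model.eval()


# Self-check of the lookup logic on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    missing = find_weight_file(tmp)
    (Path(tmp) / "pytorch_model.bin").write_bytes(b"")
    found = find_weight_file(tmp).name
print(missing, found)  # None pytorch_model.bin
```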

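3. Installing Dependencies

Before running the validation scripts, make sure the following dependencies are installed:

pip install torch==2.9.0+cpu timm
pip install torch-npu==2.9.0.post1

The torch and torch-npu versions must match the installed CANN version.
timm is used to load the TIMM model and to preprocess images.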

4. Validation Environment

Component	Version
`Python`	`3.11.14`
`PyTorch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`timm`	`1.0.27`
`CANN`	`8.0.RC2`

NPU: 2 logical devices (Ascend910)
Model path: /opt/atomgit/weight/vit_small_patch14_reg4_dinov2.lvd142m
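To compare a local setup against the versions listed for the validation environment, the installed package versions can be printed with the standard library. The sketch below is an illustration only; the helper name installed_versions is hypothetical, and the CANN version is not covered because it is not a Python package.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands.
PACKAGES = ("torch", "torch-npu", "timm")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}\t{version or 'not installed'}")
```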

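5. Inference Validation

Run single-image feature-extraction inference to verify that the model's basic forward pass works on the NPU.

Sample image used for inference:

sample_image

python3 inference.py

Validation output: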

[INFO] Applied NPU monkey-patch (torch.cuda -> torch.npu)
[INFO] Loading image from: /opt/atomgit/vit_small_patch14_reg4_dinov2.lvd142m/data/val2017/000000000139.jpg
[INFO] Loading model: vit_small_patch14_reg4_dinov2.lvd142m
[INFO] Model moved to NPU device: Ascend910_9362
[INFO] Input shape: torch.Size([1, 3, 518, 518])
[INFO] Device: npu:0
[INFO] Warm-up inference...
[INFO] Running timed inference...

============================================================
Inference Results
============================================================
Device:        NPU (Ascend910_9362)
Latency:       10.75 ms
Pooled output shape:   torch.Size([1, 384])
Features shape:        torch.Size([1, 1374, 384])
Pooled output dtype:   torch.float32
Pooled output device:  npu:0
Pooled output first 5 values: [ 1.8751242 -0.5034357  1.8468268 -1.303242   1.1003127]
============================================================

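Conclusions:

The model loads onto the NPU device successfully.
Inference on a single 518x518 image takes about 10.75 ms.
The output feature shapes match expectations (pooled [1, 384], features [1, 1374, 384]).
The 1374 tokens correspond to 1 cls + 4 registers + 1369 patches (518/14 = 37, 37x37 = 1369).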
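The token arithmetic can be reproduced in a few lines. The function name token_count is illustrative; the default values are the ones stated in this document.

```python
def token_count(img_size=518, patch_size=14, num_registers=4, num_cls=1):
    """Number of tokens a ViT emits for a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be a multiple of patch_size")
    side = img_size // patch_size          # patches per side
    return num_cls + num_registers + side * side


print(518 // 14)       # 37
print(token_count())   # 1374
```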

6. Performance Evaluation

Use benchmark.py to stress-test NPU inference latency and throughput:

python3 benchmark.py

6.1 Latency Results

Metric	Value
`iterations`	`20`
`mean_ms`	`5.23 ms`
`stddev_ms`	`0.05 ms`
`min_ms`	`5.13 ms`
`max_ms`	`5.31 ms`
`p50_ms`	`5.24 ms`
`p90_ms`	`5.30 ms`
`p99_ms`	`5.31 ms`
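Statistics like those in the latency table could be collected as sketched below. This is not benchmark.py itself: the warm-up count, the nearest-rank percentile method, and the use of the population standard deviation are assumptions, and the timed workload here is a CPU placeholder.

```python
import math
import statistics
import time


def measure_latencies(fn, iterations=20, warmup=3, sync=None):
    """Time fn() in milliseconds after a few untimed warm-up calls.

    sync is an optional callable (for example torch.npu.synchronize) that
    blocks until queued device work has finished.
    """
    for _ in range(warmup):
        fn()
    if sync:
        sync()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        if sync:
            sync()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(sorted_ms, q):
    """Nearest-rank percentile; other tools may interpolate instead."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize(samples_ms):
    """Reduce raw samples to the fields used in the latency table."""
    ordered = sorted(samples_ms)
    return {
        "iterations": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "stddev_ms": statistics.pstdev(ordered),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "p50_ms": percentile(ordered, 50),
        "p90_ms": percentile(ordered, 90),
        "p99_ms": percentile(ordered, 99),
    }


stats = summarize(measure_latencies(lambda: sum(range(10_000))))
print(stats["iterations"])  # 20
```

On a device that executes work asynchronously, passing a synchronization callable matters: without it the timer would stop when the work is queued rather than when it finishes.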

6.2 Throughput Results

batch_size	throughput_ips	avg_latency_ms
`1`	`194.24`	`5.15`
`2`	`251.98`	`7.94`
`4`	`299.22`	`13.37`

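Conclusions:

Steady-state mean latency for a single 518x518 image is about 5.2 ms (5.23 ms mean, 0.05 ms standard deviation over 20 iterations).
Throughput reaches 299.22 images/sec at batch size 4.
Because the input resolution is large, a batch_size of at most 4 is recommended to avoid running out of NPU memory.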
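The throughput and latency columns are related by throughput = batch_size / latency. The short check below recomputes throughput from the rounded latencies in the table; the results agree with the reported figures to within that rounding. The function name is illustrative.

```python
def throughput_ips(batch_size, avg_latency_ms):
    """Images per second for a batch that takes avg_latency_ms to process."""
    return batch_size / (avg_latency_ms / 1000.0)


# (batch_size, avg_latency_ms) pairs from the throughput table.
for batch_size, latency_ms in [(1, 5.15), (2, 7.94), (4, 13.37)]:
    print(batch_size, round(throughput_ips(batch_size, latency_ms), 1))
# 1 194.2
# 2 251.9
# 4 299.2
```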

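7. Accuracy Evaluation

Use accuracy.py to compare NPU outputs against a CPU baseline. The metrics are vector-level relative error and cosine similarity:

python3 accuracy.py

Validation output: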

============================================================
Accuracy Validation Results
============================================================

pooler_output:
  Shape:                  (1, 384)
  Vector Relative Error:  0.007823 (PASS)
  Cosine Similarity:      0.999970 (PASS)
  MSE:                    0.0000913260
  Max Absolute Diff:      0.025931
  Overall:                PASS

features:
  Shape:                  (1, 1374, 384)
  Vector Relative Error:  0.004343 (PASS)
  Cosine Similarity:      0.999990 (PASS)
  MSE:                    0.0000310854
  Max Absolute Diff:      0.060189
  Overall:                PASS

============================================================
OVERALL: PASS (vector-level relative error < 1% and cosine similarity > 0.999)
============================================================

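Metric	Value	Threshold	Result
Pooled output vector relative error	`0.78%`	`< 1%`	PASS
Pooled output cosine similarity	`0.999970`	`> 0.999`	PASS
Features vector relative error	`0.43%`	`< 1%`	PASS
Features cosine similarity	`0.999990`	`> 0.999`	PASS

Conclusions:

The vector-level relative error between the NPU outputs and the CPU baseline is below 1% for both outputs.
Cosine similarity is above 0.999 for both outputs, indicating highly consistent feature spaces.
Overall accuracy validation passes.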
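The four reported metrics can be computed as sketched below. The exact formulas used by accuracy.py are not shown in this document, so the definitions here are assumptions: vector relative error is taken as ||npu - cpu|| / ||cpu|| over the flattened output, and the thresholds are those quoted in the OVERALL line.

```python
import math


def compare(npu, cpu):
    """Compare two flattened outputs given as equal-length float sequences."""
    diff = [a - b for a, b in zip(npu, cpu)]
    diff_norm = math.sqrt(sum(d * d for d in diff))
    npu_norm = math.sqrt(sum(a * a for a in npu))
    cpu_norm = math.sqrt(sum(b * b for b in cpu))
    dot = sum(a * b for a, b in zip(npu, cpu))
    return {
        "vector_relative_error": diff_norm / cpu_norm,
        "cosine_similarity": dot / (npu_norm * cpu_norm),
        "mse": sum(d * d for d in diff) / len(diff),
        "max_abs_diff": max(abs(d) for d in diff),
    }


def passes(metrics, max_rel_err=0.01, min_cosine=0.999):
    """Apply the thresholds quoted in the OVERALL line of the report."""
    return (metrics["vector_relative_error"] < max_rel_err
            and metrics["cosine_similarity"] > min_cosine)


cpu = [1.0, 2.0, 2.0]
npu = [1.0, 2.0, 2.003]
result = compare(npu, cpu)
print(round(result["vector_relative_error"], 6))  # 0.001
print(passes(result))                             # True
```

Pure Python keeps the sketch dependency-free; on real tensors the same quantities would normally be computed with vectorized operations after moving both outputs to the CPU.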

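8. Notes

Monkey-patch adaptation: the scripts adapt the model by monkey-patching torch.cuda -> torch.npu at runtime, so the timm source code is left unmodified.
TORCH_COMPILE_DISABLE: the scripts set TORCH_COMPILE_DISABLE=1 to avoid torch.compile compatibility issues on the NPU.
Input size: the default input size of DINOv2 ViT-S/14 is 518x518 (much larger than the 256x256 of DINOv3 ViT-S/16); preprocessing parameters are resolved automatically through timm.data.resolve_model_data_config.
Batch size limit: because the input resolution is 518x518, each image uses a large amount of NPU memory, so the benchmark caps the batch size at 4. In deployment, adjust it to the available NPU memory.
Token count: forward_features() returns 1374 tokens (1 cls + 4 registers + 1369 patches), where 1369 = 37 x 37 and 37 = 518/14.
Local weight loading: the scripts use pretrained=False + load_state_dict() to load weights purely from local files, which suits offline environments.
Output artifacts: the results of inference.py, benchmark.py, and accuracy.py are all saved to the output/ directory as structured JSON data and text logs, for later automated collection.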
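Two of the notes above, the TORCH_COMPILE_DISABLE setting and the output/ artifacts, can be illustrated together. This is a sketch under assumptions: the helper save_results and the file names benchmark.json and benchmark.log are hypothetical, and the scripts may lay out their output differently.

```python
import json
import os
import tempfile
from pathlib import Path

# Set the variable before torch is imported, so it is already in place
# when torch reads its configuration.
os.environ.setdefault("TORCH_COMPILE_DISABLE", "1")


def save_results(name, results, out_dir="output"):
    """Write results as <name>.json plus a plain-text <name>.log summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{name}.json"
    log_path = out / f"{name}.log"
    json_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    lines = [f"{key}: {value}" for key, value in results.items()]
    log_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return json_path, log_path


# Example with two values from the latency table, written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    json_path, _ = save_results("benchmark", {"mean_ms": 5.23, "p99_ms": 5.31}, tmp)
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))
print(reloaded["mean_ms"])  # 5.23
```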
