facebook/dinov3-vits16-pretrain-lvd1689m on Ascend NPU

1. Introduction

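This document records the inference adaptation and validation results for facebook/dinov3-vits16-pretrain-lvd1689m on Ascend NPU. The model is a Vision Transformer foundation model (ViT-S/16 variant) released by Meta AI. It is loaded and run directly through the transformers library. A runtime monkey-patch transparently forwards torch.cuda calls to torch.npu, so the model runs on the NPU without any change to the original model code.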
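The forwarding idea can be illustrated with a minimal sketch. It is not the patch used by the scripts in this repository: the helper name `forward_namespace` and the two stand-in namespaces are hypothetical, chosen so the example runs without torch or torch_npu installed. In a real setup the source would be `torch.cuda` and the target `torch.npu` (available after `import torch_npu`).

```python
import types


def forward_namespace(src, dst, names):
    """Make selected attributes of `src` resolve to the ones on `dst`.

    Returns the names that were actually patched, so the caller can log
    which entry points now forward to the other backend.
    """
    patched = []
    for name in names:
        if hasattr(dst, name):  # skip anything the target backend lacks
            setattr(src, name, getattr(dst, name))
            patched.append(name)
    return patched


# Stand-in namespaces (hypothetical) so the sketch runs without torch installed.
fake_cuda = types.SimpleNamespace(is_available=lambda: False, device_count=lambda: 0)
fake_npu = types.SimpleNamespace(is_available=lambda: True, device_count=lambda: 2)

patched = forward_namespace(
    fake_cuda, fake_npu, ["is_available", "device_count", "synchronize"]
)
print(patched)                   # names that exist on the NPU stand-in
print(fake_cuda.is_available())  # now answered by the NPU stand-in
```

Returning the list of patched names makes it easy to notice when a later library version calls a CUDA entry point that the patch does not cover.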

2. Downloading the weights

Download the model weights using either of the following methods:

Option 1: ModelScope (recommended)

modelscope download --model facebook/dinov3-vits16-pretrain-lvd1689m --local_dir /opt/atomgit/weight/dinov3-vits16-pretrain-lvd1689m

Option 2: AtomGit

python3 -m atomgit download hf_mirrors/facebook/dinov3-vits16-pretrain-lvd1689m -d /opt/atomgit/weight/dinov3-vits16-pretrain-lvd1689m

3. Installing dependencies

Before running the validation scripts, make sure the following dependencies are installed:

pip install torch==2.9.0+cpu transformers
pip install torch-npu==2.9.0.post1

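The torch and torch-npu versions must match the installed CANN version.
transformers is used to load the model and preprocess images.
To download weights from ModelScope, additionally install: pip install modelscope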
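Before running the scripts, it can help to confirm which versions are actually installed. The snippet below is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["torch", "torch-npu", "transformers"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be compared by hand against the validation environment table in the next section.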

4. Validation environment

Component	Version
`Python`	`3.11.14`
`PyTorch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`transformers`	`4.57.6`
`CANN`	`8.0.RC2`

NPU: 2 logical devices (Ascend910)
Model path: /opt/atomgit/weight/dinov3-vits16-pretrain-lvd1689m

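5. Inference validation

Run single-image feature extraction to verify the model's basic forward pass on the NPU.

Sample image used for inference:

[Image: sample_image]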

python3 inference.py

Verification output:

[INFO] Applied NPU monkey-patch (torch.cuda -> torch.npu)
[INFO] Loading image from: /opt/atomgit/dinov3-vits16-pretrain-lvd1689m/data/val2017/000000000139.jpg
[INFO] Loading model from: /opt/atomgit/weight/dinov3-vits16-pretrain-lvd1689m
[INFO] Model moved to NPU device: Ascend910_9362
[INFO] Input shape: torch.Size([1, 3, 224, 224])
[INFO] Device: npu:0
[INFO] Warm-up inference...
[INFO] Running timed inference...

============================================================
Inference Results
============================================================
Device:        NPU (Ascend910_9362)
Latency:       8.64 ms
Pooled output shape:   torch.Size([1, 384])
Last hidden state shape: torch.Size([1, 201, 384])
Pooled output dtype:   torch.float32
Pooled output device:  npu:0
Pooled output first 5 values: [ 0.23145083  0.5774369  -0.21135539 -1.1499186   0.24691251]
============================================================

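Conclusions:

The model loads onto the NPU device successfully.
Inference latency for a single 224x224 image is about 8.64 ms.
Output feature dimensions match what is expected for ViT-S/16 (pooled [1, 384], last_hidden_state [1, 201, 384]).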
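The sequence length of 201 can be checked with simple arithmetic. The sketch below assumes the token layout is one CLS token plus 4 register tokens plus one token per 16x16 patch; the register-token count is an assumption here, not something stated in this document, and the function name is hypothetical.

```python
def expected_sequence_length(image_size=224, patch_size=16, num_register_tokens=4):
    """Token count for a square input: 1 CLS + register tokens + patch tokens."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return 1 + num_register_tokens + patches_per_side ** 2


print(expected_sequence_length())  # 1 + 4 + 14 * 14 = 201
```

Under that assumption, a 224x224 input yields 14 x 14 = 196 patch tokens, which together with the 5 extra tokens gives the 201 seen in last_hidden_state.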

6. Performance evaluation

Use benchmark.py to stress-test NPU inference latency and throughput:

python3 benchmark.py

6.1 Latency results

Metric	Value
`iterations`	`20`
`mean_ms`	`7.94 ms`
`stddev_ms`	`0.28 ms`
`min_ms`	`7.69 ms`
`max_ms`	`8.42 ms`
`p50_ms`	`7.77 ms`
`p90_ms`	`8.37 ms`
`p99_ms`	`8.42 ms`
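The fields in this table can be derived from a list of per-iteration timings. The sketch below is one way to do it and is not taken from benchmark.py: it assumes nearest-rank percentiles and the sample standard deviation, and the helper names are hypothetical. With 20 iterations, nearest-rank p99 equals the maximum, which is consistent with the table but does not confirm the method.

```python
import math
import statistics


def percentile_nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: the value at 1-indexed rank ceil(pct / 100 * n)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms):
    """Summarize per-iteration latencies (milliseconds) into the table's fields."""
    ordered = sorted(latencies_ms)
    return {
        "iterations": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        # Sample standard deviation (assumption); needs at least two points.
        "stddev_ms": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "p50_ms": percentile_nearest_rank(ordered, 50),
        "p90_ms": percentile_nearest_rank(ordered, 90),
        "p99_ms": percentile_nearest_rank(ordered, 99),
    }


# Made-up timings purely for illustration, not measurements from the NPU.
sample = [7.7, 7.8, 7.9, 8.0, 8.4]
summary = summarize_latencies(sample)
print(summary["p50_ms"], summary["p99_ms"])
```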

6.2 Throughput results

batch_size	throughput_ips	avg_latency_ms
`1`	`126.75`	`7.89`
`2`	`254.18`	`7.87`
`4`	`510.02`	`7.84`
`8`	`1006.24`	`7.95`

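Conclusions:

Mean single-image latency is stable at about 8 ms, close to ViT-B/16; the model is lighter, but NPU utilization is already fairly high.
At batch size 8, throughput reaches 1006.24 images/sec, exceeding 1,000 images per second.
Latency barely changes at small batch sizes (7.84 to 7.95 ms across batch sizes 1 to 8), which suits high-concurrency, real-time feature extraction.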
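The throughput column is consistent with dividing the batch size by the average batch latency. The sketch below shows that relationship; it is an assumption about how the figures relate, not code from benchmark.py, and small differences from the table are expected because the latencies there are rounded.

```python
def throughput_ips(batch_size, avg_latency_ms):
    """Images per second when one batch takes avg_latency_ms milliseconds."""
    return batch_size / (avg_latency_ms / 1000.0)


# (batch_size, avg_latency_ms) pairs taken from the throughput table above.
for batch_size, latency_ms in [(1, 7.89), (2, 7.87), (4, 7.84), (8, 7.95)]:
    print(batch_size, round(throughput_ips(batch_size, latency_ms), 2))
```

Because latency stays near 8 ms while the batch size doubles, throughput roughly doubles at each step.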

7. Accuracy evaluation

Use accuracy.py to compare NPU outputs against a CPU baseline. The metrics are vector-level relative error and cosine similarity:

python3 accuracy.py

Verification output:

============================================================
Accuracy Validation Results
============================================================

pooler_output:
  Shape:                  (1, 384)
  Vector Relative Error:  0.003139 (PASS)
  Cosine Similarity:      0.999995 (PASS)
  MSE:                    0.0000016848
  Max Absolute Diff:      0.003499
  Overall:                PASS

last_hidden_state:
  Shape:                  (1, 201, 384)
  Vector Relative Error:  0.001936 (PASS)
  Cosine Similarity:      0.999999 (PASS)
  MSE:                    0.0000005143
  Max Absolute Diff:      0.003881
  Overall:                PASS

============================================================
OVERALL: PASS (vector-level relative error < 1% and cosine similarity > 0.999)
============================================================

Metric	Value	Threshold	Result
Pooler output vector relative error	`0.31%`	`< 1%`	PASS
Pooler output cosine similarity	`0.999995`	`> 0.999`	PASS
Last hidden state vector relative error	`0.19%`	`< 1%`	PASS
Last hidden state cosine similarity	`0.999999`	`> 0.999`	PASS

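Conclusions:

The error between NPU outputs and the CPU baseline is below 1% for both outputs.
Cosine similarity is above 0.999 for both outputs, so the feature spaces agree very closely.
Overall accuracy validation passes, and the error level is lower than that of ViT-B/16 and ViT-L/16.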
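The two metrics can be sketched in plain Python. The definition of vector-level relative error used here (L2 norm of the difference divided by the L2 norm of the baseline) is an assumption, since this document does not spell out the formula in accuracy.py. The function names and toy vectors are illustrative only.

```python
import math


def vector_relative_error(candidate, reference):
    """L2 norm of the difference divided by the L2 norm of the reference."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    return diff / ref_norm


def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes(candidate, reference, max_rel_err=0.01, min_cosine=0.999):
    """Apply the thresholds from the results table: error < 1%, cosine > 0.999."""
    return (
        vector_relative_error(candidate, reference) < max_rel_err
        and cosine_similarity(candidate, reference) > min_cosine
    )


# Toy vectors for illustration only, not real model outputs.
cpu_out = [1.0, 2.0, 2.0]
npu_out = [1.0, 2.0, 2.003]
print(passes(npu_out, cpu_out))
```

For real outputs, the tensors would first be flattened to one vector each, for example all 201 x 384 values of last_hidden_state.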

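8. Notes

Monkey-patch adaptation: The adaptation currently relies on a runtime torch.cuda -> torch.npu monkey-patch and requires no change to the transformers source. If a later transformers release hard-codes CUDA-specific calls internally, the patch scope may need to be extended.
TORCH_COMPILE_DISABLE: The scripts set TORCH_COMPILE_DISABLE=1 to avoid compatibility problems caused by torch.compile on the NPU.
Input size constraint: DINOv3 ViT-S/16 uses a patch size of 16, so input images should be 224x224 or another size whose sides are multiples of 16; otherwise the model automatically crops to the nearest multiple.
ViT-S versus ViT-B/L: ViT-S has an embedding dimension of 384 (768 for ViT-B, 1024 for ViT-L), and its weights are only about 82 MB. Inference latency is close to ViT-B but throughput is higher, making it a preferred choice for on-device and high-concurrency scenarios.
Output artifacts: Results from inference.py, benchmark.py, and accuracy.py are all saved to the output/ directory as structured JSON data and text logs, which simplifies later automated collection.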
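Two of the notes above can be illustrated together in a short sketch. It assumes that rounding each side down to a multiple of the patch size is an acceptable way to prepare inputs; the helper name `floor_to_patch_multiple` is hypothetical and is not part of the scripts.

```python
import os

# Set before importing torch so the flag is already present when torch reads it.
os.environ.setdefault("TORCH_COMPILE_DISABLE", "1")


def floor_to_patch_multiple(height, width, patch_size=16):
    """Largest (height, width) not exceeding the input that divides evenly into patches."""
    if height < patch_size or width < patch_size:
        raise ValueError("image is smaller than one patch")
    return (height // patch_size) * patch_size, (width // patch_size) * patch_size


print(floor_to_patch_multiple(480, 640))  # both sides are already multiples of 16
print(floor_to_patch_multiple(500, 333))  # both sides are rounded down
```

Resizing or cropping to the returned size before preprocessing keeps the input shape explicit, so the patch count does not depend on implicit cropping.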
