DINOv3 ViT-B/16 on Ascend NPU

1. Introduction

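This document records the inference adaptation and verification results for facebook/dinov3-vitb16-pretrain-lvd1689m on Ascend NPUs.

DINOv3 is a vision foundation model released by Meta AI. It is based on the Vision Transformer architecture, is pretrained on a large volume of web data, and performs strongly on a range of vision tasks without fine-tuning. This verification uses PyTorch with torch_npu to run single-image inference directly on an Ascend NPU, and covers both accuracy and performance evaluation.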

2. Verification Environment

Component	Version
`transformers`	`4.57.6`
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`torchvision`	`0.24.0`
`Pillow`	`10.4.0`
`numpy`	`1.26.4`

NPU: Ascend910 × 2 logical devices
Model path: /opt/atomgit/dinov3-vitb16
OS: Linux 5.10.0 (aarch64)
CANN: 8.3.RC1

3. Quick Start

3.1 Environment Setup

# Load the CANN environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Select the visible NPU device
export ASCEND_RT_VISIBLE_DEVICES=0

# Verify that torch_npu works
python3 -c "import torch; import torch_npu; a = torch.randn(3,4).npu(); print(a + a)"

3.2 Download the Model

git clone https://gitcode.com/hf_mirrors/facebook/dinov3-vitb16-pretrain-lvd1689m.git
cd dinov3-vitb16-pretrain-lvd1689m

# If git-lfs is not installed, install it first, then pull the weights
# git lfs install && git lfs pull

3.3 Run Inference

Single-device inference (the NPU is selected automatically):

python3 inference.py --model_path . --num_runs 10

CPU vs. NPU accuracy comparison:

python3 inference.py --model_path . --compare --num_runs 10

Full performance and accuracy benchmark:

python3 benchmark.py --model_path . --num_runs 20 --output benchmark_report.json

4. Sample Inference Output

Model path: .
Input image: /tmp/dinov3_test_image.png
PyTorch version: 2.9.0+cpu
torch_npu version: 2.9.0.post1+gitee7ba04

Loading model...
Model loaded

============================================================
[CPU Baseline Inference]
============================================================

[CPU] Output keys: ['last_hidden_state', 'pooler_output']
[CPU] last_hidden_state shape: torch.Size([1, 201, 768]), dtype: torch.float32
[CPU] last_hidden_state mean: 0.001654, std: 0.448173
[CPU] pooler_output shape: torch.Size([1, 768]), dtype: torch.float32
[CPU] pooler_output mean: -0.006058, std: 0.602763

[CPU] Average latency: 648.703 ms (min=645.014, max=651.202)

============================================================
[NPU Inference]
============================================================

[NPU] Output keys: ['last_hidden_state', 'pooler_output']
[NPU] last_hidden_state shape: torch.Size([1, 201, 768]), dtype: torch.float32
[NPU] last_hidden_state mean: 0.001640, std: 0.448310
[NPU] pooler_output shape: torch.Size([1, 768]), dtype: torch.float32
[NPU] pooler_output mean: -0.006072, std: 0.602642

[NPU] Average latency: 8.787 ms (min=8.690, max=8.925)

============================================================
Accuracy Comparison (CPU vs NPU)
============================================================
last_hidden_state: max_diff=1.256609e-02, mean_diff=1.084065e-03, rel_err=3.502032e-03
pooler_output: max_diff=7.614687e-03, mean_diff=1.779583e-03, rel_err=3.754067e-03
------------------------------------------------------------
Accuracy check: PASS (relative_error < 1%)
============================================================

NPU speedup (vs CPU): 73.82x
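
The sequence length of 201 reported for last_hidden_state can be cross-checked against the input size. The sketch below assumes a 224×224 input, the 16-pixel patch size implied by the model name, one class token, and four register tokens. The register-token count is an assumption not stated in this document and should be confirmed against the model's config.json; the function name is hypothetical.

```python
def expected_sequence_length(image_size=224, patch_size=16,
                             num_cls_tokens=1, num_register_tokens=4):
    """Return the token count a ViT would emit for a square input image.

    num_register_tokens=4 is an assumed value; check it against config.json.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_cls_tokens + num_register_tokens + num_patches


if __name__ == "__main__":
    # 14 x 14 patches plus the class and register tokens
    print(expected_sequence_length())
```

Under these assumptions the result matches the [1, 201, 768] shape printed in the log.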

5. Performance Benchmark

Test conditions: a single random 224×224 image, 20 consecutive inference runs, with statistics computed from the second run onward.

Metric	CPU	NPU
`mean_ms`	`653.718 ms`	`8.756 ms`
`min_ms`	`651.502 ms`	`8.531 ms`
`max_ms`	`657.732 ms`	`9.366 ms`
`p50_ms`	`653.744 ms`	`8.598 ms`
`p99_ms`	`657.579 ms`	`9.335 ms`
`std_ms`	`1.520 ms`	`0.278 ms`
`throughput_img_per_sec`	`1.53`	`114.20`
Speedup	N/A	74.66x
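
As a rough illustration of how the metrics in this table could be derived from raw per-run latencies, here is a minimal sketch. It is not taken from benchmark.py: the function names, the linear-interpolation percentile, and the use of population standard deviation are all assumptions.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Percentile q (0..100) of a pre-sorted list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def summarize_latencies(latencies_ms, skip_first=1):
    """Summarize latencies in milliseconds, dropping the first `skip_first` runs."""
    samples = sorted(latencies_ms[skip_first:])
    if not samples:
        raise ValueError("no samples left after skipping")
    mean_ms = statistics.fmean(samples)
    return {
        "mean_ms": mean_ms,
        "min_ms": samples[0],
        "max_ms": samples[-1],
        "p50_ms": percentile(samples, 50),
        "p99_ms": percentile(samples, 99),
        "std_ms": statistics.pstdev(samples),  # population std (assumed)
        "throughput_img_per_sec": 1000.0 / mean_ms,  # batch size 1
    }
```

The speedup row is the ratio of the two mean_ms values, 653.718 / 8.756 ≈ 74.66, and the throughput row is 1000 / mean_ms for a batch size of 1.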

6. Accuracy Evaluation

The NPU output is compared against the CPU baseline output tensor by tensor, with a relative error below 1% as the pass criterion.

Tensor	Shape	Max Diff	Mean Diff	Relative Error	Pass (< 1%)
`last_hidden_state`	`[1, 201, 768]`	`1.26e-02`	`1.08e-03`	`0.35%`	✅
`pooler_output`	`[1, 768]`	`7.61e-03`	`1.78e-03`	`0.38%`	✅

Accuracy conclusion: PASS (relative error < 1% for all tensors)
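
The three metrics in the table can be sketched as follows. This document does not state how relative error is defined, so the sketch assumes the L2 norm of the difference divided by the L2 norm of the CPU reference. The function name is hypothetical, and the inputs are flattened lists of floats so the example runs without torch.

```python
import math


def compare_outputs(reference, candidate, threshold=0.01):
    """Compare two flattened outputs; `reference` plays the role of the CPU baseline.

    Relative error here is ||candidate - reference||_2 / ||reference||_2,
    which is an assumed definition, not one confirmed by the scripts.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(c - r) for r, c in zip(reference, candidate)]
    diff_norm = math.sqrt(sum(d * d for d in diffs))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    rel_err = diff_norm / max(ref_norm, 1e-12)  # guard against an all-zero reference
    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "rel_err": rel_err,
        "passed": rel_err < threshold,
    }
```

With real tensors, the same quantities could be computed by flattening each output first, for example with `tensor.flatten().tolist()`.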

7. Notes

git-lfs: In the GitCode mirror, model.safetensors is stored with Git LFS. After cloning, be sure to run git lfs pull to fetch the actual weights (about 327 MB).
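trust_remote_code: transformers 4.57.6 supports the dinov3_vit architecture natively, so AutoModel.from_pretrained should not need custom remote code to load it. Passing trust_remote_code=True is harmless but should not be required.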
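Device synchronization: When timing NPU inference, call torch.npu.synchronize() right before starting and right after stopping the timer. NPU work is dispatched asynchronously, so without synchronization the measured latency comes out too low.
Operator compatibility: The model consists entirely of standard Transformer operators (Conv2d patch embedding, LayerNorm, GELU MLP, self-attention), so no custom CUDA operators need to be adapted and it runs on the NPU directly.
Accuracy: fp32 inference on the NPU differs slightly from the CPU in numerical results (relative error < 0.4%). This is within the normal range and does not affect downstream use.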
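
The device-synchronization note can be illustrated with a small timing helper. This is a sketch under stated assumptions rather than the code in inference.py or benchmark.py: the helper name and its parameters are hypothetical, and the synchronization function is passed in as a callable so the example also runs on a machine without an NPU.

```python
import time


def time_inference(run_once, synchronize=lambda: None, num_runs=10, warmup=1):
    """Return per-run latencies in milliseconds.

    run_once:    zero-argument callable that performs one forward pass.
    synchronize: callable that blocks until queued device work has finished;
                 the default no-op is only appropriate for CPU execution.
    """
    for _ in range(warmup):
        run_once()
    synchronize()  # make sure warm-up work is finished before timing starts

    latencies_ms = []
    for _ in range(num_runs):
        synchronize()  # drain pending work so it is not billed to this run
        start = time.perf_counter()
        run_once()
        synchronize()  # wait for the device before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

On an Ascend device one would pass `synchronize=torch.npu.synchronize` after importing torch and torch_npu, with `run_once` wrapping the model call inside `torch.no_grad()`.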

8. Related Files

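File	Description
`inference.py`	Single-image inference script; supports CPU/NPU and accuracy comparison
`benchmark.py`	Performance and accuracy benchmark script; writes a JSON report
`benchmark_report.json`	Full benchmark data from this verification
`inference_compare.log`	CPU vs. NPU inference comparison log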