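A DINOv3 ViT model image feature encoder. Distilled on the LVD-1689M dataset from the DINOv3 ViT-7B model.

In timm, the QKV bias is disabled for these models (qkv_bias=False) and the all-zero bias weights are not loaded. For some model sizes there are variants with qkvb in the name that enable the bias (qkv_bias=True); the bias values are still zero, which matches the behaviour of transformers and the original models.

The original models keep the RoPE periods in a bfloat16 buffer, whereas timm generates float32 periods at init. This causes some numerical differences, but the timm approach is less problematic on devices without bfloat16 support and works as well as, if not slightly better than, the original for fine-tuning. Running model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32) truncates the periods to bfloat16 and yields matching outputs.

from urllib.request import urlopen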
from PIL import Image
import timm
import torch  # torch.topk is used at the end of this example
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model('vit_small_patch16_dinov3.lvd1689m', pretrained=True)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
'vit_small_patch16_dinov3.lvd1689m',
pretrained=True,
features_only=True,
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 384, 16, 16])
    #  torch.Size([1, 384, 16, 16])
    #  torch.Size([1, 384, 16, 16])
    print(o.shape)

from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
'vit_small_patch16_dinov3.lvd1689m',
pretrained=True,
num_classes=0, # remove classifier nn.Linear
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # output is (batch_size, num_features) shaped tensor
# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 261, 384) shaped tensor
output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

For details on the evaluation protocols, see the associated paper.


IN-ReaL, IN-R, Obj.Net and Ox.-H are global tasks; ADE20k, NYU↓, DAVIS, NAVI and SPair are dense tasks.

| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|---|---|---|---|---|---|---|---|---|---|
| DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
| DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 |
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |
| DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 |
| DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
| DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 |

IN-ReaL, IN-R and Obj.Net (each at 256px and 512px) are global tasks; ADE20k and NYU↓ are dense tasks.

| Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ |
|---|---|---|---|---|---|---|---|---|
| DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 |
| DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 |
| DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 |
| DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 |

| Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | Mean |
|---|---|---|---|---|---|---|---|
| DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 |
| DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 |

| Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | Mean |
|---|---|---|---|---|---|---|---|
| DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 |
| DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |

@article{simeoni2025dinov3,
title={DINOv3},
author={Sim{\'e}oni, Oriane and Vo, Huy V and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and others},
journal={arXiv preprint arXiv:2508.10104},
year={2025}
}

@article{dosovitskiy2020vit,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
journal={ICLR},
year={2021}
}

@misc{rw2019timm,
author = {Ross Wightman},
title = {PyTorch Image Models},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
doi = {10.5281/zenodo.4414861},
howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

This section records the adaptation and validation of vit_small_patch16_dinov3.lvd1689m on a Huawei Ascend910B4 NPU.

| Component | Version |
|---|---|
| torch | 2.9.0+cpu |
| torch-npu | 2.9.0 |
| timm | 1.0.27 |
| transformers | 4.57.6 |
| CANN | 8.5.1 |
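
Before reproducing the results, it can help to confirm that the local environment matches the versions listed in the table. The following is a minimal sketch using only the standard library. The helper names `installed_versions` and `report` are hypothetical, the pip distribution names are assumed to match the component names in the table, and CANN is left out on the assumption that it is installed outside pip.

```python
from importlib import metadata

# Versions copied from the environment table.
# CANN is omitted: it is assumed here to be installed outside pip.
EXPECTED = {
    "torch": "2.9.0+cpu",
    "torch-npu": "2.9.0",
    "timm": "1.0.27",
    "transformers": "4.57.6",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected):
    """Return one status line per package, comparing installed and expected."""
    lines = []
    for name, have in installed_versions(expected).items():
        want = expected[name]
        status = "ok" if have == want else "differs"
        lines.append(f"{name}: installed={have} expected={want} [{status}]")
    return lines


if __name__ == "__main__":
    print("\n".join(report(EXPECTED)))
```

A mismatch does not necessarily mean the model will fail to run; it only signals that numbers may differ from those recorded here.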
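
Hardware: 1 logical card (Ascend910B4)

Date: 2026-05-09

Because timm pulls weights from the HuggingFace Hub by default, setting HF_ENDPOINT to a mirror site in mainland China is recommended:
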
export HF_ENDPOINT=https://hf-mirror.com

Example code for loading the model and moving it to the NPU:

import torch
import torch_npu
import timm
device = torch.device("npu:0")
torch.npu.set_device(device)
model = timm.create_model(
"vit_small_patch16_dinov3.lvd1689m",
pretrained=True
)
model = model.to(device).eval()
# Get model preprocessing config
data_config = timm.data.resolve_model_data_config(model)
input_size = data_config["input_size"]  # (3, 256, 256)

The model runs on the NPU as-is, with no operator replacement or code changes. timm's vision transformer implementation is fully compatible with torch_npu.

import torch
import torch_npu
import timm
device = torch.device("npu:0")
model = timm.create_model("vit_small_patch16_dinov3.lvd1689m", pretrained=True)
model = model.to(device).eval()
dummy_input = torch.randn(1, 3, 256, 256).to(device)
with torch.no_grad():
    output = model(dummy_input)
print(f"Output shape: {output.shape}")  # torch.Size([1, 384])
print(f"Output range: [{output.min():.4f}, {output.max():.4f}]")

Validation results:

Output shape: [1, 384] (as expected)

Output range: [-1.1397, 2.0712]

-0.0053

Using the same weights and input, the NPU output is compared against the CPU output:

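| Metric | Value |
|---|---|
| Max absolute error | 0.002358 |
| Mean absolute error | 0.000584 |

Conclusion: the NPU output stays within 1e-2 of the CPU baseline (max absolute error about 2.4e-3), which meets the precision requirements of visual feature extraction.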
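
The two metrics in the comparison table can be computed as in the sketch below. It is not the validation script itself: `abs_error_stats` is a hypothetical helper, and the inputs are toy values standing in for flattened CPU and NPU outputs.

```python
def abs_error_stats(reference, candidate):
    """Return (max, mean) absolute error between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return max(diffs), sum(diffs) / len(diffs)


# Toy values only; these are NOT the model outputs reported in this document.
cpu_out = [0.10, -0.50, 1.25, 2.00]
npu_out = [0.11, -0.50, 1.24, 2.02]

max_err, mean_err = abs_error_stats(cpu_out, npu_out)
print(f"max abs error:  {max_err:.6f}")
print(f"mean abs error: {mean_err:.6f}")

# With torch tensors a and b of the same shape, the equivalent expressions are
# (a - b).abs().max() and (a - b).abs().mean().
```

Both tensors should be moved to the same device and dtype before subtracting, otherwise the comparison itself introduces conversion differences.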

On a single Ascend910B4 card, inference throughput and latency were measured in fp32 (preprocessing excluded).

| Batch size | Throughput (samples/s) | Latency (ms/batch) |
|---|---|---|
| 1 | 30.89 | 32.38 |
| 4 | 233.97 | 17.10 |
| 8 | 463.21 | 17.27 |
| 16 | 927.35 | 17.25 |
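
Test notes:

Input size: 3 x 256 x 256

Warm-up iterations: 10

Timed iterations: 100

Synchronization: torch.npu.synchronize()

Batch size 16 gives the best throughput (about 927 samples/s), which works out to about 1.1 ms per image (17.25 ms / 16).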
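
The relation between the latency and throughput columns, and the warm-up-then-time measurement pattern, can be sketched as follows. This is an illustration, not the repository's script: `benchmark` and `throughput_from_latency` are hypothetical helpers, the defaults of 10 warm-up and 100 timed iterations are an assumed reading of the test notes, and `sync` is a placeholder hook where a device barrier such as torch.npu.synchronize would be passed.

```python
import time


def benchmark(step, batch_size, warmup=10, iters=100, sync=lambda: None):
    """Time step() and return (throughput in samples/s, latency in ms/batch).

    sync is called before starting and before stopping the clock so that
    asynchronous device work is included; the default is a no-op.
    """
    for _ in range(warmup):
        step()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    throughput = batch_size * iters / elapsed
    return throughput, latency_ms


def throughput_from_latency(batch_size, latency_ms):
    """Samples per second implied by a per-batch latency in milliseconds."""
    return batch_size / (latency_ms / 1000.0)


# Cross-check the batch-16 row of the table: 17.25 ms per batch of 16.
print(f"{throughput_from_latency(16, 17.25):.1f} samples/s")
print(f"{17.25 / 16:.2f} ms per image")

# Example run with a CPU-only stand-in workload.
tput, lat = benchmark(lambda: sum(range(10_000)), batch_size=4, warmup=2, iters=20)
print(f"stand-in workload: {tput:.1f} samples/s, {lat:.3f} ms/batch")
```

The small gap between the implied and the tabulated throughput comes from rounding the latency to two decimals.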
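
The complete script used for this validation ships with the repository:

python validate_npu.py

The script runs the validation steps recorded in this section.

Set export HF_ENDPOINT=https://hf-mirror.com before running; otherwise timm times out when downloading weights from the official HuggingFace site.

When importing torch_npu, a message such as can not create directory: /home/atomgit/ascend/log may appear. It can be ignored and does not affect inference; creating the directory manually removes the message.

The model takes 256 x 256 input, unlike the 224 x 224 of a standard ViT. Mind the input_size parameter when using it.

This validation uses fp32. For further speed-up, mixed-precision inference with torch.npu.set_amp_dtype(torch.float16) can be tried, but the accuracy loss then needs separate validation.

| File | Description |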
|---|---|
| validate_npu.py | NPU adaptation validation script |
| README.md | Adaptation notes (this document) |
| validation_output.txt | Raw output log of the validation script |
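
The 256 x 256 input size also accounts for the shapes quoted in the usage examples: a 16 x 16 feature map and a (1, 261, 384) unpooled output. The arithmetic can be sketched as below. `vit_token_count` is a hypothetical helper, and the prefix of 1 class token plus 4 register tokens is an assumption about this checkpoint that is not stated in this document; it is chosen because it reproduces the 261 tokens reported above.

```python
def vit_token_count(image_size, patch_size, num_prefix_tokens):
    """Number of tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid + num_prefix_tokens


# Assumption: 1 class token + 4 register tokens precede the patch tokens.
PREFIX_TOKENS = 1 + 4

print(256 // 16)                                # 16 patches per side
print(vit_token_count(256, 16, PREFIX_TOKENS))  # 261
print(vit_token_count(224, 16, PREFIX_TOKENS))  # 201
```

Feeding 224 x 224 input would therefore change the token count and the feature map size, which is why the input_size from the model's data config should be used.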