A Vision Transformer (ViT) image feature model with registers. Pretrained on LVD-142M with the self-supervised DINOv2 method.
from urllib.request import urlopen
from PIL import Image
import torch  # required for torch.topk below
import timm
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model('vit_small_patch14_reg4_dinov2.lvd142m', pretrained=True)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

from urllib.request import urlopen
from PIL import Image
import timm
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
model = timm.create_model(
'vit_small_patch14_reg4_dinov2.lvd142m',
pretrained=True,
num_classes=0, # remove classifier nn.Linear
)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
output = model(transforms(img).unsqueeze(0)) # output is (batch_size, num_features) shaped tensor
# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1374, 384) shaped tensor
output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

Explore the dataset and runtime metrics of this model in the timm model results.
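The `(1, 1374, 384)` shape of the unpooled output in the embeddings example can be checked with a little arithmetic. The sketch below is only an illustration: `vit_token_count` is a hypothetical helper, not part of timm, and it assumes the prefix tokens are one class token plus the four register tokens suggested by `reg4` in the model name.

```python
def vit_token_count(image_size: int, patch_size: int, num_prefix_tokens: int) -> int:
    """Number of tokens a ViT produces for a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + num_prefix_tokens


# 518 / 14 = 37 patches per side, so 37 * 37 = 1369 patch tokens.
# Assumed prefix: 1 class token + 4 register tokens.
tokens = vit_token_count(image_size=518, patch_size=14, num_prefix_tokens=1 + 4)
print(tokens)  # 1374
```

The last dimension, 384, is the embedding width of the small ViT variant and matches the `[1, 384]` pooled output reported later in this document.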
@article{darcet2023vision,
title={Vision Transformers Need Registers},
author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
journal={arXiv preprint arXiv:2309.16588},
year={2023}
}

@misc{oquab2023dinov2,
title={DINOv2: Learning Robust Visual Features without Supervision},
author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
journal={arXiv:2304.07193},
year={2023}
}

@article{dosovitskiy2020vit,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
journal={ICLR},
year={2021}
}

@misc{rw2019timm,
author = {Ross Wightman},
title = {PyTorch Image Models},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
doi = {10.5281/zenodo.4414861},
howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

This section documents the adaptation and validation of vit_small_patch14_reg4_dinov2.lvd142m on the Huawei Ascend910B4 NPU.

| Component | Version |
|---|---|
| torch | 2.9.0+cpu |
| torch-npu | 2.9.0 |
| timm | 1.0.27 |
| transformers | 4.57.6 |
| CANN | 8.5.1 |

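- Hardware: 1 logical card (Ascend910B4)
- Date: 2026-05-09

Because timm pulls weights from the HuggingFace Hub by default, setting HF_ENDPOINT to a mirror site in mainland China is recommended: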
export HF_ENDPOINT=https://hf-mirror.com

Example code for loading the model and moving it to the NPU:
import torch
import torch_npu
import timm
device = torch.device("npu:0")
torch.npu.set_device(device)
model = timm.create_model(
"vit_small_patch14_reg4_dinov2.lvd142m",
pretrained=True
)
model = model.to(device).eval()
# Get model preprocessing config
data_config = timm.data.resolve_model_data_config(model)
input_size = data_config["input_size"] # (3, 518, 518)

The model runs directly on the NPU without any operator replacement or code changes. timm's VisionTransformer implementation is fully compatible with torch_npu.
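When the same script also has to run on machines without an Ascend card, the device string can be chosen at runtime. The following is a minimal sketch under stated assumptions: `pick_device` is a hypothetical helper, and it only checks whether the `torch_npu` package is installed, which does not by itself prove that a usable NPU is present.

```python
import importlib.util


def pick_device(preferred: str = "npu:0", fallback: str = "cpu") -> str:
    """Return the NPU device string only when torch_npu is installed (hypothetical helper)."""
    # find_spec reports whether the package can be located; treat the result
    # as a hint, not as a guarantee that the hardware is available.
    has_torch_npu = importlib.util.find_spec("torch_npu") is not None
    return preferred if has_torch_npu else fallback


device_name = pick_device()
print(device_name)
```

The returned string can then be passed to `torch.device(...)` in place of the hard-coded `"npu:0"`.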
import torch
import torch_npu
import timm
device = torch.device("npu:0")
model = timm.create_model("vit_small_patch14_reg4_dinov2.lvd142m", pretrained=True)
model = model.to(device).eval()
dummy_input = torch.randn(1, 3, 518, 518).to(device)
with torch.no_grad():
    output = model(dummy_input)
print(f"Output shape: {output.shape}") # torch.Size([1, 384])
print(f"Output range: [{output.min():.4f}, {output.max():.4f}]")

Validation results:
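- Output shape: [1, 384] (as expected)
- Output range: [-4.2467, 6.4520]
- Additional reported statistic: -0.0271

Using the same weights and input, the NPU and CPU outputs were compared: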
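
| Metric | Value |
|---|---|
| Maximum absolute difference | 0.018992 |
| Mean absolute difference | 0.005008 |

Conclusion: the floating-point difference between NPU inference output and the CPU baseline is on the order of 1e-2 (maximum about 0.019), which meets the precision requirements of visual feature extraction tasks.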
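The two metrics in the comparison can be computed as in the sketch below. It is an illustration, not the validation script itself: `abs_diff_stats` is a hypothetical helper, and the input values are toy numbers standing in for flattened model outputs.

```python
from typing import Sequence, Tuple


def abs_diff_stats(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (maximum absolute difference, mean absolute difference)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return max(diffs), sum(diffs) / len(diffs)


# Toy values standing in for flattened outputs, which could be obtained with
# e.g. cpu_out.flatten().tolist() and npu_out.cpu().flatten().tolist().
cpu_values = [0.5, -1.25, 2.0]
npu_values = [0.5, -1.0, 2.5]
max_diff, mean_diff = abs_diff_stats(cpu_values, npu_values)
print(max_diff, mean_diff)  # 0.5 0.25
```

With real tensors the same quantities are `(a - b).abs().max()` and `(a - b).abs().mean()` once both tensors are on the same device.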
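On a single Ascend910B4 card, inference throughput and latency were measured at fp32 precision (preprocessing excluded).

| Batch size | Throughput (samples/s) | Latency (ms/batch) |
|---|---|---|
| 1 | 94.48 | 10.58 |
| 4 | 254.01 | 15.75 |
| 8 | 267.66 | 29.89 |
| 16 | 253.84 | 63.03 |

Test notes: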
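- Input size: 3 x 518 x 518
- Warm-up iterations: 10
- Timed iterations: 100
- Synchronization: torch.npu.synchronize()

The data show that a batch size of 8 gives the best throughput (about 267 samples/s), which corresponds to an amortized per-image latency of about 3.7 ms.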
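A timing loop of the kind described here can be sketched as follows. This is a hedged illustration and not the contents of validate_npu.py: `benchmark` is a hypothetical helper, the default counts of 10 warm-up and 100 timed iterations are assumptions, and the workload is a CPU stand-in so the sketch runs anywhere.

```python
import time
from typing import Callable, Tuple


def benchmark(step: Callable[[], object],
              sync: Callable[[], None],
              batch_size: int,
              warmup: int = 10,
              iters: int = 100) -> Tuple[float, float]:
    """Return (samples per second, milliseconds per batch) for `step`."""
    for _ in range(warmup):  # untimed warm-up runs
        step()
    sync()  # drain queued device work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # wait for the last batch to finish before stopping the clock
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    throughput = batch_size * iters / elapsed
    return throughput, latency_ms


# Stand-in workload. On an NPU, `step` would run the model under
# torch.no_grad() and `sync` would be torch.npu.synchronize.
throughput, latency_ms = benchmark(lambda: sum(range(1000)), lambda: None, batch_size=8)
print(f"{throughput:.1f} samples/s, {latency_ms:.4f} ms/batch")
```

Throughput and latency are tied by throughput = batch size / latency. The table rows are consistent with this relation: for batch size 8, 8 / 0.02989 s is about 267.6 samples/s.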
The complete script used for this validation is provided with the repository:
python validate_npu.py

The script includes the following:
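
Notes:
- Set export HF_ENDPOINT=https://hf-mirror.com before downloading. Otherwise, timm times out when downloading weights from the official HuggingFace site.
- With torch_npu, the message can not create directory: /home/atomgit/ascend/log may appear. It can be ignored and does not affect inference. To silence it, create the directory manually.
- The input size is 518 x 518, unlike the 224 x 224 of a standard ViT. Mind the input_size parameter when using the model.
- The results above use fp32 precision. For further speedup, mixed-precision inference via torch.npu.set_amp_dtype(torch.float16) can be tried, but the resulting precision loss needs separate validation.

| File | Description |
|---|---|
| model.safetensors | Model weights (about 85 MB) |
| validate_npu.py | NPU adaptation validation script |
| README.md | Adaptation document (this document) |
| validation_output.txt | Raw output log of the validation script |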
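The HF_ENDPOINT note can also be applied from inside a Python script instead of the shell. The sketch below assumes that huggingface_hub reads the variable when it is first imported, so the assignment is placed before any import of timm. If that assumption does not hold for the installed version, the shell export remains the safer route.

```python
import os

# Assumption: the endpoint is read at import time, so this line must run
# before timm (and therefore huggingface_hub) is imported.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
print(os.environ["HF_ENDPOINT"])
```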