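A Vision Transformer (ViT) image feature model with registers, pretrained on LVD-142M with the self-supervised DINOv2 method.

This document records the adaptation and validation results for vit_small_patch14_reg4_dinov2.lvd142m on the Huawei Ascend NPU (Ascend910B4).

| Component | Version |
|---|---|
| CANN | 8.5.1 |
| torch | 2.9.0+cpu |
| torch-npu | 2.9.0.post1+gitee7ba04 |
| timm | 1.0.27 |
| transformers | 4.57.6 |
| safetensors | >=0.4 |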
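
Test setup: Ascend910B4 (1 logical card), weights at ./model.safetensors, input size 518 x 518, output feature dimension 384.

Install the dependencies:

    pip install torch torch_npu timm pillow safetensors numpy

Note: torch_npu must match the installed CANN version; follow the official Ascend documentation when installing it.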
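
As a quick sanity check before running the scripts, the installed package versions can be compared with the tested versions from the table above. The sketch below uses only the standard library. The `TESTED_VERSIONS` mapping and the helper names are illustrative and not part of this repository, the distribution names are assumed to match the table, and treating an exact string match as "same environment" is a simplifying assumption.

```python
from importlib import metadata

# Versions copied from the table above. CANN is not a Python package,
# so it is not covered by this check.
TESTED_VERSIONS = {
    "torch": "2.9.0+cpu",
    "torch-npu": "2.9.0.post1+gitee7ba04",
    "timm": "1.0.27",
    "transformers": "4.57.6",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(tested=TESTED_VERSIONS):
    """Map each package to (tested version, installed version, exact match)."""
    report = {}
    for name, wanted in tested.items():
        found = installed_version(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, same) in compare_environment().items():
        status = "same" if same else "differs"
        print(f"{name}: tested={wanted} installed={found} ({status})")
```

A version that differs from the table does not necessarily mean the scripts will fail; it only signals that the results recorded here were obtained in a different environment.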

Clone the repository:

    git clone https://gitcode.com/hf_mirrors/timm/vit_small_patch14_reg4_dinov2.lvd142m.git
    cd vit_small_patch14_reg4_dinov2.lvd142m

If model.safetensors is a git-lfs pointer file, download the weights as follows:

    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="timm/vit_small_patch14_reg4_dinov2.lvd142m",
        filename="model.safetensors",
        local_dir=".",
        local_dir_use_symlinks=False,
    )

Run inference from the command line:

    python3 inference.py --device npu --image your_image.jpg

Or call the model directly from Python:


    import torch
    import torch_npu  # registers the "npu" device with PyTorch
    import timm
    from safetensors.torch import load_file
    from PIL import Image

    # Load the model without a classifier head and read weights from the local file
    model = timm.create_model(
        "vit_small_patch14_reg4_dinov2.lvd142m",
        pretrained=False,
        num_classes=0,
    )
    state_dict = load_file("model.safetensors")
    model.load_state_dict(state_dict, strict=False)
    model = model.eval().to("npu")

    # Prepare the input with the model's own preprocessing configuration
    data_config = timm.data.resolve_model_data_config(model)
    transforms = timm.data.create_transform(**data_config, is_training=False)
    img = Image.open("your_image.jpg").convert("RGB")
    img_tensor = transforms(img).unsqueeze(0).to("npu")

    # Inference
    with torch.no_grad():
        features = model(img_tensor)
    print(features.shape)  # torch.Size([1, 384])

The following is the actual output log (inference.log) from running inference on the Ascend910B4:


    Device: npu
    NPU Name: Ascend910B4
    Loading model...
    Input size: (3, 518, 518)
    Input tensor shape: torch.Size([1, 3, 518, 518])
    Output shape: torch.Size([1, 384])
    Output dtype: torch.float32
    Output device: npu:0
    Latency (ms): min=11.119, max=11.637, avg=11.307
    Throughput: 88.44 images/sec

Conclusion: the model runs inference correctly on the NPU. The output shape is [1, 384], the dtype is float32, the device is npu:0, and the average single-image latency is about 11.3 ms.
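
The throughput in the log is consistent with the average latency if throughput is computed as images per batch divided by average latency. The following sketch parses the latency line and recomputes the figure. The function names are illustrative, and the formula is an assumption about how inference.py derives the number rather than something confirmed from the script.

```python
import re

# Latency line copied verbatim from inference.log above.
LOG_LINE = "Latency (ms): min=11.119, max=11.637, avg=11.307"


def parse_latency(line):
    """Extract the min/max/avg latency values (milliseconds) from a log line."""
    pairs = re.findall(r"(min|max|avg)=([0-9.]+)", line)
    return {key: float(value) for key, value in pairs}


def throughput_from_latency(avg_ms, batch_size=1):
    """Images per second, assuming one batch completes every avg_ms milliseconds."""
    return batch_size * 1000.0 / avg_ms


stats = parse_latency(LOG_LINE)
throughput = throughput_from_latency(stats["avg"])
print(f"{throughput:.2f} images/sec")  # 88.44 images/sec
```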

To reproduce the run:

    python3 inference.py --device npu

The accuracy check compares the output feature vectors from CPU and NPU for the same input and combines several metrics into one verdict:

    python3 accuracy_benchmark.py

| Metric | Value |
|---|---|
| max_absolute_error | 0.016564 |
| mean_absolute_error | 0.004838 |
| max_relative_error | 0.9488 |
| mean_relative_error | 1.736% |
| median_relative_error | 0.449% |
| mean_relative_error_trimmed | 0.709% |
| cosine_similarity | 0.999990 |
| rmse | 0.006102 |

CPU output sample: tensor([-1.9629, 0.0178, 1.6286, 2.0247, 1.0157])

NPU output sample: tensor([-1.9603, 0.0233, 1.6169, 2.0112, 1.0147], device='npu:0')
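
To make the metric names concrete, the sketch below computes four of them on the five sample values shown above, using only the standard library. The formulas are the conventional definitions and are an assumption about what accuracy_benchmark.py does. Because only five of the 384 feature values are used, the numbers will not match the table.

```python
import math

# First five feature values from the CPU and NPU output samples above.
cpu = [-1.9629, 0.0178, 1.6286, 2.0247, 1.0157]
npu = [-1.9603, 0.0233, 1.6169, 2.0112, 1.0147]


def compare(reference, candidate):
    """Return common agreement metrics between two equally long vectors."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    return {
        "max_absolute_error": max(diffs),
        "mean_absolute_error": sum(diffs) / len(diffs),
        "rmse": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "cosine_similarity": dot / (norm_ref * norm_cand),
    }


metrics = compare(cpu, npu)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Cosine similarity is insensitive to a uniform scaling of the output, which is one reason to report it alongside absolute and relative errors rather than on its own.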
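
| Check | Threshold | Actual | Result |
|---|---|---|---|
| Trimmed mean relative error | < 1% | 0.7095% | ✅ Pass |
| Cosine similarity | > 0.999 | 0.999990 | ✅ Pass |
| Mean absolute error | < 0.01 | 0.004838 | ✅ Pass |

Overall verdict: ✅ Pass. NPU inference accuracy is within the acceptable range and closely matches the CPU output.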
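
The pass/fail rule in the table above can be sketched as a list of threshold checks. The thresholds and measured values are copied from the table. Requiring every check to pass for the overall verdict is an assumption about accuracy_benchmark.py, and the names `CHECKS` and `passes` are illustrative.

```python
# (check name, comparison, threshold, measured value), copied from the table above.
CHECKS = [
    ("trimmed mean relative error (%)", "<", 1.0, 0.7095),
    ("cosine similarity", ">", 0.999, 0.999990),
    ("mean absolute error", "<", 0.01, 0.004838),
]


def passes(comparison, threshold, value):
    """Evaluate a single strict threshold comparison."""
    if comparison == "<":
        return value < threshold
    if comparison == ">":
        return value > threshold
    raise ValueError(f"unsupported comparison: {comparison!r}")


results = {name: passes(op, limit, value) for name, op, limit, value in CHECKS}
overall = all(results.values())  # assumed rule: every check must pass

for name, ok in results.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
print(f"overall: {'pass' if overall else 'fail'}")
```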

perf_benchmark.py measures inference latency and throughput on the NPU at several batch sizes, with 10 warmup iterations and 50 timed repeats.

    python3 perf_benchmark.py --batch-sizes 1 2 4 8

| Batch size | Avg latency (ms) | P50 latency (ms) | P90 latency (ms) | Throughput (images/s) |
|---|---|---|---|---|
| 1 | 11.119 | 11.082 | 11.446 | 89.94 |
| 2 | 10.908 | 10.893 | 11.076 | 183.35 |
| 4 | 16.070 | 16.066 | 16.112 | 248.91 |
| 8 | 29.971 | 29.955 | 30.057 | 266.92 |
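
The throughput column agrees with the average latency column under the relation throughput = batch size x 1000 / average latency (ms). The sketch below recomputes throughput and per-image latency from the table rows. That the benchmark script derives throughput this way is an assumption, and the names used here are illustrative.

```python
# Rows copied from the table above:
# batch size -> (average latency in ms, reported throughput in images/s).
RESULTS = {
    1: (11.119, 89.94),
    2: (10.908, 183.35),
    4: (16.070, 248.91),
    8: (29.971, 266.92),
}


def throughput(batch_size, avg_latency_ms):
    """Images per second if one batch completes every avg_latency_ms."""
    return batch_size * 1000.0 / avg_latency_ms


def per_image_latency(batch_size, avg_latency_ms):
    """Average time spent per image within a batch, in milliseconds."""
    return avg_latency_ms / batch_size


for batch_size, (latency_ms, reported) in RESULTS.items():
    print(
        f"batch={batch_size}: "
        f"throughput={throughput(batch_size, latency_ms):.2f} (reported {reported}), "
        f"per-image latency={per_image_latency(batch_size, latency_ms):.2f} ms"
    )
```

Per-image latency keeps falling as the batch grows, while the latency of a single request rises, so the two should be read separately when choosing a batch size.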
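
| File | Description |
|---|---|
| inference.py | Basic inference script, supports CPU and NPU |
| accuracy_benchmark.py | Accuracy evaluation script (CPU vs. NPU), with automatic pass/fail verdict |
| perf_benchmark.py | Performance benchmark script (multiple batch sizes) |
| accuracy_benchmark.log | Accuracy evaluation run log |
| perf_benchmark.log | Performance evaluation run log |
| inference.log | Inference smoke-test log (evidence of correct inference output) |
| accuracy_result.pt | Detailed accuracy results (PyTorch format) |
| perf_result.json | Detailed performance results (JSON format) |
| fusion_result.json | Operator fusion log |
| config.json | Model configuration |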
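
Local weight loading: because of network restrictions, the scripts load weights directly from the local model.safetensors with safetensors.torch.load_file, which avoids depending on online downloads from HuggingFace.

Double-precision conversion warning: the NPU does not currently support double precision, so some internal operations are automatically converted to float. The effect on the final output accuracy is bounded (trimmed mean relative error < 1%).

Fixed input size: the model configuration sets fixed_input_size=true, so inputs must be exactly 518 x 518. The preprocessing parameters are mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225], interpolation = bicubic, and crop_pct = 1.0.

Batch size selection: in the performance data, batch size 2 has the lowest per-batch latency (10.908 ms, about 5.45 ms per image), while batch size 8 has the lowest per-image latency (about 3.75 ms) and the highest total throughput (266.92 images/s). Choose according to the deployment scenario.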
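
The mean and std values above are the standard ImageNet channel statistics. As a minimal illustration of what they do, the sketch below normalizes a single 8-bit RGB pixel, assuming the usual convention of first scaling to [0, 1] and then applying (x - mean) / std per channel. It does not cover the bicubic resize to 518 x 518, and `normalize_pixel` is an illustrative helper, not a timm function.

```python
# Per-channel statistics copied from the preprocessing parameters above (R, G, B).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB triple to [0, 1], then normalize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print("black:", [round(v, 4) for v in black])
print("white:", [round(v, 4) for v in white])
```

In practice the transform returned by timm.data.create_transform applies these steps to the whole image tensor, so this helper is only useful for checking individual values by hand.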

To push inference performance further, try enabling parallel task-queue dispatch on the Ascend NPU:

    export TASK_QUEUE_ENABLE=2
    python3 inference.py --device npu

This optimization targets host-bound workloads and can reduce CPU scheduling overhead.

If large batch sizes run into memory bottlenecks, try tuning the NPU memory allocator:

    export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:512
    python3 inference.py --device npu --batch-size 16

Inference check: output shape [1, 384], inference runs normally ✅

Image classification with timm:

    from urllib.request import urlopen
    from PIL import Image
    import timm
    import torch  # needed for torch.topk below

    img = Image.open(urlopen(
        'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
    ))

    model = timm.create_model('vit_small_patch14_reg4_dinov2.lvd142m', pretrained=True)
    model = model.eval()

    # get model specific transforms (normalization, resize)
    data_config = timm.data.resolve_model_data_config(model)
    transforms = timm.data.create_transform(**data_config, is_training=False)

    output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

    top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Image embeddings with timm:

    from urllib.request import urlopen
    from PIL import Image
    import timm

    img = Image.open(urlopen(
        'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
    ))

    model = timm.create_model(
        'vit_small_patch14_reg4_dinov2.lvd142m',
        pretrained=True,
        num_classes=0,  # remove classifier nn.Linear
    )
    model = model.eval()

    # get model specific transforms (normalization, resize)
    data_config = timm.data.resolve_model_data_config(model)
    transforms = timm.data.create_transform(**data_config, is_training=False)

    output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

    # or equivalently (without needing to set num_classes=0)

    output = model.forward_features(transforms(img).unsqueeze(0))
    # output is unpooled, a (1, 1374, 384) shaped tensor

    output = model.forward_head(output, pre_logits=True)
    # output is a (1, num_features) shaped tensor

Explore the dataset and runtime metrics of this model in the timm model results.
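
The 1374 in the unpooled (1, 1374, 384) shape follows from the model name: a 518 x 518 input split into 14 x 14 patches gives 37 x 37 = 1369 patch tokens, plus 1 class token and 4 register tokens. The helper below is only a worked check of that arithmetic; `token_count` is an illustrative name, not a timm function.

```python
def token_count(image_size=518, patch_size=14, num_registers=4, num_class_tokens=1):
    """Sequence length of a ViT with class and register tokens for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + num_registers + num_class_tokens


print(token_count())  # 1374
```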

    @article{darcet2023vision,
      title={Vision Transformers Need Registers},
      author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
      journal={arXiv preprint arXiv:2309.16588},
      year={2023}
    }

    @misc{oquab2023dinov2,
      title={DINOv2: Learning Robust Visual Features without Supervision},
      author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
      journal={arXiv:2304.07193},
      year={2023}
    }

    @article{dosovitskiy2020vit,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
      journal={ICLR},
      year={2021}
    }

    @misc{rw2019timm,
      author = {Ross Wightman},
      title = {PyTorch Image Models},
      year = {2019},
      publisher = {GitHub},
      journal = {GitHub repository},
      doi = {10.5281/zenodo.4414861},
      howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
    }