vit_small_patch14_reg4_dinov2.lvd142m Model Card

A Vision Transformer (ViT) image feature model with registers, pretrained on LVD-142M with the self-supervised DINOv2 method.

Model Details

Model type: Image classification / feature backbone
Model stats:
- Params (M): 22.1
- GMACs: 29.6
- Activations (M): 57.5
- Image size: 518 x 518
Papers:
- Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588
- DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Original: https://github.com/facebookresearch/dinov2
Pretraining dataset: LVD-142M

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import torch
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_small_patch14_reg4_dinov2.lvd142m', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_small_patch14_reg4_dinov2.lvd142m',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1374, 384) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
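
The `(1, 1374, 384)` shape noted above follows from the patch grid plus the prefix tokens. A minimal sketch of that arithmetic, assuming one class token and four register tokens (the `reg4` in the model name) sit alongside the patch tokens; `token_count` is a hypothetical helper for illustration, not part of timm:

```python
def token_count(image_size: int = 518, patch_size: int = 14,
                num_registers: int = 4, num_class_tokens: int = 1) -> int:
    """Sequence length of the unpooled output for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size          # 518 // 14 = 37 patches per side
    return grid * grid + num_class_tokens + num_registers


print(token_count())  # 37 * 37 + 1 + 4 = 1374
```

The same helper gives 261 tokens for a 224 x 224 input, which is why memory and compute grow quickly at the 518 x 518 resolution.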

Model Comparison

Explore the dataset and runtime metrics of this model in the timm model results.

Citation

@article{darcet2023vision,
  title={Vision Transformers Need Registers},
  author={Darcet, Timoth{\'e}e and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv preprint arXiv:2309.16588},
  year={2023}
}

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}

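Ascend NPU Adaptation

This section documents the adaptation and validation of vit_small_patch14_reg4_dinov2.lvd142m on the Huawei Ascend910B4 NPU.

1. Validation Environment

Component	Version
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0`
`timm`	`1.0.27`
`transformers`	`4.57.6`
`CANN`	`8.5.1`

NPU: 1 logical card (Ascend910B4)
Validation date: 2026-05-09

2. Model Loading and NPU Adaptation

Because timm pulls weights from the HuggingFace Hub by default, setting HF_ENDPOINT to a mirror site is recommended when working from mainland China: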

export HF_ENDPOINT=https://hf-mirror.com
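
When exporting the variable in the shell is inconvenient (for example in a notebook), the same setting can be applied from Python. This is a sketch under the assumption that `huggingface_hub` reads `HF_ENDPOINT` when it is first imported, so the variable has to be set before `timm` is imported:

```python
import os

# Keep any value the user already exported; otherwise fall back to the mirror.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

# Import timm (and anything else that pulls in huggingface_hub) only after this point.
print(os.environ["HF_ENDPOINT"])
```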

Example code for loading the model and moving it to the NPU:

import torch
import torch_npu
import timm

device = torch.device("npu:0")
torch.npu.set_device(device)

model = timm.create_model(
    "vit_small_patch14_reg4_dinov2.lvd142m",
    pretrained=True
)
model = model.to(device).eval()

# Get model preprocessing config
data_config = timm.data.resolve_model_data_config(model)
input_size = data_config["input_size"]  # (3, 518, 518)

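The model ran on the NPU as-is in this validation, with no operator replacement or code changes. The timm VisionTransformer implementation worked with torch_npu without modification.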
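
For scripts that should also run on machines without an NPU, the device can be chosen at runtime. A minimal sketch, assuming `torch.npu.is_available()` is exposed once `torch_npu` has been imported; `pick_device` is a hypothetical helper, not part of timm or torch_npu:

```python
def pick_device() -> str:
    """Return "npu:0" when torch_npu reports a usable device, else "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the torch.npu backend)
    except ImportError:
        # torch or torch_npu is not installed: fall back to CPU.
        return "cpu"
    return "npu:0" if torch.npu.is_available() else "cpu"


device_name = pick_device()
print(device_name)
```

The returned string can be passed to `torch.device(...)` in place of the hard-coded `"npu:0"`.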

3. Smoke Test

3.1 Basic Inference Check

import torch
import torch_npu
import timm

device = torch.device("npu:0")
model = timm.create_model("vit_small_patch14_reg4_dinov2.lvd142m", pretrained=True)
model = model.to(device).eval()

dummy_input = torch.randn(1, 3, 518, 518).to(device)
with torch.no_grad():
    output = model(dummy_input)

print(f"Output shape: {output.shape}")      # torch.Size([1, 384])
print(f"Output range: [{output.min():.4f}, {output.max():.4f}]")

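Results:

Output shape: [1, 384] (as expected)
Output range: [-4.2467, 6.4520]
Output mean: -0.0271
No errors; inference succeeded

3.2 Accuracy Comparison (NPU vs. CPU)

Output differences between NPU and CPU, using the same weights and input:

Metric	Value
Max absolute difference	`0.018992`
Mean absolute difference	`0.005008`

Conclusion: the NPU output differs from the CPU baseline by at most about 1.9e-2 (mean about 5.0e-3), that is, on the order of 1e-2, which is sufficient for visual feature extraction tasks.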
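
The two metrics in the table above are simple element-wise statistics. The source does not show how the validation script computes them, so the following is only a sketch of their definition on plain Python lists; `abs_diff_stats` is a hypothetical helper. With PyTorch tensors the same quantities would be `(a - b).abs().max()` and `(a - b).abs().mean()` after moving both outputs to the CPU.

```python
from typing import Sequence, Tuple


def abs_diff_stats(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (max absolute difference, mean absolute difference)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return max(diffs), sum(diffs) / len(diffs)


# Toy values standing in for flattened NPU and CPU outputs.
npu_out = [0.50, -1.25, 2.00]
cpu_out = [0.75, -1.25, 1.50]
max_diff, mean_diff = abs_diff_stats(npu_out, cpu_out)
print(max_diff, mean_diff)  # 0.5 0.25
```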

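4. Performance Reference

Inference throughput and latency were measured on a single Ascend910B4 card in fp32 (preprocessing excluded).

Batch size	Throughput (samples/s)	Latency (ms/batch)
1	94.48	10.58
4	254.01	15.75
8	267.66	29.89
16	253.84	63.03

Test notes:

Input size: 3 x 518 x 518
Warm-up iterations: 10
Timed iterations: 100
Synchronization: torch.npu.synchronize() after every batch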
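
The measurement procedure described in the notes can be sketched as a small timing harness. This is not the repository's script: `benchmark`, `step`, and `sync` are hypothetical names, and a trivial stand-in workload is used so the sketch runs without an NPU.

```python
import time
from typing import Callable, Tuple


def benchmark(step: Callable[[], object], sync: Callable[[], None],
              batch_size: int, warmup: int = 10, iters: int = 100) -> Tuple[float, float]:
    """Return (throughput in samples/s, latency in ms/batch)."""
    for _ in range(warmup):          # warm-up runs are not timed
        step()
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step()
        sync()                       # wait for the device after every batch
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    throughput = batch_size * iters / elapsed
    return throughput, latency_ms


# Stand-in workload so the sketch runs anywhere. On an NPU, `step` would run
# model(batch) under torch.no_grad() and `sync` would be torch.npu.synchronize.
throughput, latency_ms = benchmark(lambda: sum(range(1000)), lambda: None, batch_size=8)
print(f"{throughput:.1f} samples/s, {latency_ms:.4f} ms/batch")
```

Synchronizing after each batch matters because device kernels are launched asynchronously; without it the timer would mostly measure launch overhead.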

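Throughput peaks at batch size 8 (about 268 samples/s), where the amortized per-image time is about 3.7 ms (29.89 ms / 8). Single-image latency at batch size 1 is 10.58 ms.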
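
The best batch size and the per-image figure can be re-derived directly from the table. A short check using only the published numbers:

```python
# (batch size, throughput in samples/s, latency in ms/batch), copied from the table above
results = [
    (1, 94.48, 10.58),
    (4, 254.01, 15.75),
    (8, 267.66, 29.89),
    (16, 253.84, 63.03),
]

# Pick the row with the highest throughput, then amortize its batch latency.
best_batch, best_throughput, best_latency = max(results, key=lambda row: row[1])
per_image_ms = best_latency / best_batch

print(best_batch)               # 8
print(round(per_image_ms, 2))   # 3.74
```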

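5. Validation Script

The complete script used for this validation ships with the repository:

python validate_npu.py

The script covers:

NPU device detection and model loading
Single-image and batched inference output checks (shape, dtype, value range)
NPU vs. CPU accuracy comparison
Performance benchmark across multiple batch sizes

6. Notes

HF_ENDPOINT: On networks in mainland China, set export HF_ENDPOINT=https://hf-mirror.com; otherwise timm may time out when downloading weights from the official HuggingFace site.
CANN log directory: If the first torch_npu run prints can not create directory: /home/atomgit/ascend/log, the message can be ignored and does not affect inference. To silence it, create the directory manually.
Input size: The model was trained with 518 x 518 inputs, unlike the 224 x 224 of a standard ViT. Mind the input_size setting when using it.
Precision mode: This validation used fp32. For further speedup, mixed-precision inference via torch.npu.set_amp_dtype(torch.float16) can be tried, but the accuracy loss needs separate validation.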

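7. File List

File	Description
`model.safetensors`	Model weights (about 85 MB)
`validate_npu.py`	NPU adaptation validation script
`README.md`	Adaptation documentation (this document)
`validation_output.txt`	Raw output log of the validation script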
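
The listed weight size is consistent with the parameter count in the model stats. A rough check, assuming the checkpoint stores float32 weights and ignoring file metadata:

```python
params_millions = 22.1      # parameter count from the model stats
bytes_per_param = 4         # assumption: weights stored as float32

size_bytes = params_millions * 1e6 * bytes_per_param
print(f"{size_bytes / 1e6:.1f} MB")      # 88.4 MB
print(f"{size_bytes / 2**20:.1f} MiB")   # 84.3 MiB
```

Whether "about 85 MB" refers to decimal or binary megabytes is not stated; the estimate lands close to it either way.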

