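This project adapts and validates facebook/dino-vits8 on Ascend NPUs. It includes an inference script, a performance benchmark, and an accuracy comparison.

DINO (Emerging Properties in Self-Supervised Vision Transformers) is a self-supervised training method for Vision Transformers released by Meta. dino-vits8 uses the ViT-S/8 architecture, was trained with self-supervision on ImageNet-1k, and can be used for downstream tasks such as image feature extraction.

The project builds on the transformers library and moves the model to the Ascend NPU for inference with `model.to("npu")`. No third-party library source code needs to be modified.
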
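Because the whole migration rests on the `"npu"` device string, scripts that should also run on a machine without an NPU may want a fallback. The following is a minimal sketch, not part of the project's scripts: `pick_device` is a hypothetical helper, and it assumes that importing `torch_npu` is what registers the `torch.npu` namespace.

```python
import importlib.util


def pick_device() -> str:
    """Return "npu" if torch_npu is installed and reports a usable device, else "cpu".

    Hypothetical helper for illustration; the project's scripts hard-code "npu".
    """
    # Without torch or torch_npu installed there is nothing to probe.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"

    import torch
    import torch_npu  # noqa: F401  (assumption: this import registers torch.npu)

    return "npu" if torch.npu.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

The returned string can be passed straight to `model.to(...)` and to the processor output's `.to(...)` call.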
| Component | Version |
|---|---|
| PyTorch | 2.9.0+cpu |
| torch-npu | 2.9.0.post1+gitee7ba04 |
| transformers | latest |
| Python | 3.11 |
| CANN | 8.5.1 |
| NPU count | 2 |

Commands for downloading the model weights to /opt/atomgit/weights/facebook/dino-vits8:

    python3 -m atomgit download hf_mirrors/facebook/dino-vits8 -d /opt/atomgit/weights/facebook/dino-vits8

    modelscope download --model facebook/dino-vits8 --local_dir /opt/atomgit/weights/facebook/dino-vits8

Original weights source: facebook/dino-vits8

Install the dependencies:

    pip install torch torch-npu transformers Pillow

If a vllm-ascend compatibility warning appears after `pip install`, it can be ignored, because this project does not depend on vllm-ascend.

The sample image used for this inference run was downloaded from the COCO validation set with wget:

    wget http://images.cocodataset.org/val2017/000000039769.jpg -O assets/test_image.jpg

The complete code of inference.py:

    import torch
    from PIL import Image
    from transformers import ViTImageProcessor, ViTModel

    device = "npu"
    model_name = "/opt/atomgit/weights/facebook/dino-vits8"
    image_path = "./assets/test_image.jpg"

    # Load the sample image from disk.
    image = Image.open(image_path)

    # Load the preprocessing config and the model weights from the local directory.
    processor = ViTImageProcessor.from_pretrained(model_name)
    model = ViTModel.from_pretrained(model_name)
    model.eval()
    model = model.to(device)
    print(f"Model loaded to device({device})")

    # Preprocess the image and move the input tensors to the NPU.
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Run the forward pass without tracking gradients.
    with torch.no_grad():
        outputs = model(**inputs)

    last_hidden_state = outputs.last_hidden_state
    cls_token = last_hidden_state[:, 0, :]  # [1, 384]
    patch_tokens = last_hidden_state[:, 1:, :]  # [1, 784, 384]

    print(f"{last_hidden_state.shape=}")
    print(f"{cls_token.shape=}")
    print(f"{patch_tokens.shape=}")

Run the script:

    python3 inference.py

Output:

    Some weights of ViTModel were not initialized from the model checkpoint at /opt/atomgit/weights/facebook/dino-vits8 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Model loaded to device(npu)
    last_hidden_state.shape=torch.Size([1, 785, 384])
    cls_token.shape=torch.Size([1, 384])
    patch_tokens.shape=torch.Size([1, 784, 384])

The model loads onto the NPU and completes image feature extraction. The output shape is [1, 785, 384]: the CLS token is [1, 384] and the patch tokens are [1, 784, 384].

The warning about uninitialized `pooler.dense` weights is expected: the DINO pretrained weights do not include a pooler layer.

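The sequence length of 785 reported above follows from the patch grid. The sketch below reproduces the arithmetic, assuming the processor resizes the image to 224×224 (an assumption consistent with 784 patches at patch size 8, but not stated in the log); `vit_token_layout` is a hypothetical helper name.

```python
def vit_token_layout(image_size: int = 224, patch_size: int = 8) -> tuple[int, int, int]:
    """Return (grid_side, num_patches, sequence_length) for a square ViT input.

    image_size=224 is an assumption; patch_size=8 comes from the ViT-S/8 name.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid_side = image_size // patch_size  # patches along one image side
    num_patches = grid_side * grid_side   # patch tokens
    sequence_length = num_patches + 1     # one extra position for the CLS token
    return grid_side, num_patches, sequence_length


grid_side, num_patches, sequence_length = vit_token_layout()
print(grid_side, num_patches, sequence_length)  # 28 784 785
```

Under the same assumption, the patch tokens could be viewed as a spatial feature map with `patch_tokens.reshape(1, grid_side, grid_side, -1)`, which gives a [1, 28, 28, 384] tensor.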
Run benchmark.py:

    python3 benchmark.py

Each batch size was run 100 times, with the following results:

| Batch Size | Total time (s) | Per-image latency (ms/img) | Throughput (img/s) |
|---|---|---|---|
| 1 | 0.575 | 5.748 | 173.96 |
| 2 | 0.579 | 2.894 | 345.58 |
| 4 | 0.764 | 1.911 | 523.32 |
| 8 | 1.234 | 1.543 | 648.17 |

- Per-image latency = total time / number of runs / batch_size * 1000
- Throughput = number of runs * batch_size / total time

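The two formulas above can be sketched in plain Python. This is not the content of benchmark.py, which is not shown here: `time_runs`, `latency_ms_per_image`, and `throughput_img_per_s` are hypothetical names, and the warm-up count and optional `sync` hook are assumptions. On an NPU, `sync` would plausibly be a device synchronization call so that queued work is finished before the clock is read.

```python
import time
from typing import Callable, Optional


def time_runs(step: Callable[[], object], runs: int = 100, warmup: int = 10,
              sync: Optional[Callable[[], object]] = None) -> float:
    """Return wall-clock seconds spent on `runs` calls of `step`.

    `warmup` calls run first and are not timed (assumed, not from the source).
    `sync` is called before starting and before stopping the clock.
    """
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        step()
    if sync is not None:
        sync()
    return time.perf_counter() - start


def latency_ms_per_image(total_s: float, runs: int, batch_size: int) -> float:
    """Per-image latency = total time / number of runs / batch_size * 1000."""
    return total_s / runs / batch_size * 1000


def throughput_img_per_s(total_s: float, runs: int, batch_size: int) -> float:
    """Throughput = number of runs * batch_size / total time."""
    return runs * batch_size / total_s


if __name__ == "__main__":
    # Recompute the derived columns from the rounded totals in the table.
    for batch_size, total_s in [(1, 0.575), (2, 0.579), (4, 0.764), (8, 1.234)]:
        print(batch_size,
              round(latency_ms_per_image(total_s, 100, batch_size), 3),
              round(throughput_img_per_s(total_s, 100, batch_size), 2))
```

Values recomputed this way agree with the table only to within the rounding of the total time column, since the table was presumably produced from unrounded timings.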
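Run accuracy.py to compare the NPU and CPU inference outputs:

    python3 accuracy.py

This comparison uses relative error, computed as:

    Relative Error = mean(|output_cpu - output_npu|) / mean(|output_cpu|) * 100%

where:

- output_cpu is the CPU inference output (float32)
- output_npu is the NPU inference output after it is moved back to the CPU (float32)
- the CLS token and the patch tokens are evaluated independently

| Item | Relative error |
|---|---|
| CLS token | 0.332245% |
| Patch tokens | 0.339284% |
| Maximum error | 0.339284% |

Conclusion: accuracy validation passed (maximum error < 1%).
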
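The relative error metric can be illustrated on flat lists of floats. This is a sketch of the formula only, not the contents of accuracy.py; `relative_error_percent` is a hypothetical name and the sample values are made up for illustration, not measurements.

```python
from typing import Sequence


def relative_error_percent(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Return mean(|reference - candidate|) / mean(|reference|) * 100."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    mean_abs_diff = sum(abs(r - c) for r, c in zip(reference, candidate)) / n
    mean_abs_ref = sum(abs(r) for r in reference) / n
    if mean_abs_ref == 0:
        raise ValueError("reference is all zeros; relative error is undefined")
    return mean_abs_diff / mean_abs_ref * 100


if __name__ == "__main__":
    # Toy values standing in for flattened CPU and NPU outputs.
    output_cpu = [1.0, -2.0, 3.0]
    output_npu = [1.01, -1.98, 3.03]
    error = relative_error_percent(output_cpu, output_npu)
    # Apply the 1% pass threshold used in the conclusion above.
    print(f"relative error: {error:.4f}%, passed: {error < 1.0}")
```

With torch tensors, the same quantity could be written as `(out_cpu - out_npu.cpu()).abs().mean() / out_cpu.abs().mean() * 100`, applied separately to the CLS token and the patch tokens.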
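- The sample image was downloaded with wget from the COCO validation set; no randomly generated or fabricated data was used.
- The warning about uninitialized `pooler.dense` weights is expected (the DINO pretrained weights do not include a pooler layer).
- torch.compile is not used, so the TORCH_COMPILE_DISABLE environment variable does not need to be set.
- inference.log, benchmark.log, and accuracy.log are located in the output/ directory.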