WebSSL-DINO300M-224 Ascend NPU Deployment Guide

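Overview

WebSSL-DINO300M-224 is a 300M-parameter Vision Transformer (ViT) model trained on 2 billion web images with the DINOv2 self-supervised learning method. This project provides a deployment recipe for running it on Huawei Ascend NPUs.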

Features

Inference acceleration on Ascend NPU
CPU vs. NPU accuracy comparison test
Image feature extraction
224x224 input resolution

Environment

Item	Version/Details
Device	Ascend 910B

Directory structure

webssl-dino300m-full2b-224-ascend/
├── inference.py          # accuracy test script
├── test.log              # test log
└── README.md             # this document

Deployment steps

1. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh
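
After sourcing the script, a quick sanity check can confirm that the shell is ready before loading the model. The sketch below is a hypothetical helper, not part of this project. It assumes that set_env.sh exports ASCEND_TOOLKIT_HOME (check your CANN version if the variable is named differently) and that the torch_npu package provides the PyTorch NPU backend.

```python
import importlib.util
import os


def check_ascend_env(environ=None):
    """Return a list of human-readable problems; an empty list means OK."""
    environ = os.environ if environ is None else environ
    problems = []
    # Assumption: set_env.sh exports ASCEND_TOOLKIT_HOME.
    if not environ.get("ASCEND_TOOLKIT_HOME"):
        problems.append("ASCEND_TOOLKIT_HOME is not set; source set_env.sh first")
    # find_spec only looks the package up; it does not import it.
    if importlib.util.find_spec("torch_npu") is None:
        problems.append("torch_npu is not installed in this Python environment")
    return problems


if __name__ == "__main__":
    for problem in check_ascend_env():
        print("WARNING:", problem)
```

An empty result only means the prerequisites look present; it does not prove that the driver or the device itself is healthy.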

2. Prepare the model files

Place the model files in the webssl-dino300m-full2b-224/ directory:

config.json - model configuration
preprocessor_config.json - preprocessor configuration
model.safetensors - model weights
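
A missing file usually surfaces only as a loading error later on, so it can help to verify the directory up front. The following is a minimal sketch using only the standard library; the function name missing_model_files is hypothetical, and the file list is copied from the step above.

```python
from pathlib import Path

# The three files this guide expects in the model directory.
REQUIRED_FILES = ("config.json", "preprocessor_config.json", "model.safetensors")


def missing_model_files(model_dir):
    """Return the names of required files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("webssl-dino300m-full2b-224")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files are present.")
```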

3. Run the accuracy test

cd webssl-dino300m-full2b-224-ascend/
python3 inference.py

Test verification

Accuracy test results

Metric	Measured	Threshold	Status
Max Error (sum)	6.11e-02	< 1.00e-01	PASS
Max Error (mean)	1.78e-05	< 1.00e-04	PASS
Max Error (std)	9.81e-05	< 1.00e-04	PASS
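
The PASS/FAIL column follows from comparing each measured value with its threshold. The sketch below reproduces that comparison with the numbers from the table; it is an illustration only, since how inference.py computes the three metrics is not shown in this document, and the helper names are hypothetical.

```python
# Values copied from the results table: name -> (measured, threshold).
RESULTS = {
    "Max Error (sum)": (6.11e-02, 1.00e-01),
    "Max Error (mean)": (1.78e-05, 1.00e-04),
    "Max Error (std)": (9.81e-05, 1.00e-04),
}


def evaluate(results):
    """Map each metric name to PASS when measured < threshold, else FAIL."""
    return {
        name: "PASS" if measured < threshold else "FAIL"
        for name, (measured, threshold) in results.items()
    }


def headroom(measured, threshold):
    """Fraction of the threshold that is still unused (0.0 means at the limit)."""
    return 1.0 - measured / threshold


if __name__ == "__main__":
    for name, status in evaluate(RESULTS).items():
        measured, threshold = RESULTS[name]
        print(f"{name}: {status} (headroom {headroom(measured, threshold):.1%})")
```

Note that the std metric passes with only about 2% headroom, so small changes in the software stack or inputs may push it over the threshold.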

Performance data

Operation	Time
Model loading	7.59s
CPU reference computation (20 tensors)	0.90s
NPU inference (20 tensors)	0.09s
Image inference (224x224)	4.86s
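
The CPU and NPU rows cover the same 20 tensors, so they can be compared directly. The short calculation below derives the speedup and the per-tensor NPU time from the table values; the figures appear to come from a single run and should be read as indicative rather than as a benchmark.

```python
# Timings copied from the performance table (seconds, for 20 tensors each).
cpu_seconds = 0.90
npu_seconds = 0.09
num_tensors = 20

speedup = cpu_seconds / npu_seconds
npu_ms_per_tensor = npu_seconds / num_tensors * 1000

print(f"NPU speedup over CPU: {speedup:.1f}x")
print(f"NPU time per tensor: {npu_ms_per_tensor:.1f} ms")
```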

Test log

The full test log is saved in test.log.

Usage example

Running inference

import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from PIL import Image
from transformers import AutoImageProcessor, Dinov2Model

model_path = "webssl-dino300m-full2b-224"
device = torch.device("npu:0")

# Load the weights in bfloat16 and move the model to the NPU.
# Older transformers releases expect torch_dtype= instead of dtype=.
model = Dinov2Model.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to(device).eval()

processor = AutoImageProcessor.from_pretrained(model_path)

# Convert to RGB so grayscale or RGBA files match the 3-channel input.
image = Image.open("path/to/image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"].to(device)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
    # Token 0 is the CLS token; the remaining tokens are patch embeddings.
    cls_features = outputs.last_hidden_state[:, 0]
    patch_features = outputs.last_hidden_state[:, 1:]
    print(f"CLS features shape: {cls_features.shape}")
    print(f"Patch features shape: {patch_features.shape}")
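
To sanity-check the printed shapes, the expected sequence length can be worked out from the input resolution. The sketch below assumes a patch size of 14, which is typical for DINOv2-style models but is not stated in this document; confirm it against the patch_size field in config.json. The function name is hypothetical.

```python
def expected_token_count(image_size, patch_size, num_special_tokens=1):
    """Number of tokens a ViT produces for a square image.

    num_special_tokens counts non-patch tokens such as the CLS token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + num_special_tokens


if __name__ == "__main__":
    # Assumption: patch_size=14; verify it in config.json.
    total = expected_token_count(224, 14)
    print(f"Total tokens: {total}, patch tokens: {total - 1}")
```

Under that assumption a 224x224 input gives 16 x 16 = 256 patch tokens plus one CLS token, so the second dimension of patch_features should be 256.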

Calling the processor

BitImageProcessor is called as follows:

inputs = processor(
    images=image,      # PIL Image or numpy array
    return_tensors="pt"
)

Model structure

Main model components:

Component	Description
embeddings	Image embeddings (cls_token, position_embeddings, patch_embeddings)
encoder	Transformer encoder (24 layers)
output	Feature output

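FAQ

Q: The accuracy test fails?

A: Check that the NPU driver is installed correctly and that the CANN environment script has been sourced (step 1).

Q: Which image formats are supported?

A: Any format that PIL can open, such as JPEG and PNG, loaded as RGB images.

Q: Inference takes a long time?

A: The DINOv2 model is fairly large (300M parameters), and the first inference takes about 5 seconds. Subsequent inferences reuse the cache, so they do not pay this cost again.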

License

This project follows the original WebSSL-DINO300M license (cc-by-nc-4.0).