facebook/dinov2-large on Ascend NPU

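This project adapts and validates facebook/dinov2-large on Ascend NPU, covering the inference script, performance benchmarking, and accuracy comparison.

1. Introduction

DINOv2 is a self-supervised vision Transformer model released by Meta that can be used for downstream tasks such as image feature extraction and dense prediction. dinov2-large uses the ViT-L/14 architecture and has roughly 300 million parameters.

This project builds on the transformers library and moves the model to the Ascend NPU with model.to("npu") for inference. A monkey-patch that swaps in npu_gelu(approximate='none') resolves the accuracy deviation of the default GELU on NPU, reducing the NPU vs GPU error from 1.33% to 0.037%.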

2. Validation Environment

Component	Version
PyTorch	2.9.0+cpu
torch-npu	2.9.0.post1+gitee7ba04
transformers	latest
Python	3.11
CANN	8.5.1
NPU count	2

3. Downloading Weights

Option 1: AtomGit (recommended)

python3 -m atomgit download hf_mirrors/facebook/dinov2-large -d /opt/atomgit/weights/facebook/dinov2-large

Option 2: ModelScope

modelscope download --model facebook/dinov2-large --local_dir /opt/atomgit/weights/facebook/dinov2-large

Original weight source: facebook/dinov2-large

4. Dependencies

pip install torch torch-npu transformers Pillow

If pip install reports a vllm-ascend compatibility warning, it can be ignored, because this project does not depend on vllm-ascend.

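5. Inference Validation

Sample Input

The sample image used for this inference run is downloaded from the COCO validation set with wget:

wget http://images.cocodataset.org/val2017/000000039769.jpg -O assets/000000039769.jpg

[Image: sample input image]

Inference Script

The complete code of inference.py is as follows: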

import torch
import torch.nn as nn
import torch_npu
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

device = "npu"
model_name = "/opt/atomgit/weights/facebook/dinov2-large"
url = "./assets/000000039769.jpg"
image = Image.open(url)

processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

model.eval()
model = model.to(device)
print(f"Model loaded to device({device})")

# Monkey-patch: replace MLP GELU with npu_gelu(approximate='none') for higher precision
class NPUGELU_none(nn.Module):
    def forward(self, x):
        return torch_npu.npu_gelu(x, approximate='none')

for layer in model.encoder.layer:
    layer.mlp.activation = NPUGELU_none()
print("Applied npu_gelu(approximate='none') patch to all MLP layers")

inputs = processor(images=image, return_tensors="pt").to(device)
inputs = {k: v.to(torch.float32) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
cls_token = last_hidden_states[:, 0, :]  # [1, 1024]
patch_tokens = last_hidden_states[:, 1:, :]  # [1, 256, 1024]
print(f"{last_hidden_states.shape=}")
print(f"{cls_token.shape=}")
print(f"{patch_tokens.shape=}")

# Save result for accuracy comparison
torch.save(last_hidden_states.cpu(), "output/npu_result.pt")
print("NPU result saved to output/npu_result.pt")

How to Run

python3 inference.py

Sample Output

Model loaded to device(npu)
Applied npu_gelu(approximate='none') patch to all MLP layers
last_hidden_states.shape=torch.Size([1, 257, 1024])
cls_token.shape=torch.Size([1, 1024])
patch_tokens.shape=torch.Size([1, 256, 1024])
NPU result saved to output/npu_result.pt

The model loads onto the NPU and completes image feature extraction. The output last_hidden_state has shape [1, 257, 1024], of which the CLS token is [1, 1024] and the patch tokens are [1, 256, 1024].
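The sequence length of 257 follows from the patch grid plus one CLS token. The sketch below reproduces that arithmetic. The patch size of 14 comes from the ViT-L/14 architecture named above; the 224x224 input size is an assumption about the image processor's default crop, and vit_token_count is a hypothetical helper, not part of inference.py.

```python
def vit_token_count(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Number of tokens a ViT emits for a square image.

    extra_tokens counts non-patch tokens such as the CLS token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


# Assumed 224x224 input, patch size 14 (ViT-L/14): 16 * 16 patches + 1 CLS token.
print(vit_token_count(224, 14))  # 257
```

If the processor were configured with a different crop size, the patch-token count would change accordingly.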

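6. Performance Benchmark

Run benchmark.py:

python3 benchmark.py

Performance Results

Each batch size is warmed up for 10 iterations and then run for 100 iterations. The table reports p50/p90/p99 per-image latency:

Batch Size	p50 (ms)	p90 (ms)	p99 (ms)	Throughput (img/s)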
1	12.754	12.978	13.175	78.51
2	6.493	6.607	7.294	153.23
4	4.566	4.573	4.587	218.98
8	4.081	4.088	4.101	245.02
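benchmark.py itself is not reproduced in this README. As a rough illustration of the procedure described above (warmup, timed iterations, per-image latency percentiles), a minimal sketch could look like the following. The names percentile, benchmark, and run_batch are hypothetical, and deriving throughput from the mean per-image latency is an assumption rather than the documented method. On an NPU, the device would also need to be synchronized (for example with torch.npu.synchronize()) before each timer read, which this dependency-free sketch leaves out.

```python
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks, q in [0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def benchmark(run_batch: Callable[[], None], batch_size: int,
              warmup: int = 10, iters: int = 100) -> Dict[str, float]:
    """Time run_batch and report per-image latency statistics in milliseconds."""
    for _ in range(warmup):  # warmup iterations are not timed
        run_batch()

    per_image_ms: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        per_image_ms.append(elapsed_ms / batch_size)

    mean_ms = sum(per_image_ms) / len(per_image_ms)
    return {
        "p50_ms": percentile(per_image_ms, 50),
        "p90_ms": percentile(per_image_ms, 90),
        "p99_ms": percentile(per_image_ms, 99),
        # Assumption: throughput is derived from the mean per-image latency.
        "throughput_img_s": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in CPU workload; a real run would call the model on one batch.
    print(benchmark(lambda: sum(range(10_000)), batch_size=1))
```

Dividing each batch latency by the batch size is what makes the reported numbers per-image figures, which is why latency in the table drops as the batch size grows.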

7. Accuracy Evaluation

Run accuracy.py to compare the NPU and GPU inference outputs:

python3 accuracy.py

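Metric Definition

The accuracy comparison uses relative error, computed as:

Relative Error = mean(|output_gpu - output_npu|) / mean(|output_gpu|) * 100%

where:

output_gpu is the GPU inference output (float32), saved in assets/gpu_result.pt
output_npu is the NPU inference output after being moved back to the CPU (float32)
The error is computed separately for the CLS token and the patch tokens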
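The formula above can be sketched in a few lines. accuracy.py is not shown in this README, so the function below is only an illustration written on flat Python sequences to stay dependency-free; relative_error_percent is a hypothetical name. With torch tensors the same quantity would presumably be computed as (gpu - npu).abs().mean() / gpu.abs().mean() * 100.

```python
from typing import Sequence


def relative_error_percent(reference: Sequence[float],
                           candidate: Sequence[float]) -> float:
    """mean(|reference - candidate|) / mean(|reference|) * 100."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    if not reference:
        raise ValueError("inputs must not be empty")
    mean_abs_diff = sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)
    mean_abs_ref = sum(abs(r) for r in reference) / len(reference)
    if mean_abs_ref == 0.0:
        raise ValueError("reference must not be all zeros")
    return mean_abs_diff / mean_abs_ref * 100.0


# Toy values, not real model outputs: mean |diff| = 0.25, mean |ref| = 2.0.
gpu_values = [2.0, -2.0]
npu_values = [2.5, -2.0]
print(relative_error_percent(gpu_values, npu_values))  # 12.5
```

Because the denominator is the mean magnitude of the GPU output rather than a per-element value, elements close to zero do not inflate the metric.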

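Accuracy Optimization Notes

By default, the transformers library uses torch.nn.functional.gelu, whose mapped implementation on Ascend NPU has an accuracy deviation, so the error accumulates layer by layer through the deep Transformer. This project monkey-patches the GELU of every MLP to torch_npu.npu_gelu(x, approximate='none'), which uses a high-precision erf-based implementation and significantly reduces the accuracy gap between NPU and GPU.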

Metric	Before (default gelu)	After (npu_gelu none)	Improvement
CLS token error	1.3325%	0.0251%	~53x
Patch tokens error	0.8591%	0.0369%	~23x
Max error	1.3325%	0.0369%	~36x
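For background on what an erf-based GELU is, the sketch below compares the exact erf formulation with the widely used tanh approximation in plain Python. It only illustrates that two GELU formulas can differ slightly per activation; it does not reproduce the NPU kernels and makes no claim about which formula the default NPU mapping uses. The function names are hypothetical.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Largest gap between the two formulas on a grid over [-4, 4].
    grid = [i / 100.0 for i in range(-400, 401)]
    max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in grid)
    print(f"max |gelu_erf - gelu_tanh| on [-4, 4]: {max_gap:.2e}")
```

A per-activation gap of this size looks negligible in isolation, but in a model with many stacked Transformer layers small activation differences can compound, which is consistent with the layer-by-layer accumulation described above.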

Accuracy Results

Item	Relative Error
CLS token	0.025075%
Patch tokens	0.036940%
Max error	0.036940%

Conclusion: accuracy validation passed (max error < 1%)

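8. Notes

The image sample is downloaded from the COCO validation set with wget; no randomly generated or fabricated data is used.
If a use_fast notice appears during inference, follow it and pass use_fast=False or use_fast=True explicitly to AutoImageProcessor.from_pretrained.
This project does not use torch.compile, so the TORCH_COMPILE_DISABLE environment variable does not need to be set.
The approximate='tanh' mode of npu_gelu does not improve accuracy; the 'none' mode must be used.
For detailed run logs, see inference.log, benchmark.log, and accuracy.log in the output/ directory.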
