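google/vit-large-patch32-384 is a ViT-Large image classification model released by Google, based on the Vision Transformer architecture. It takes 384x384 input with a patch size of 32, has 306.6M parameters, and outputs the 1000 ImageNet-1K classes.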
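
The 306.6M figure can be cross-checked with a little arithmetic. The sketch below assumes the standard ViT-Large hyperparameters (hidden size 1024, 24 layers, MLP size 4096), which are not stated in this README, and a classification head with no pooler layer. Treat it as a sanity check rather than an authoritative count.

```python
# Sanity check: token count and parameter count for ViT-Large/32 at 384x384.
# Assumed (not stated in this README): HIDDEN=1024, LAYERS=24, MLP=4096.

IMAGE_SIZE, PATCH_SIZE, CHANNELS, NUM_CLASSES = 384, 32, 3, 1000
HIDDEN, LAYERS, MLP = 1024, 24, 4096


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


num_patches = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 12 x 12 grid
num_tokens = num_patches + 1                   # plus the [CLS] token

embeddings = (
    linear(PATCH_SIZE * PATCH_SIZE * CHANNELS, HIDDEN)  # patch projection
    + num_tokens * HIDDEN                               # position embeddings
    + HIDDEN                                            # [CLS] token
)
per_layer = (
    4 * linear(HIDDEN, HIDDEN)                   # Q, K, V and output projections
    + linear(HIDDEN, MLP) + linear(MLP, HIDDEN)  # feed-forward block
    + 2 * 2 * HIDDEN                             # two LayerNorms (scale + bias)
)
total = (
    embeddings
    + LAYERS * per_layer
    + 2 * HIDDEN                   # final LayerNorm
    + linear(HIDDEN, NUM_CLASSES)  # classification head
)

print(f"tokens per image: {num_tokens}")
print(f"parameters: {total:,} (~{total / 1e6:.1f}M)")
```

Under these assumptions the model sees 145 tokens per image and the total comes to 306,632,680 parameters, which rounds to the 306.6M quoted above.
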
This project adapts it into a submission package that runs on a single Ascend NPU (Ascend910B4) and uses the official AutoImageProcessor for preprocessing.
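
A minimal sketch of what such an inference path might look like, assuming the standard transformers classes `AutoImageProcessor` and `AutoModelForImageClassification` and assuming `torch_npu` is importable in the container. The names `format_topk`, `classify` and `image_path` are hypothetical, and the project's actual inference.py may be organized differently.

```python
def format_topk(probs, id2label, k=5):
    """Return lines like 'Top-1: label (12.34%)' for the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [
        f"Top-{rank}: {id2label[i]} ({probs[i]:.2%})"
        for rank, i in enumerate(order, start=1)
    ]


def classify(image_path, model_id="google/vit-large-patch32-384", device="npu:0"):
    """Hypothetical helper: classify one image and return formatted Top-5 lines."""
    # Heavy imports stay local so format_topk remains usable without them.
    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with torch)
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModelForImageClassification.from_pretrained(model_id)
    model = model.to(device).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        logits = model(**inputs).logits  # one row of class scores per image

    probs = torch.softmax(logits[0].float(), dim=-1).cpu().tolist()
    return format_topk(probs, model.config.id2label)
```

Calling `classify("some_image.jpg")` on a machine with the NPU stack installed would return five formatted lines; the file name here is only a placeholder.
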

| Item | Version / Info |
|---|---|
| NPU model | Ascend910B4 (29.5 GB HBM) |
| CANN | 8.5.1 |
| torch | 2.9.0+cpu |
| torch_npu | 2.9.0.post1+gitee7ba04 |
| transformers | 4.57.6 |
| Python | 3.11.14 |
| npu-smi | 25.5.1 |

    pip install -r requirements.txt
    # torch_npu is usually provided by the Ascend container and does not need to be installed separately

    # Set the HuggingFace mirror
    export HF_ENDPOINT=https://hf-mirror.com
    # Inference
    python inference.py
    # Accuracy verification (CPU vs NPU)
    python eval_accuracy.py
    # Performance benchmark
    python benchmark.py

Log files:

- logs/inference.log
- logs/prediction.txt
- logs/accuracy.log
- logs/benchmark.log
- logs/env_check.log

Environment check:

    torch.npu.is_available(): True
    torch.npu.device_count(): 1
    torch.npu.get_device_name(0): Ascend910B4
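
Scripts that should also run on a machine without an NPU can guard on the same check. This is a minimal sketch, assuming `torch_npu` registers `torch.npu` when imported; `pick_device` is a hypothetical helper, not part of this project.

```python
def pick_device():
    """Return 'npu:0' when an Ascend NPU is usable, otherwise 'cpu'."""
    try:
        import torch
        import torch_npu  # noqa: F401  (adds the torch.npu namespace)
    except ImportError:
        # Either package is missing, so fall back to CPU.
        return "cpu"
    return "npu:0" if torch.npu.is_available() else "cpu"


device = pick_device()
print(f"using device: {device}")
```
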

Inference result (Top-5):

    Top-1: sandbar, sand bar (27.13%)
    Top-2: seashore, coast, seacoast, sea-coast (26.09%)
    Top-3: sea lion (14.94%)
    Top-4: lakeside, lakeshore (8.62%)
    Top-5: promontory, headland, head, foreland (6.03%)

| Metric | Value |
|---|---|
| Mean latency | 23.7 ms |
| Min latency | 23.5 ms |
| Max latency | 24.1 ms |
| P50 | 23.7 ms |
| P90 | 24.0 ms |
| P95 | 24.0 ms |
| Throughput | 42.14 images/sec |
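
Statistics like these can be derived from a list of per-iteration latencies. The sketch below is one possible way to compute them and is not necessarily what benchmark.py does. It assumes nearest-rank percentiles and defines throughput as batch size divided by mean latency; `summarize` and the sample values are illustrative only.

```python
import math
import statistics


def percentile(sorted_ms, p):
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize(latencies_ms, batch_size=1):
    """Summarize per-iteration latencies given in milliseconds."""
    data = sorted(latencies_ms)
    mean_ms = statistics.fmean(data)
    return {
        "mean_ms": mean_ms,
        "min_ms": data[0],
        "max_ms": data[-1],
        "p50_ms": percentile(data, 50),
        "p90_ms": percentile(data, 90),
        "p95_ms": percentile(data, 95),
        # images per second = images per iteration / seconds per iteration
        "throughput": batch_size * 1000.0 / mean_ms,
    }


sample = [23.5, 23.7, 23.7, 24.0, 24.1]  # made-up latencies, not measured data
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

If throughput is defined this way with batch size 1, the reported 42.14 images/sec corresponds to a mean of roughly 23.73 ms, which agrees with the rounded 23.7 ms in the table.
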

Input shape: [1, 3, 384, 384] (384x384 resolution)

| Metric | Value |
|---|---|
| max_abs_diff (logits) | 0.026264 |
| mean_abs_diff (logits) | 0.005554 |
| prob_max_diff | 0.004743 |
| Top-1 match | True |
| Top-5 match | True |
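
The metrics in this table can be computed from the CPU and NPU logits for the same input. The following is a pure-Python sketch under assumed definitions: absolute differences are taken element-wise, `prob_max_diff` is the largest difference after softmax, and a Top-k match means both runs rank the same class indices in the same order. The function `compare_logits` and the sample vectors are hypothetical; eval_accuracy.py may define these differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_indices(values, k):
    """Indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def compare_logits(ref, test, k=5):
    """Compare a reference (e.g. CPU) logit vector with a test (e.g. NPU) one."""
    diffs = [abs(a - b) for a, b in zip(ref, test)]
    prob_diffs = [abs(a - b) for a, b in zip(softmax(ref), softmax(test))]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "prob_max_diff": max(prob_diffs),
        "top1_match": topk_indices(ref, 1) == topk_indices(test, 1),
        "topk_match": topk_indices(ref, k) == topk_indices(test, k),
    }


# Made-up six-class example, not real model output.
cpu_logits = [2.0, 1.0, 0.5, -1.0, 0.0, 3.0]
npu_logits = [2.01, 0.99, 0.5, -1.0, 0.02, 2.98]
report = compare_logits(cpu_logits, npu_logits)
print(report)
```
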

Preprocessing: ViTImageProcessor (the official AutoImageProcessor)

Self-verification screenshot: see screenshots/self_verification.png
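
Logs:

- logs/env_check.log: environment check details
- logs/inference.log: inference results
- logs/prediction.txt: Top-5 predictions
- logs/accuracy.log: accuracy comparison
- logs/benchmark.log: performance benchmark

Notes:

- Preprocessing uses the official AutoImageProcessor (ViTImageProcessor), not a hand-written substitute.
- HF_ENDPOINT=https://hf-mirror.com points at a mirror to speed up model download.

#NPU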