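This document records the adaptation and verification results for C-RADIOv2-B on Ascend NPUs.

C-RADIOv2-B is a vision foundation model released by NVIDIA. It is based on the Vision Transformer and extracts a summary embedding and spatial features from an image, which can be used for downstream tasks such as image classification, semantic segmentation, depth estimation, and multimodal LLMs.
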
| Component | Version |
|---|---|
| Python | 3.11.14 |
| torch | 2.9.0+cpu |
| torch-npu | 2.9.0.post1+gitee7ba04 |
| transformers | 4.50.0 |
| timm | 1.0.27 |
| Pillow | >=9.0 |
| CANN | 8.5.1 |
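
The model weights can be downloaded to /opt/atomgit/weight/C-RADIOv2-B from AtomGit or from ModelScope:

    python3 -m atomgit download hf_mirrors/nvidia/C-RADIOv2-B -d /opt/atomgit/weight/C-RADIOv2-B

    modelscope download --model nv-community/C-RADIOv2-B --local_dir /opt/atomgit/weight/C-RADIOv2-B

Install the dependencies:

    pip install torch==2.9.0 transformers timm Pillow

Note: if torch-npu is already installed, make sure the torch and torch-npu versions match (this document uses 2.9.0).
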
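The version-consistency requirement in the note above can be checked programmatically. The following is a minimal sketch, assuming that "consistent" means the release part of the two version strings (major.minor.patch, ignoring suffixes such as +cpu or .post1) is identical. The helper names release_of and versions_match are hypothetical and are not part of torch or torch-npu.

```python
import re


def release_of(version: str) -> str:
    """Return the leading major.minor.patch part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return ".".join(match.groups())


def versions_match(torch_version: str, torch_npu_version: str) -> bool:
    """Assumption: versions are consistent when their release parts are equal."""
    return release_of(torch_version) == release_of(torch_npu_version)


# Version strings taken from the environment table in this document.
print(versions_match("2.9.0+cpu", "2.9.0.post1+gitee7ba04"))  # True
```

In a live environment the same check could be fed with torch.__version__ and torch_npu.__version__ instead of the literal strings.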
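Because this model loads a custom architecture via trust_remote_code=True, the Python files in the model directory must be copied into the HuggingFace modules cache directory so that transformers can load the model correctly: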

    mkdir -p ~/.cache/huggingface/modules/transformers_modules/C-RADIOv2-B
    cp /opt/atomgit/weight/C-RADIOv2-B/*.py ~/.cache/huggingface/modules/transformers_modules/C-RADIOv2-B/

Run the inference script:

    python inference.py

inference.py:

    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    # Local weight directory and target device
    hf_repo = "/opt/atomgit/weight/C-RADIOv2-B"
    device = "npu"

    # Load the preprocessor and the model (custom architecture, hence trust_remote_code)
    image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
    model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
    model.eval().to(device)
    print(f"Load model from {hf_repo}, device: {model.device}")

    image = Image.open("./assets/demo.png").convert("RGB")
    # resize to height=432, width=704 (PIL expects (width, height))
    image = image.resize((704, 432))
    pixel_values = image_processor(images=image, return_tensors="pt", do_resize=True).pixel_values
    pixel_values = pixel_values.to(device)

    # The model returns a summary embedding and per-patch spatial features
    summary, features = model(pixel_values)
    print(f"{summary.shape=}\n{features.shape=}")

Output:

    Load model from /opt/atomgit/weight/C-RADIOv2-B, device: npu:0
    summary.shape=torch.Size([1, 2304])
    features.shape=torch.Size([1, 1188, 768])

The model completes inference on the NPU successfully: the summary embedding has shape [1, 2304] and the spatial features have shape [1, 1188, 768].

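The spatial feature shape reported above is consistent with a ViT patch grid over the 432x704 input. As a quick sanity check, the sketch below assumes a patch size of 16; this value is inferred from the reported shapes and is not stated in this document. expected_num_tokens is a hypothetical helper, not part of the model's API.

```python
def expected_num_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Number of spatial tokens for an image split into square patches.

    patch_size=16 is an assumption inferred from the reported output shape.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


# Input resolution used by the inference script: height=432, width=704.
print(expected_num_tokens(432, 704))  # 27 * 44 = 1188
```

The result matches the second dimension of features.shape, which suggests one feature vector of width 768 per image patch.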

Run the benchmark:

    python benchmark.py --batch_size 1 --iterations 100 --warmup 10

benchmark.py:

    import time

    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    hf_repo = "/opt/atomgit/weight/C-RADIOv2-B"
    device = "npu"

    image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
    model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
    model.eval().to(device)

    # Same 704x432 (width x height) input as the inference script
    image = Image.open("./assets/demo.png").convert("RGB").resize((704, 432))
    pixel_values = image_processor(images=image, return_tensors="pt", do_resize=True).pixel_values.to(device)

    # Warmup: run a few untimed iterations first
    for _ in range(10):
        with torch.no_grad():
            _ = model(pixel_values)
    torch.npu.synchronize()

    # Benchmark: time 100 forward passes
    start = time.perf_counter()
    for _ in range(100):
        with torch.no_grad():
            _ = model(pixel_values)
    # Wait for all queued NPU work to finish before stopping the timer
    torch.npu.synchronize()
    end = time.perf_counter()

    latency = (end - start) / 100 * 1000  # average latency in ms
    throughput = 100 / (end - start)      # images per second at batch size 1
    print(f"Latency: {latency:.2f} ms")
    print(f"Throughput: {throughput:.2f} images/s")

Results:

| Metric | Value |
|---|---|
| Batch size | 1 |
| Iterations | 100 |
| Total time | 0.791 s |
| Average latency | 7.910 ms |
| Throughput | 126.42 images/s |
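The latency and throughput figures in the table follow directly from the total time and the iteration count. The sketch below reproduces that arithmetic; latency_and_throughput is a hypothetical helper, and the batch_size factor in the throughput formula is an assumption that reduces to the benchmark script's formula when the batch size is 1.

```python
def latency_and_throughput(total_seconds: float, iterations: int, batch_size: int = 1):
    """Return (average latency in ms, throughput in images/s)."""
    latency_ms = total_seconds / iterations * 1000
    # Assumption: each iteration processes batch_size images.
    throughput = iterations * batch_size / total_seconds
    return latency_ms, throughput


# Values from the results table: 100 iterations in 0.791 s at batch size 1.
latency_ms, throughput = latency_and_throughput(0.791, 100)
print(f"{latency_ms:.3f} ms, {throughput:.2f} images/s")  # 7.910 ms, 126.42 images/s
```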
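
Run the accuracy check:

    python accuracy.py

Using CPU (float32) as the baseline, the NPU (float32) outputs compare as follows:

| Metric | Summary | Features |
|---|---|---|
| Max absolute error | 1.09e-02 | 1.61e-02 |
| Mean absolute error | 1.98e-03 | 2.07e-03 |
| L2 relative error | 0.8667% | 0.7339% |
| Cosine similarity | 0.999964 | 0.999991 |

Conclusion: the L2 relative error between the NPU and CPU outputs is below 1% for both outputs, and the cosine similarity is above 0.9999 for both, so the accuracy check passes.
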
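accuracy.py itself is not reproduced in this document. The sketch below shows one common way to compute the four metrics in the table, assuming both outputs have been flattened to equal-length lists of floats and that the L2 relative error is defined as ||candidate - reference|| / ||reference||. The function name and these definitions are assumptions; only the pass thresholds (1% and 0.9999) come from the conclusion above.

```python
import math


def compare_outputs(reference, candidate):
    """Compare two flattened outputs; reference plays the role of the CPU result."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(c - r) for r, c in zip(reference, candidate)]
    ref_norm = math.sqrt(sum(r * r for r in reference))
    cand_norm = math.sqrt(sum(c * c for c in candidate))
    if ref_norm == 0.0 or cand_norm == 0.0:
        raise ValueError("inputs must have non-zero norm")
    diff_norm = math.sqrt(sum(d * d for d in diffs))
    dot = sum(r * c for r, c in zip(reference, candidate))
    return {
        "max_abs_err": max(diffs),
        "mean_abs_err": sum(diffs) / len(diffs),
        "l2_rel_err": diff_norm / ref_norm,
        "cosine_sim": dot / (ref_norm * cand_norm),
    }


# Toy vectors for illustration only; these are not model outputs.
metrics = compare_outputs([1.0, 2.0, 3.0], [1.0, 2.0, 3.001])
passed = metrics["l2_rel_err"] < 0.01 and metrics["cosine_sim"] > 0.9999
print(metrics, passed)
```

With real tensors, the same quantities could be computed after moving both outputs to the CPU and flattening them.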
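Files:

- inference.py: inference script
- benchmark.py: benchmark script
- accuracy.py: accuracy verification script
- README.md: adaptation document (with the NPU tag)
- output/inference_log.txt: inference run log
- output/benchmark_log.txt: benchmark log
- output/accuracy_log.txt: accuracy verification log
- assets/demo.png: sample test image

Notes:

- trust_remote_code=True loads the custom ViT architecture; before loading, make sure the transformers_modules cache directory contains the .py files from the model repository.
- get_nearest_supported_resolution can be used to obtain a valid resolution.
- torch.compile: TORCH_COMPILE_DISABLE does not need to be set.
- torch.npu.synchronize() ensures that NPU computation has finished before timing.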
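
The requirement that the transformers_modules cache directory contain the model repository's .py files can be verified before loading. The following is a minimal sketch using only the standard library; missing_module_files is a hypothetical helper, and the two paths are the ones used earlier in this document.

```python
from pathlib import Path


def missing_module_files(weight_dir, cache_dir):
    """Return names of .py files present in weight_dir but absent from cache_dir."""
    weight_dir = Path(weight_dir).expanduser()
    cache_dir = Path(cache_dir).expanduser()
    return sorted(
        path.name
        for path in weight_dir.glob("*.py")
        if not (cache_dir / path.name).is_file()
    )


missing = missing_module_files(
    "/opt/atomgit/weight/C-RADIOv2-B",
    "~/.cache/huggingface/modules/transformers_modules/C-RADIOv2-B",
)
print(missing)
```

An empty list means every .py file found in the weight directory also exists in the cache directory. It is also empty when the weight directory itself is missing, so that case should be checked separately.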