MiniMax-AI/VTP-Base-f16d64 on Ascend NPU

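1. Introduction

This document records the adaptation and validation results for MiniMax-AI/VTP-Base-f16d64 on Huawei Ascend NPU (Ascend 910B).

VTP-Base-f16d64 is a visual tokenizer pre-training (VTP) model from MiniMax. It is built on the Vision Transformer architecture and supports three uses at once:

Contrastive learning (CLIP/SigLIP-style image-text alignment)
Self-supervised learning (DINOv2-style feature extraction)
Image reconstruction (auto-encoder-style tokenizer)

The model has about 295.6M parameters and takes 256×256 input. On ImageNet it reaches a zero-shot accuracy of 73.2 and a reconstruction rFID of 0.74.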

2. Validation environment

Component	Version
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`transformers`	`4.55.4`
`torchvision`	`0.20.0+cpu`
`timm`	`1.0.27`
`omegaconf`	`2.3.0`

NPU: 1 logical device (Ascend 910B4)
Model path: ./VTP-Base-f16d64
CANN version: 8.5.1
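
To compare a local environment against the table above, a small report of installed versions can help. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for this example, and the distribution names are taken from the table.

```python
from importlib import metadata

# Distribution names as listed in the version table.
PACKAGES = ["torch", "torch-npu", "transformers", "torchvision", "timm", "omegaconf"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string}, with None for packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<14}{version or 'not installed'}")
```

The CANN version is not a Python package, so this sketch does not report it.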

3. Environment setup

3.1 Install dependencies

pip install -U atomgit
pip install torch torchvision transformers timm omegaconf
# torch-npu and the CANN driver must already be installed

3.2 Download the model weights

from atomgit_hub import snapshot_download
snapshot_download("MiniMax-AI/VTP-Base-f16d64", local_dir="./VTP-Base-f16d64")

3.3 Get the official inference code

git clone https://github.com/MiniMax-AI/VTP.git
export PYTHONPATH="${PYTHONPATH}:$(pwd)/VTP"
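
Before running inference, it can be worth confirming that the two previous steps took effect. The sketch below is a hypothetical preflight helper, not part of the VTP repository: it checks that the `vtp` package is importable and that the model directory looks like a checkpoint. The file names `config.json` and `model.safetensors` are assumed from the usual Hugging Face checkpoint layout.

```python
import importlib.util
from pathlib import Path


def preflight(model_dir="./VTP-Base-f16d64"):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []

    # `vtp` is the package name used by the import lines of the inference script.
    if importlib.util.find_spec("vtp") is None:
        problems.append("package 'vtp' not importable; check PYTHONPATH")

    model_path = Path(model_dir)
    if not model_path.is_dir():
        problems.append(f"model directory not found: {model_path}")
    else:
        # Assumed file names (typical Hugging Face checkpoint layout).
        for filename in ("config.json", "model.safetensors"):
            if not (model_path / filename).is_file():
                problems.append(f"missing {filename} in {model_path}")
    return problems


if __name__ == "__main__":
    for problem in preflight() or ["no problems found"]:
        print(problem)
```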

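4. Model inference

Key changes for the Ascend NPU adaptation:

Move the model and the input tensors to the npu device
The torch.autocast("cuda") call in the official example can simply be removed on Ascend, which leaves inference at the default precision
torch_npu maps standard PyTorch operators to the NPU automatically, so the model structure needs no changes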
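
Since the only required change is the target device, device selection can be isolated in one place. The sketch below is an illustration, not code from the official repository: the helper name `pick_device` is hypothetical, and it falls back to the CPU when `torch_npu` cannot be loaded.

```python
def pick_device(index=0):
    """Return "npu:<index>" when an Ascend NPU is usable, otherwise "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers the "npu" backend)
    except (ImportError, OSError):
        # torch or torch_npu is missing, or the CANN libraries failed to load.
        return "cpu"
    if torch.npu.is_available():
        return f"npu:{index}"
    return "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

With a helper like this, the same script can also produce a CPU baseline on a machine without an NPU.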

Inference script that passed validation:

import torch
import torch_npu
from PIL import Image
from torchvision import transforms

from vtp.models.vtp_hf import VTPConfig, VTPModel
from vtp.tokenizers import get_tokenizer

MODEL_PATH = "./VTP-Base-f16d64"
DEVICE = "npu:0"

model = VTPModel.from_pretrained(MODEL_PATH).to(DEVICE).eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Convert to RGB so that RGBA or grayscale PNGs match the 3-channel Normalize
image = preprocess(Image.open("dog.png").convert("RGB")).unsqueeze(0).to(DEVICE)

# ---------- 1. Image reconstruction ----------
with torch.no_grad():
    latents = model.get_reconstruction_latents(image)
    recon = model.get_latents_decoded_images(latents)
print("Latents shape:", latents.shape)  # [1, 64, 16, 16]

# ---------- 2. CLIP zero-shot classification ----------
tokenizer = get_tokenizer('ViT-B-32', context_length=model.config.text_context_length)
text = tokenizer(["a diagram", "a dog", "a cat", "a person"]).to(DEVICE)
with torch.no_grad():
    image_features = model.get_clip_image_feature(image, normalize=True)
    text_features = model.get_clip_text_feature(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", [f"{p:.4f}" for p in text_probs[0].tolist()])

# ---------- 3. SSL feature extraction ----------
with torch.no_grad():
    features = model.get_last_layer_feature(image)
    cls_token = features['cls_token']       # [1, 768]
    patch_tokens = features['patch_tokens'] # [1, 256, 768]
print("CLS shape:", cls_token.shape)
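
The shapes noted in the script's comments follow from simple arithmetic on the input size. The sketch below reproduces them under two assumptions that the document does not state: that "f16d64" means a spatial downsampling factor of 16 with 64 latent channels, and that the ViT patch size is also 16. The function name `expected_shapes` is made up for this example.

```python
def expected_shapes(image_size=256, downsample=16, latent_dim=64, width=768, batch=1):
    """Compute the tensor shapes expected from the three inference paths."""
    # Assumption: one latent position and one patch token per 16x16 pixel block.
    grid = image_size // downsample
    return {
        "latents": (batch, latent_dim, grid, grid),
        "cls_token": (batch, width),
        "patch_tokens": (batch, grid * grid, width),
    }


if __name__ == "__main__":
    for name, shape in expected_shapes().items():
        print(name, shape)
```

Comparing these tuples with `tuple(tensor.shape)` gives a quick shape check for the smoke test.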

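5. Smoke test

Results:

The model loads onto the NPU without errors
get_reconstruction_latents / get_latents_decoded_images return tensors of the expected shape
get_clip_image_feature / get_clip_text_feature give a plausible zero-shot probability distribution (for the dog image, "a dog" has the highest probability, about 99.8%)
get_last_layer_feature extracts cls_token and patch_tokens correctly
get_intermediate_layers_feature extracts multi-level features correctly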
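
A probability near 99.8% for one label is expected from the way the script scores labels: cosine similarities are multiplied by 100 before the softmax, so small gaps in similarity become large gaps in probability. The sketch below shows the effect in plain Python. The similarity values are made up for illustration and are not model outputs.

```python
import math


def scaled_softmax(similarities, scale=100.0):
    """Softmax over scale * similarity, as in the zero-shot step of the script."""
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a diagram", "a dog", "a cat", "a person"]
sims = [0.10, 0.20, 0.13, 0.11]  # hypothetical cosine similarities
probs = scaled_softmax(sims)

if __name__ == "__main__":
    for label, p in zip(labels, probs):
        print(f"{label:<10} {p:.4f}")
```

Here a lead of 0.07 in cosine similarity over the runner-up is already enough to push the top label above 99%.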

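6. Performance reference

Test conditions: a single 256×256 input, warmup=5, iterations=20; the table reports the mean, median, and p99 latency.

Task	Latency (mean)	Latency (median)	Latency (p99)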
`reconstruction_encode`	`26.564 ms`	`26.468 ms`	`27.765 ms`
`reconstruction_decode`	`17.277 ms`	`17.165 ms`	`17.900 ms`
`clip_image_feature`	`25.672 ms`	`25.701 ms`	`26.071 ms`
`clip_text_feature`	`13.529 ms`	`13.542 ms`	`13.674 ms`
`ssl_last_layer`	`25.356 ms`	`25.290 ms`	`25.869 ms`
`ssl_intermediate_4`	`25.907 ms`	`25.796 ms`	`26.492 ms`
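
The benchmark script itself is not included in this document. The following is a minimal sketch of a timing loop that matches the stated conditions (5 warmup runs, 20 timed runs), assuming a nearest-rank p99; how the original numbers were computed is not specified. The `sync` argument is a hook for a device synchronization call such as `torch.npu.synchronize`, which is needed because NPU kernels run asynchronously.

```python
import math
import statistics
import time


def benchmark(fn, sync=None, warmup=5, iterations=20):
    """Time fn() and return mean, median and p99 latency in milliseconds."""
    def run_once():
        fn()
        if sync is not None:
            sync()  # wait until queued device work has finished

    # Warmup runs absorb one-time costs such as first-run compilation.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Nearest-rank percentile; with 20 samples this is the slowest run.
    rank = max(1, math.ceil(0.99 * len(samples)))
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p99": samples[rank - 1],
    }


if __name__ == "__main__":
    print(benchmark(lambda: sum(range(10_000))))
```

For a model call, `fn` could be something like `lambda: model.get_clip_image_feature(image)` run inside `torch.no_grad()`, with `sync=torch.npu.synchronize`.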

Device memory usage:

Memory reserved after loading the model: about 1280 MB
Memory actually allocated: about 1132 MB
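
Figures of this kind can be read from the caching allocator counters that torch_npu exposes under `torch.npu`. The sketch below assumes the two numbers correspond to `memory_reserved` and `memory_allocated` and that "MB" means 1024 × 1024 bytes; the document confirms neither. The helper names are made up for this example.

```python
def to_mib(num_bytes):
    """Convert a byte count to MiB (1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


def npu_memory_mib():
    """Return reserved and allocated memory of the current NPU device in MiB.

    Requires torch and torch_npu; call it after the model has been moved to the NPU.
    """
    import torch
    import torch_npu  # noqa: F401  (importing it registers the "npu" backend)

    return {
        "reserved": to_mib(torch.npu.memory_reserved()),
        "allocated": to_mib(torch.npu.memory_allocated()),
    }
```

Reserved memory is what the allocator holds from the device, while allocated memory is the part currently backing live tensors, so the first value is never smaller than the second.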

7. Notes

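Device migration: the official example uses torch.autocast("cuda"). On Ascend NPU, remove autocast or run at the default precision; torch_npu adapts the operators automatically.
First-inference latency: because of CANN's graph compilation, the first forward pass is slow (about 10 to 20 s), after which inference settles at millisecond level. This is normal Ascend compilation behavior, not a performance problem.
Dependency versions: transformers >= 4.50 is recommended; older versions may fail to recognize the custom VTPModel architecture.
Weight format: the weights downloaded from AtomGit are stored as model.safetensors. transformers parses the file automatically on load, so no conversion is needed.
Multi-device scaling: validation so far covers a single NPU. For multi-device parallelism, see torch.nn.DataParallel or torch.distributed; Ascend NPU is compatible with the PyTorch distributed interfaces.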

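Ascend NPU accuracy evaluation

NPU vs CPU accuracy comparison (CPU is the baseline, NPU is the target under validation):

Metric	Value
Number of test cases	Not yet run
Maximum logits difference	Not yet run
Prediction agreement	Not yet run
Accuracy requirement	Maximum NPU vs CPU logits error < 1%
Accuracy conclusion	Not yet run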

The accuracy evaluation source code and logs are in the eval/ directory.
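
To make the metrics in the table concrete, the sketch below shows one way they could be computed from CPU and NPU logits. It is not the script in eval/. In particular, it assumes the "< 1%" requirement means the largest absolute difference divided by the largest absolute CPU logit, which is only one possible reading. Inputs are plain nested lists, for example from `tensor.cpu().tolist()`.

```python
def compare_logits(cpu_logits, npu_logits, tolerance=0.01):
    """Compare two batches of logits given as equally shaped lists of rows."""
    if not cpu_logits or len(cpu_logits) != len(npu_logits):
        raise ValueError("both inputs must be non-empty and have the same number of rows")

    max_abs_diff = 0.0
    max_abs_ref = 0.0
    agree = 0
    for cpu_row, npu_row in zip(cpu_logits, npu_logits):
        if len(cpu_row) != len(npu_row):
            raise ValueError("rows must have the same length")
        for c, n in zip(cpu_row, npu_row):
            max_abs_diff = max(max_abs_diff, abs(c - n))
            max_abs_ref = max(max_abs_ref, abs(c))
        # Prediction agreement: both backends pick the same top-1 class.
        if cpu_row.index(max(cpu_row)) == npu_row.index(max(npu_row)):
            agree += 1

    # Assumed definition of relative error (see the paragraph above the code).
    rel_error = max_abs_diff / max_abs_ref if max_abs_ref else max_abs_diff
    return {
        "cases": len(cpu_logits),
        "max_abs_diff": max_abs_diff,
        "max_rel_error": rel_error,
        "agreement": agree / len(cpu_logits),
        "passed": rel_error < tolerance,
    }


if __name__ == "__main__":
    cpu = [[10.0, 20.0], [5.0, 1.0]]   # made-up baseline logits
    npu = [[10.05, 19.9], [5.0, 1.0]]  # made-up NPU logits
    print(compare_logits(cpu, npu))
```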