MiniMax VTP-Small-f16d64 Ascend NPU 适配

本仓库包含基于 Vision Transformer (ViT) 的视觉 tokenizer 预训练模型 MiniMax-AI/VTP-Small-f16d64 的 Ascend NPU 适配。

模型概述

属性	值
模型	VTP-Small-f16d64
参数量	167.2M
架构	Vision Transformer + 像素解码器 + 文本 Transformer
输入尺寸	256 x 256
Patch 尺寸	16
瓶颈维度	64
原始来源	MiniMax-AI/VTP

Ascend NPU 兼容性

此适配使 VTP 模型能够使用 torch_npu 在 华为 Ascend NPU（如 Ascend 910B4）上运行。

代码变更

仅修改了 1 个文件以实现 NPU 兼容性：

vtp/models/utils/utils.py
- 将 torch.cuda.manual_seed_all(seed) 替换为支持 CUDA 和 Ascend NPU 的设备无关版本。

# Before (CUDA only)
torch.cuda.manual_seed_all(seed)

# After (CUDA + NPU compatible)
if hasattr(torch, 'cuda') and torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
if hasattr(torch, 'npu') and torch.npu.is_available():
    torch.npu.manual_seed_all(seed)

验证环境

组件	版本
硬件	Ascend 910B4 (32GB HBM)
CANN	8.5.1
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1

快速开始

1. 安装依赖项

pip install torch==2.9.0 torch_npu==2.9.0.post1 transformers omegaconf

2. 克隆原始模型代码

git clone https://github.com/MiniMax-AI/VTP.git vtp_repo

3. 应用NPU补丁

将 vtp_repo/vtp/models/utils/utils.py 替换为本仓库中的版本。

4. 下载模型权重

from atomgit_hub import snapshot_download
snapshot_download("MiniMax-AI/VTP-Small-f16d64", local_dir="./VTP-Small-f16d64")

5. 在NPU上运行推理

import sys
sys.path.insert(0, "./vtp_repo")

import torch
import torch_npu
from vtp.models.vtp_hf import VTPConfig, VTPModel

# Device
device = torch.device("npu:0")

# Load model
model_path = "./VTP-Small-f16d64"
config = VTPConfig.from_pretrained(model_path)
model = VTPModel.from_pretrained(model_path, config=config)
model = model.to(device).eval()

# Prepare input
dummy_image = torch.randn(1, 3, 256, 256).to(device)

# Feature extraction
features = model.get_last_layer_feature(dummy_image)
print("cls_token:", features["cls_token"].shape)
print("patch_tokens:", features["patch_tokens"].shape)

# Reconstruction
latents = model.get_reconstruction_latents(dummy_image)
reconstructed = model.get_latents_decoded_images(latents)
print("reconstructed:", reconstructed.shape)

# CLIP image feature
img_feat = model.get_clip_image_feature(dummy_image)
print("clip_feature:", img_feat.shape)

性能基准测试（Ascend 910B4）

功能	延迟（BS=1）	吞吐量
`get_last_layer_feature`	25.46 ms	39.3 img/s
`get_reconstruction_latents`	27.27 ms	36.7 img/s
`full_reconstruction`	66.77 ms	15.0 img/s
`get_clip_image_feature`	50.18 ms	19.9 img/s

批处理大小扩展

批处理大小	延迟	总吞吐量
1	58.37 ms	17.1 img/s
2	52.14 ms	38.4 img/s
4	59.74 ms	67.0 img/s
8	58.85 ms	135.9 img/s

内存使用情况

指标	值
模型权重	~645 MB
推理增量	~131 MB
总NPU内存	~777 MB
HBM利用率	2.4%

验证

所有核心功能均已在Ascend NPU上完成验证：

get_last_layer_feature - SSL特征提取
get_intermediate_layers_feature - 多层特征
get_reconstruction_latents - 瓶颈隐向量
get_latents_decoded_images - 图像重建
get_clip_image_feature - CLIP图像编码
forward（feature / rec模式）- 统一接口

完整验证报告详见 VTP-Small-f16d64-NPU-Report.md。

仓库结构

.
├── vtp/                              # Adapted model source code
│   ├── models/
│   │   ├── vtp_hf/                   # HuggingFace-compatible model
│   │   ├── encoders/                 # Vision & text encoders
│   │   ├── decoders/                 # Pixel decoder
│   │   ├── layers/                   # Attention, FFN, norms
│   │   └── utils/
│   │       └── utils.py              # NPU-compatible random seed (KEY CHANGE)
│   └── ...
├── VTP-Small-f16d64-NPU-Report.md    # Detailed verification report
├── vtp_npu_verify.py                 # Functional verification script
├── vtp_npu_benchmark.py              # Performance benchmark script
└── vtp_npu_benchmark_results.json    # Benchmark raw data

参考文献

许可证

本适配版本遵循与原始 VTP 模型相同的许可证。

昇腾 NPU 精度评测

NPU 与 CPU 精度对比（CPU 作为基线，NPU 作为验证目标）：

指标	数值
测试用例数	6（特征提取/重建/CLIP/中间层等）
最大 logits 差异	0.0（确定性检查 max_diff=0.0 ✅ 完全相同输入输出一致）
预测一致性	6/6 功能测试全部通过，CLIP 特征 L2 归一化 = 1.0000 ✅
精度要求	NPU 与 CPU 最大 logits 误差 < 1%（尚无 CPU 基线对比）
精度结论	✅ 推理功能与数值正确性通过（NPU 确定性验证 + 6 项功能测试全部 PASS）

精度评测源代码和日志详见 eval/ 目录。