VTP-Small-f16d64 on Ascend NPU

1. Introduction

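This document records the inference adaptation and verification results for the MiniMax VTP-Small-f16d64 visual tokenizer on Huawei Ascend NPUs (Ascend 910B).

VTP (Visual Tokenizer Pre-training) is a pre-training framework for visual generative models, open-sourced by MiniMax's Hailuo video team. It supports:

CLIP/SigLIP-style image-text contrastive learning
DINOv2-style self-supervised learning
Image reconstruction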

2. Verification Environment

Component	Version
`torch`	`2.9.0+cpu`
`torch-npu`	`2.9.0.post1+gitee7ba04`
`transformers`	`4.57.6`
`cann`	`8.5.1`

NPU: 1 logical device (Ascend 910B4, 32 GB HBM)
Model path: ./model_weights/MiniMax/VTP-Small-f16d64

3. Quick Inference

3.1 Environment Setup

# Install dependencies
pip install torch torchvision torch_npu transformers \
    omegaconf timm scipy torchmetrics pytorch-fid tqdm pillow modelscope

# Download the weights
python3 -c "from modelscope import snapshot_download; snapshot_download('MiniMax/VTP-Small-f16d64', cache_dir='./model_weights')"

# Clone the official VTP code
git clone https://github.com/MiniMax-AI/VTP.git vtp-repo

3.2 Single-Device Inference Check

python3 inference.py \
    --model_path ./model_weights/MiniMax/VTP-Small-f16d64

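Verification results:

Model loaded successfully
CLIP image feature extraction succeeded
Image reconstruction succeeded
NPU vs. CPU accuracy comparison passed (normalized max error < 1%)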

4. Smoke Test

Basic check (single device, fp32):

python3 inference.py --model_path ./model_weights/MiniMax/VTP-Small-f16d64 --quick

Expected behavior:

The device is auto-detected as npu:0
img_feat output shape: [1, 768]
rec_latents output shape: [1, 64, 16, 16]
rec_image output shape: [1, 3, 256, 256]
Inference takes about 450 ms

5. Performance Reference

Test conditions: batch_size=1, fp32, 256x256 input; 3 warmup runs followed by 3 timed runs.

Device	Latency per inference	Speedup vs. CPU
Ascend 910B (NPU)	`~450 ms`	~4.5x
CPU (ARM)	`~2000 ms`	1x (baseline)
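
The warmup-then-measure procedure described above can be sketched as follows. This is a minimal illustration, not the timing code in inference.py; the helper name `benchmark` and its `sync` parameter are assumptions introduced here. On an asynchronous device the timer should only be read after the queued work has finished, which is why a synchronization hook is included.

```python
import time
from statistics import mean


def benchmark(fn, warmup=3, runs=3, sync=None):
    """Return the mean latency of fn() in milliseconds.

    `sync` is an optional callable that blocks until the device is idle,
    for example torch.npu.synchronize on Ascend or torch.cuda.synchronize
    on a GPU. Leave it as None for CPU-only code.
    """

    def timed_call():
        if sync is not None:
            sync()  # make sure earlier work is not counted
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device before stopping the clock
        return (time.perf_counter() - start) * 1000.0

    # Warmup runs absorb one-off costs such as kernel compilation.
    for _ in range(warmup):
        fn()

    return mean(timed_call() for _ in range(runs))
```

With the table's settings, a call would look like `benchmark(run_once, warmup=3, runs=3, sync=torch.npu.synchronize)`, where `run_once` is a hypothetical function wrapping one forward pass.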

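6. Accuracy Evaluation

Using the same random seed, the NPU and CPU outputs are compared element by element.

Output	Normalized max error	Normalized mean error	PSNR	Result
`img_feat` (CLIP image features)	`0.031%`	`0.004%`	-	Pass
`rec_latents` (reconstruction latents)	`0.44%`	`0.042%`	-	Pass
`rec_image` (reconstructed image)	`0.77%`	`0.047%`	57.49 dB	Pass

Conclusion: the normalized max error is below 1% for every output, and the reconstructed image reaches a PSNR of 57.49 dB between the NPU and CPU results, which meets the visually lossless requirement.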
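
For reference, metrics of this kind can be computed roughly as in the sketch below. The document does not state how the errors are normalized, so dividing by the value range of the CPU reference is an assumption made here, as are the function names and the `data_range` default of 1.0 for PSNR. The actual definitions in inference.py may differ.

```python
import numpy as np


def normalized_errors(ref, out):
    """Return (max, mean) absolute error divided by the reference value range.

    Assumption: normalization uses max(ref) - min(ref).
    """
    ref = np.asarray(ref, dtype=np.float64)
    out = np.asarray(out, dtype=np.float64)
    value_range = ref.max() - ref.min()
    abs_err = np.abs(out - ref)
    return abs_err.max() / value_range, abs_err.mean() / value_range


def psnr(ref, out, data_range=1.0):
    """Peak signal-to-noise ratio in dB for signals spanning `data_range`."""
    ref = np.asarray(ref, dtype=np.float64)
    out = np.asarray(out, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(data_range ** 2 / mse)
```

To reproduce a table row, the CPU tensor would be passed as `ref` and the NPU tensor (moved to CPU and converted to a NumPy array) as `out`; multiply the returned errors by 100 to get percentages.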

7. NPU Adaptation Notes

7.1 Automatic Device Detection

npu_compat.py provides a unified device detection interface that prefers the NPU when one is available:

from npu_compat import get_device, adapt_model_for_npu

device = get_device()  # npu:0
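
How such a detection routine might work is sketched below. This is not the actual body of `get_device` in npu_compat.py; the name `detect_device` and the fallback order NPU, then CUDA, then CPU are assumptions for illustration. It relies on the fact that importing `torch_npu` registers the `torch.npu` namespace.

```python
def detect_device() -> str:
    """Pick a device string, preferring NPU, then CUDA, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: nothing else to probe

    try:
        import torch_npu  # noqa: F401  (importing registers torch.npu)

        if torch.npu.is_available():
            return "npu:0"
    except (ImportError, OSError, RuntimeError):
        # torch_npu is missing or the Ascend runtime failed to load.
        pass

    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"
```

The returned string can be passed directly to `model.to(...)` or `torch.device(...)`.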

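7.2 Key Adaptation Points

RoPE precision fix: RopePositionEmbedding in pixel_decoder defaults to bfloat16, and its numerical behavior differs between NPU and CPU. The adaptation script forces it to float32, which significantly reduces the reconstruction error (from 1.73% to 0.77%).
Device-aware autocast: the inference script detects npu / cuda / cpu automatically and passes the matching arguments to torch.amp.autocast.
DDP backend: in multi-device runs, hccl (Ascend) or nccl (GPU) is selected automatically.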
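
Two of the points above can be illustrated with a short sketch. It is not the code in npu_compat.py: the helper names are hypothetical, matching the module by class name is an assumption about how the RoPE layer can be located, and the `gloo` fallback for CPU is an assumption that the document does not mention. The first helper only needs an object exposing `modules()` and submodules exposing `float()`, which `torch.nn.Module` provides.

```python
def cast_rope_to_fp32(model, class_name="RopePositionEmbedding"):
    """Cast every submodule whose class is named `class_name` to float32.

    Returns the number of modules converted. For a torch.nn.Module,
    .float() converts floating-point parameters and buffers in place.
    """
    converted = 0
    for module in model.modules():
        if type(module).__name__ == class_name:
            module.float()
            converted += 1
    return converted


def pick_ddp_backend(device_type):
    """Map a device type to a torch.distributed backend name."""
    backends = {"npu": "hccl", "cuda": "nccl"}
    # Assumption: fall back to gloo for CPU or unknown device types.
    return backends.get(device_type, "gloo")
```

The device type string used for the backend lookup is the same one that `torch.amp.autocast(device_type=...)` expects, so a single detected value can drive both choices.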

8. ImageNet Reconstruction Evaluation

To run the full reconstruction evaluation (rFID / PSNR / LPIPS / SSIM) on the ImageNet validation set:

# Single NPU
python3 test_reconstruction_hf_npu.py \
    --model_path ./model_weights/MiniMax/VTP-Small-f16d64 \
    --data_path /path/to/imagenet/val \
    --precision bf16

# Multi-NPU DDP
torchrun --nproc_per_node=8 test_reconstruction_hf_npu.py \
    --model_path ./model_weights/MiniMax/VTP-Small-f16d64 \
    --data_path /path/to/imagenet/val \
    --use_ddp \
    --precision bf16

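9. Deliverables

File	Description
`inference.py`	NPU inference script (includes accuracy verification)
`npu_compat.py`	NPU device detection and model adaptation layer
`test_reconstruction_hf_npu.py`	ImageNet reconstruction evaluation script (NPU-adapted version)
`readme.md`	This document
`npu_inference_log.txt`	Self-verification run log
`npu_device_info.txt`	NPU device information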