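MingTok-Vision Model Card (NPU Adaptation)

This repository provides adapted inference, performance benchmarking, and accuracy validation scripts for running the inclusionAI/MingTok-Vision model on Huawei Ascend NPUs.

Original weights: https://huggingface.co/inclusionAI/MingTok-Vision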

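Model Overview

Property	Value
Architecture	MingTok (Continuous Unified Vision Tokenizer)
Parameters	~697M
Input size	[3, 512, 512]
Low-level encoder	depth=12, embedding dim=768, patch size=32
Semantic decoder	depth=24, embedding dim=1024, patch size=32
Pixel decoder	depth=24, embedding dim=1024, patch size=16
Output patch tokens	[B, 256, 1024]
Output latent	[B, 257, 32]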
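The 256 patch tokens in the table are consistent with a 512 x 512 input split into non-overlapping 32 x 32 patches. The sketch below only checks that arithmetic; it assumes standard ViT-style patching, and the helper name `patch_tokens` is hypothetical. The latent has 257 positions, one more than the patch grid; the source does not say what the extra position holds.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches covering a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# 512 / 32 = 16 patches per side, so a 16 x 16 grid of tokens
print(patch_tokens(512, 32))
```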

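Environment Setup

Prerequisites

Python >= 3.8
PyTorch >= 2.0
torch_npu (Ascend Extension for PyTorch)
transformers
torchvision
omegaconf
pillow
einops

Install dependencies

pip install torch torchvision transformers omegaconf pillow einops

Note: install the torch_npu build that matches your CANN version; see the official Ascend documentation.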
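Before running the scripts, it can help to confirm that every dependency is importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of this repository.

```python
import importlib.util

# Import names for the dependencies listed above; pillow is imported as PIL.
REQUIRED = ["torch", "torch_npu", "torchvision", "transformers",
            "omegaconf", "PIL", "einops"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the packages can be found; it does not verify that the torch_npu build matches the installed CANN version.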

Downloading the Model Weights

modelscope download --model inclusionAI/MingTok-Vision --local_dir ./weight

NPU Inference

Run single-image feature extraction on the NPU:

python3 inference.py --device npu --image assets/mingtok.png

Sample output

Loading model on NPU...
Model parameters: 697.7M
Loading image from assets/mingtok.png...
Input tensor shape: torch.Size([1, 3, 512, 512])
Patch tokens shape: torch.Size([1, 256, 1024])
Latent shape: torch.Size([1, 257, 32])
Patch tokens top-5 values: [-0.0935695  0.0514918  0.010856  -0.0589132  0.0265448]
Inference completed successfully!

The model is moved to the NPU with model.to("npu"), and torch.cuda.amp.autocast is lightly adapted for the NPU, so no monkey-patching of third-party library code is required.
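The source does not show how the autocast adaptation is implemented. One possible shape, given here purely as an assumption, is to replace the CUDA-specific context manager with the device-generic `torch.autocast`. The helper names `device_type_of` and `autocast_for` are hypothetical, and the sketch assumes that `torch_npu` has already been imported so that PyTorch recognises the `"npu"` device type.

```python
import contextlib


def device_type_of(device: str) -> str:
    """Strip an optional device index, e.g. "npu:0" becomes "npu"."""
    return device.split(":", 1)[0]


def autocast_for(device: str, enabled: bool = True):
    """Return an autocast context manager for the given device string.

    Assumption: when device is "npu", torch_npu was imported beforehand.
    """
    if not enabled:
        return contextlib.nullcontext()
    import torch  # imported lazily so device_type_of has no dependencies

    return torch.autocast(device_type=device_type_of(device))


# Hypothetical usage inside an inference script:
#     with autocast_for("npu"):
#         outputs = model(inputs)
```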

Performance Benchmark

Measure inference latency and throughput at different batch sizes:

python3 benchmark.py --device npu

Results on Ascend NPU

Batch Size	Mean Latency (ms)	Throughput (samples/s)
1	36.224	27.61
2	55.001	36.36
4	93.945	42.58
8	164.272	48.70
16	320.987	49.85
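The throughput column follows from the latency column as batch size divided by mean latency. The short sketch below recomputes it from the figures in the table; the function name `throughput` is hypothetical and not taken from `benchmark.py`.

```python
def throughput(batch_size: int, latency_ms: float) -> float:
    """Samples per second for one batch processed in latency_ms milliseconds."""
    return batch_size * 1000.0 / latency_ms


# (batch size, mean latency in ms) pairs copied from the table above
measurements = [(1, 36.224), (2, 55.001), (4, 93.945),
                (8, 164.272), (16, 320.987)]

for batch_size, latency_ms in measurements:
    print(f"{batch_size:>2}  {throughput(batch_size, latency_ms):.2f} samples/s")
```

Going from batch size 8 to 16 raises throughput only from 48.70 to 49.85 samples/s, which suggests the device is close to saturation at batch size 8.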

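Accuracy Validation

Compare NPU outputs against the CPU baseline:

python3 accuracy.py --batch_size 1

Validation results (NPU vs CPU, batch size = 1, 20 samples)

Metric	Value
Mean squared error (MSE)	0.000009
Mean absolute error (MAE)	0.000555
Range-normalized relative error (%)	0.006599
Cosine similarity	0.999989
Max absolute difference	0.072397

Conclusion: PASS. Range-normalized relative error = 0.0066% < 1%, cosine similarity = 0.999989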
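The metrics above can be sketched in plain Python as follows. This is an illustration, not the code in `accuracy.py`: the function name `compare_outputs` is hypothetical, and the definition of the range-normalized relative error (MAE divided by the value range of the reference output) is an assumption, since the source does not spell it out.

```python
import math


def compare_outputs(ref, out):
    """Compare two equal-length flat sequences of floats (reference vs test)."""
    if len(ref) != len(out) or len(ref) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(ref)
    diffs = [o - r for r, o in zip(ref, out)]
    mae = sum(abs(d) for d in diffs) / n
    value_range = max(ref) - min(ref)
    dot = sum(r * o for r, o in zip(ref, out))
    norms = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(o * o for o in out))
    return {
        "mse": sum(d * d for d in diffs) / n,
        "mae": mae,
        # Assumed definition: MAE as a percentage of the reference value range
        "range_rel_err_pct": 100.0 * mae / value_range if value_range > 0 else float("nan"),
        "cosine": dot / norms if norms > 0 else float("nan"),
        "max_abs_diff": max(abs(d) for d in diffs),
    }


# Toy data standing in for flattened CPU and NPU outputs
print(compare_outputs([0.0, 1.0, 2.0], [0.0, 1.0, 2.5]))
```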

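Files

File	Description
`inference.py`	Single-image feature extraction inference script for NPU
`benchmark.py`	Latency and throughput benchmark across batch sizes
`accuracy.py`	NPU vs CPU accuracy comparison
`mingtok/`	MingTok model source code (from the original repository)
`assets/mingtok.png`	Sample image for testing
`output/`	Run logs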

Citation

@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}
