
sam-vit-base Ascend NPU Deployment Guide

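Project Overview

sam-vit-base (Segment Anything Model, ViT-Base) is an image segmentation model developed by Meta AI. Given an input prompt such as a point or a box, it generates high-quality object masks. The model was trained on a dataset of 11 million images and 1.1 billion masks and shows strong zero-shot performance across a wide range of segmentation tasks.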

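Features

  • Accelerated inference on Ascend NPU
  • Precise segmentation masks generated from point or bounding-box prompts
  • CPU vs. NPU precision comparison test (error < 1%)
  • Outputs 256x256 segmentation masks
  • Compatible with HuggingFace transformers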

Requirements

  • Hardware: Huawei Ascend 910 series NPU
  • CANN: 8.0.RC1 or later
  • PyTorch: 2.8.0+ with torch_npu
  • Docker: container named test-modelagent
  • transformers: 4.29+
  • PIL (pillow), torchvision
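The Python-side requirements above can be checked before running anything on the NPU. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `check_environment`) are hypothetical, and it assumes the packages are installed under the distribution names `torch`, `torch_npu`, `transformers`, `pillow`, and `torchvision`. It does not check CANN or the hardware.

```python
from importlib import metadata

# Minimum versions taken from the requirements list above.
MIN_VERSIONS = {"torch": (2, 8), "transformers": (4, 29)}
# Packages that only need to be present (this guide states no minimum for them).
REQUIRED_ONLY = ["torch_npu", "pillow", "torchvision"]


def parse_version(text):
    """Turn '2.8.0+cpu' or '4.29.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_environment():
    """Return {package: problem description} for every unmet requirement."""
    problems = {}
    for name in list(MIN_VERSIONS) + REQUIRED_ONLY:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        minimum = MIN_VERSIONS.get(name)
        if minimum and parse_version(installed) < minimum:
            wanted = ".".join(map(str, minimum))
            problems[name] = f"found {installed}, need >= {wanted}"
    return problems


if __name__ == "__main__":
    for package, problem in check_environment().items():
        print(f"{package}: {problem}")
```

An empty result means every listed Python package was found at an acceptable version.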

Directory Structure

sam-vit-base-ascend/
├── inference.py          # Inference test script
├── test_mask.png         # Segmentation mask produced by the test
├── log.txt               # Precision test log
├── log_inference.txt     # Inference test log
└── README.md             # This document

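Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

3. Prepare the model files

The model files are located in /data/ysws/agentsp/5-15/sam-vit-base/:

  • model.safetensors - model weights (safetensors format)
  • config.json - model configuration
  • preprocessor_config.json - preprocessor configuration
  • pytorch_model.bin - model weights (PyTorch format)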
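Before loading the model, it can help to confirm that the files listed above are actually present. This is a small sketch, not part of inference.py; the function name `missing_model_files` is hypothetical, and it assumes that one of the two weight files is enough to load the model.

```python
from pathlib import Path

CONFIG_FILES = ["config.json", "preprocessor_config.json"]
# Assumption: either weight format is sufficient for loading.
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]


def missing_model_files(model_dir):
    """List missing config files, plus one entry if no weight file exists at all."""
    root = Path(model_dir)
    missing = [name for name in CONFIG_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

For example, `missing_model_files("/data/ysws/agentsp/5-15/sam-vit-base")` should return an empty list on a correctly prepared directory.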

Usage

Option 1: Standard inference mode

Run the inference script to segment an image:

cd /data/ysws/agentsp/5-15/sam-vit-base-ascend/

python3 inference.py --mode inference --device npu:0

Option 2: Precision test mode (CPU vs. NPU)

Run the precision comparison test to verify that the NPU results are consistent with the CPU results:

cd /data/ysws/agentsp/5-15/sam-vit-base-ascend/

python3 inference.py --mode precision_test

Command-Line Arguments

Argument    Description                               Default
--mode      Test mode: inference or precision_test    inference
--device    Device to run on                          npu:0 (auto-detected)
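The source of inference.py is not reproduced in this guide, so the following is only a sketch of how a command-line interface matching the two arguments above could be written. The function names (`detect_device`, `build_parser`, `parse_args`) and the fallback to `cpu` are assumptions, not the script's confirmed behavior.

```python
import argparse


def detect_device():
    """Return 'npu:0' when torch_npu reports an available NPU, otherwise 'cpu'."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the "npu" backend)
        if torch.npu.is_available():
            return "npu:0"
    except (ImportError, AttributeError, RuntimeError):
        pass
    return "cpu"


def build_parser():
    parser = argparse.ArgumentParser(description="sam-vit-base inference / precision test")
    parser.add_argument("--mode", choices=["inference", "precision_test"],
                        default="inference", help="which test to run")
    parser.add_argument("--device", default=None,
                        help="for example npu:0; auto-detected when omitted")
    return parser


def parse_args(argv=None):
    """Parse arguments and fill in the device when the user did not pass one."""
    args = build_parser().parse_args(argv)
    if args.device is None:
        args.device = detect_device()
    return args
```

Deferring the torch import to `detect_device` keeps `--help` fast and lets the parser work on machines without torch_npu installed.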

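Test Verification

Precision test results

Metric                              Measured    Threshold    Status
IoU max relative error              0.0152%     < 1.00%      PASS
IoU cosine similarity               1.000000    > 0.99       PASS
Pred masks max relative error       0.1077%     < 1.00%      PASS
Pred masks cosine similarity        1.000000    > 0.99       PASS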
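The exact formulas used by the precision test are not shown in this guide. In the precision log, the pred_masks max absolute error (1.959991e-02) divided by the max relative error (1.076624e-03) is about 18.2, which is consistent with normalizing by the largest absolute value of the CPU reference, so the sketch below assumes that definition. The function names are hypothetical, and the inputs are flat Python sequences (for a tensor, something like `tensor.flatten().tolist()`).

```python
import math


def max_relative_error(reference, candidate):
    """Assumed definition: max|ref - cand| divided by max|ref|."""
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    scale = max(abs(r) for r in reference)
    return max_abs_diff / scale if scale else max_abs_diff


def cosine_similarity(reference, candidate):
    """Cosine of the angle between the two flattened outputs (nan if either is all zeros)."""
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(sum(c * c for c in candidate))
    return dot / norm if norm else float("nan")


def passes(reference, candidate, rel_threshold=0.01, cos_threshold=0.99):
    """Apply the thresholds from the table above (< 1% error, > 0.99 cosine)."""
    return (max_relative_error(reference, candidate) < rel_threshold
            and cosine_similarity(reference, candidate) > cos_threshold)
```

Normalizing by the largest reference value avoids the blow-up that a per-element relative error would show wherever a mask logit is close to zero.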

Performance

Operation             Time
CPU inference time    58.590s
NPU inference time    6.154s
NPU speedup           ~9.5x
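The logs show one timed forward pass per device, so the NPU figure may include one-time initialization cost. The sketch below shows one way to take steadier measurements with warm-up runs and an explicit device synchronization; the helper names `timed` and `speedup` are hypothetical, and passing `torch.npu.synchronize` as `sync` assumes torch_npu is loaded.

```python
import time


def timed(fn, warmup=1, repeats=3, sync=None):
    """Mean wall-clock seconds of fn() over `repeats` runs, after `warmup` untimed runs.

    sync: optional callable that blocks until the device is idle,
    for example torch.npu.synchronize when torch_npu is loaded.
    """
    for _ in range(warmup):
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync:
        sync()
    return (time.perf_counter() - start) / repeats


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline time to accelerated time."""
    return baseline_seconds / accelerated_seconds


# Reproduces the speedup row from the table: 58.590 / 6.154
print(f"{speedup(58.590, 6.154):.2f}x")  # 9.52x
```

With a model loaded, a call might look like `timed(lambda: model(**inputs), sync=torch.npu.synchronize)` inside a `torch.no_grad()` block.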

Inference result example

Input                               Output shape                 Inference time
256x256 RGB image + point prompt    [1, 1, 3, 256, 256] masks    5.905s

Test Logs

Inference mode log (log_inference.txt)

2026-05-15 14:49:51,781 - INFO - ============================================================
2026-05-15 14:49:51,781 - INFO - sam-vit-base NPU 推理测试
2026-05-15 14:49:51,781 - INFO - ============================================================
2026-05-15 14:49:51,781 - INFO - Model dir: /data/ysws/agentsp/5-15/sam-vit-base
2026-05-15 14:49:51,781 - INFO - Output dir: /data/ysws/agentsp/5-15/sam-vit-base-ascend
2026-05-15 14:49:51,781 - INFO - NPU available: True
2026-05-15 14:49:51,781 - INFO - NPU device count: 8
2026-05-15 14:49:53,433 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:49:53,433 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:49:53,433 - INFO - ============================================================
2026-05-15 14:49:53,433 - INFO - Inference Test on npu:0
2026-05-15 14:49:53,433 - INFO - ============================================================
2026-05-15 14:49:58,822 - INFO - Device: npu:0
2026-05-15 14:49:58,823 - INFO - Loading processor...
2026-05-15 14:50:00,539 - INFO - Model loaded successfully
2026-05-15 14:50:00,541 - INFO - Test image size: (256, 256)
2026-05-15 14:50:00,613 - INFO - pixel_values shape: torch.Size([1, 3, 1024, 1024])
2026-05-15 14:50:00,613 - INFO - input_points shape: torch.Size([1, 1, 1, 2])
2026-05-15 14:50:06,518 - INFO - Inference time: 5.905s
2026-05-15 14:50:06,519 - INFO - Output type: <class 'transformers.models.sam.modeling_sam.SamImageSegmentationOutput'>
2026-05-15 14:50:06,519 - INFO - pred_masks shape: torch.Size([1, 1, 3, 256, 256])
2026-05-15 14:50:06,519 - INFO - iou_scores shape: torch.Size([1, 1, 3])
2026-05-15 14:50:07,792 - INFO - Post-processed masks: 1 mask(s)
2026-05-15 14:50:07,792 - INFO -   mask[0] shape: torch.Size([1, 3, 256, 256])
2026-05-15 14:50:07,805 - INFO - Saved mask to: /data/ysws/agentsp/5-15/sam-vit-base-ascend/test_mask.png
2026-05-15 14:50:07,808 - INFO - ============================================================
2026-05-15 14:50:07,808 - INFO - INFERENCE RESULT
2026-05-15 14:50:07,808 - INFO - ============================================================
2026-05-15 14:50:07,808 - INFO - Inference time: 5.905s
2026-05-15 14:50:07,808 - INFO - ============================================================
2026-05-15 14:50:07,808 - INFO - Test Complete!
2026-05-15 14:50:07,808 - INFO - ============================================================

Precision test mode log (log.txt)

2026-05-15 14:48:00,415 - INFO - ============================================================
2026-05-15 14:48:00,415 - INFO - sam-vit-base NPU 推理测试
2026-05-15 14:48:00,415 - INFO - ============================================================
2026-05-15 14:48:00,415 - INFO - Model dir: /data/ysws/agentsp/5-15/sam-vit-base
2026-05-15 14:48:00,415 - INFO - Output dir: /data/ysws/agentsp/5-15/sam-vit-base-ascend
2026-05-15 14:48:00,415 - INFO - NPU available: True
2026-05-15 14:48:00,416 - INFO - NPU device count: 8
2026-05-15 14:48:02,078 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:48:02,079 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:48:02,079 - INFO - ============================================================
2026-05-15 14:48:02,079 - INFO - Precision Test: CPU vs NPU (threshold: 1.0%)
2026-05-15 14:48:02,079 - INFO - ============================================================
2026-05-15 14:48:07,540 - INFO - Loading processor...
2026-05-15 14:48:07,550 - INFO - Loading model for CPU...
2026-05-15 14:48:07,846 - INFO - Loading model for NPU...
2026-05-15 14:48:09,272 - INFO - pixel_values shape: torch.Size([1, 3, 1024, 1024])
2026-05-15 14:48:09,274 - INFO - input_points shape: torch.Size([1, 1, 1, 2])
2026-05-15 14:48:09,274 - INFO - Running inference on CPU...
2026-05-15 14:49:07,892 - INFO - Running inference on NPU...
2026-05-15 14:49:15,263 - INFO - pred_masks CPU shape: (1, 1, 3, 256, 256)
2026-05-15 14:49:15,263 - INFO - pred_masks NPU shape: (1, 1, 3, 256, 256)
2026-05-15 14:49:15,263 - INFO - CPU inference time: 58.590s
2026-05-15 14:49:15,264 - INFO - NPU inference time: 6.154s
2026-05-15 14:49:15,268 - INFO - === IoU Scores Precision ===
2026-05-15 14:49:15,268 - INFO - IoU max relative error: 1.521992e-04 (0.0152%)
2026-05-15 14:49:15,268 - INFO - IoU cosine similarity: 1.000000
2026-05-15 14:49:15,268 - INFO - === Pred Masks Precision ===
2026-05-15 14:49:15,268 - INFO - Max absolute error: 1.959991e-02
2026-05-15 14:49:15,268 - INFO - Max relative error: 1.076624e-03 (0.1077%)
2026-05-15 14:49:15,268 - INFO - Mean relative error: 2.195446e-04 (0.0220%)
2026-05-15 14:49:15,269 - INFO - Cosine similarity: 1.000000 (-0.0000% angular error)
2026-05-15 14:49:15,269 - INFO - PASS: True (threshold: 1.0%)
2026-05-15 14:49:15,302 - INFO - ============================================================
2026-05-15 14:49:15,302 - INFO - PRECISION TEST RESULT
2026-05-15 14:49:15,302 - INFO - ============================================================
2026-05-15 14:49:15,302 - INFO - Relative error: 1.076624e-03
2026-05-15 14:49:15,302 - INFO - CPU time: 58.590s
2026-05-15 14:49:15,303 - INFO - NPU time: 6.154s
2026-05-15 14:49:15,303 - INFO - PASS: True
2026-05-15 14:49:15,303 - INFO - ============================================================
2026-05-15 14:49:15,303 - INFO - Test Complete!
2026-05-15 14:49:15,303 - INFO - ============================================================
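The numbers in the tables above can be pulled out of log.txt automatically. This is a minimal sketch that relies only on the line formats visible in the precision log; the function name `summarize_log` and the keys of the returned dictionary are hypothetical.

```python
import re

# Matches lines such as "CPU inference time: 58.590s".
TIME_PATTERN = re.compile(r"(CPU|NPU) inference time: ([0-9.]+)s")
# Matches the pred_masks line "Max relative error: 1.076624e-03 (0.1077%)".
# The IoU line uses a lowercase "max", so it is not matched here.
ERROR_PATTERN = re.compile(r"Max relative error: ([0-9.eE+-]+)")


def summarize_log(text):
    """Extract timings and the pred_masks max relative error from precision-test log text."""
    summary = {}
    for device, seconds in TIME_PATTERN.findall(text):
        summary[f"{device.lower()}_seconds"] = float(seconds)
    match = ERROR_PATTERN.search(text)
    if match:
        summary["max_relative_error"] = float(match.group(1))
    if "cpu_seconds" in summary and "npu_seconds" in summary:
        summary["speedup"] = summary["cpu_seconds"] / summary["npu_seconds"]
    return summary
```

Typical use would be `summarize_log(open("log.txt", encoding="utf-8").read())`.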

Python API Usage Examples

Basic inference

import numpy as np
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from PIL import Image
from transformers import SamModel, SamProcessor

MODEL_DIR = "/data/ysws/agentsp/5-15/sam-vit-base"

# Load the processor and model, then move the model to the first NPU.
processor = SamProcessor.from_pretrained(MODEL_DIR)
model = SamModel.from_pretrained(MODEL_DIR)
model = model.to("npu:0")
model.eval()

# Random 256x256 RGB test image with a single point prompt at its center.
raw_image = Image.fromarray(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
input_points = [[[128, 128]]]  # [image][point][x, y]

inputs = processor(raw_image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("npu:0") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Resize the predicted masks back to the original image size on the CPU.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)
print(f"Masks shape: {masks[0].shape}")  # [1, 3, 256, 256]

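Point-prompt segmentation

# (x, y) in pixels of the original image. This point assumes an image larger
# than the 256x256 test image above; pick a point inside your own image.
input_points = [[[450, 600]]]
inputs = processor(raw_image, input_points=input_points, return_tensors="pt")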

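Box-prompt segmentation

# Example box in original-image pixels (illustrative values; replace with your own).
x1, y1, x2, y2 = 64, 64, 192, 192
input_boxes = [[[x1, y1, x2, y2]]]  # [image][box][x1, y1, x2, y2]
inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt")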
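Prompts are given in the pixel coordinates of the original image, so a point or box outside the image is an easy mistake to make. The helpers below are a hypothetical sketch (they are not part of transformers or of inference.py) that checks a point and tidies up a box before it is passed to the processor.

```python
def validate_point(point, width, height):
    """Return (x, y) if it lies inside a width x height image, else raise ValueError."""
    x, y = point
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"point {point} is outside a {width}x{height} image")
    return x, y


def normalize_box(box, width, height):
    """Order the corners so x1 <= x2 and y1 <= y2, then clip the box to the image."""
    xa, ya, xb, yb = box
    x1, x2 = sorted((xa, xb))
    y1, y2 = sorted((ya, yb))
    x1, x2 = max(0, x1), min(width - 1, x2)
    y1, y2 = max(0, y1), min(height - 1, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError(f"box {box} has no area inside a {width}x{height} image")
    return [x1, y1, x2, y2]
```

For a PIL image, `width, height = raw_image.size` gives the values to pass in, and the result can be wrapped as `[[normalize_box(box, width, height)]]` for `input_boxes`.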

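Saving a segmentation mask

import numpy as np
from PIL import Image

# masks[0] has shape [1, 3, H, W]; take the first of the three candidate masks.
mask_output = masks[0][0, 0].cpu().numpy()
# Scale the binary mask to 0/255 so it can be saved as an 8-bit grayscale image.
mask_img = Image.fromarray((mask_output * 255).astype(np.uint8))
mask_img.save("output_mask.png")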

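Model Architecture

The SAM model consists of three main modules:

  • VisionEncoder: a ViT-based image encoder that uses attention to compute image embeddings
  • PromptEncoder: generates embeddings for points and bounding boxes
  • MaskDecoder: a two-way transformer that performs cross-attention between the image embeddings and the point embeddings

Component         Description
vision_encoder    ViT-Base, 12 layers, hidden size 768
prompt_encoder    Hidden size 256, 4 point embeddings
mask_decoder      2-layer transformer, outputs 256x256 masks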

Inference Parameters

Key parameters extracted from config.json (nested keys are written in dotted form here):

{
  "vision_config.hidden_size": 768,
  "vision_config.num_hidden_layers": 12,
  "vision_config.num_attention_heads": 12,
  "vision_config.image_size": 1024,
  "vision_config.patch_size": 16,
  "prompt_encoder_config.hidden_size": 256,
  "prompt_encoder_config.image_embedding_size": 64,
  "mask_decoder_config.hidden_size": 256,
  "mask_decoder_config.num_hidden_layers": 2
}
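The shapes seen in the logs follow from these parameters. The arithmetic below is a short sketch of that relationship; the 4x upscaling factor of the mask decoder is not listed in the excerpt above and is stated here as an assumption based on the SAM architecture.

```python
image_size = 1024        # vision_config.image_size
patch_size = 16          # vision_config.patch_size
hidden_size = 768        # vision_config.hidden_size
num_heads = 12           # vision_config.num_attention_heads

# Each 16x16 patch becomes one token, giving a 64x64 grid of image embeddings.
embedding_size = image_size // patch_size       # 64, matches image_embedding_size
num_patches = embedding_size ** 2               # 4096 tokens per image

# Per-head width in the vision encoder's attention layers.
head_dim = hidden_size // num_heads             # 64

# Assumption: the mask decoder upsamples the embedding grid by a factor of 4.
mask_upscale = 4
low_res_mask_size = embedding_size * mask_upscale   # 256, matches pred_masks

print(embedding_size, num_patches, head_dim, low_res_mask_size)
```

This is why pred_masks is always 256x256 regardless of the input image size; post_process_masks then resizes it to the original resolution.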

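FAQ

Q: How can I improve inference speed?

A: In the test above, NPU inference was about 9.5x faster than CPU inference. Batching point prompts (for example, the points_per_batch parameter of the transformers mask-generation pipeline) can raise throughput further.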

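Q: What is the format of the segmentation masks?

A: The output masks have shape [batch, point_batch, num_masks, height, width]. point_batch is 1 for a single prompt, and num_masks is usually 3 (three alternative mask predictions per prompt).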
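Since the model returns three candidate masks per prompt along with a predicted IoU for each, a common follow-up is to keep the highest-scoring one. The helper below is a hypothetical sketch operating on a plain list of scores; the example scores in the usage note are made up for illustration.

```python
def best_mask_index(iou_scores):
    """Index of the highest predicted IoU among the candidate masks for one prompt."""
    return max(range(len(iou_scores)), key=lambda i: iou_scores[i])


# Intended use with the objects from the basic inference example:
#   scores = outputs.iou_scores[0, 0].tolist()        # three floats
#   best = masks[0][0, best_mask_index(scores)]       # [height, width]
```

For example, `best_mask_index([0.81, 0.93, 0.88])` selects the second mask (index 1).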

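Q: How are large images handled?

A: The SAM preprocessor automatically resizes each image so that its longest edge is 1024 pixels. The size parameter in preprocessor_config.json controls this, but it should generally stay consistent with vision_config.image_size (1024), which is the input size the model expects.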
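To see what the resize does to a given image, the arithmetic can be reproduced by hand. This is a sketch under the assumption that the longest edge is scaled to 1024 with the aspect ratio preserved and rounding to the nearest pixel; the function names are hypothetical. The processor performs this step itself, so prompts are still passed in original-image coordinates.

```python
def resized_shape(height, width, longest_edge=1024):
    """(height, width) after scaling so the longest edge equals `longest_edge`."""
    scale = longest_edge / max(height, width)
    return int(height * scale + 0.5), int(width * scale + 0.5)


def scale_point(point, height, width, longest_edge=1024):
    """Map an (x, y) point from original-image pixels to resized-image pixels."""
    new_height, new_width = resized_shape(height, width, longest_edge)
    x, y = point
    return x * new_width / width, y * new_height / height


print(resized_shape(256, 256))   # (1024, 1024), the test image in the logs
print(resized_shape(480, 640))   # (768, 1024)
```

A non-square result such as 768x1024 is then padded to 1024x1024, which is why the logged pixel_values shape is [1, 3, 1024, 1024].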

References

  • Original model: https://huggingface.co/facebook/sam-vit-base
  • SAM paper: https://arxiv.org/abs/2304.02643
  • Official code: https://github.com/facebookresearch/segment-anything
  • HuggingFace Transformers: https://huggingface.co/transformers

License

This project is licensed under the Apache-2.0 license.