
sam-vit-base Ascend NPU Deployment Guide

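Project Overview

sam-vit-base (Segment Anything Model, ViT-Base) is an image segmentation model developed by Meta AI. Given input prompts such as points or boxes, it generates high-quality object masks. The model was trained on a dataset of 11 million images and 1.1 billion masks and shows strong zero-shot performance across a wide range of segmentation tasks.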

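Features

Inference acceleration on Ascend NPU
Accurate segmentation masks from point or bounding-box prompts
CPU vs. NPU precision comparison test (relative error < 1%)
256x256 segmentation mask output
Compatible with HuggingFace transformers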

Requirements

Hardware: Huawei Ascend 910 series NPU
CANN: 8.0.RC1 or later
PyTorch: 2.8.0+ with torch_npu
Docker: container named test-modelagent
transformers: 4.29+
PIL (pillow), torchvision
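
The Python-side requirements above can be checked before running anything on the NPU. The following is a minimal sketch, not part of this project: the helper names (`parse_version`, `meets_minimum`, `check_requirements`) are hypothetical, and the comparison looks only at the leading numeric part of each version string, so local suffixes such as `+cpu` are ignored.

```python
import re
from importlib import metadata

# Minimum versions taken from the requirements list above.
MIN_VERSIONS = {"torch": "2.8.0", "transformers": "4.29"}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Compare two version strings on their numeric components only."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that "2.8" compares equal to "2.8.0".
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(min_versions=MIN_VERSIONS):
    """Map each package name to (installed_version_or_None, requirement_met)."""
    report = {}
    for name, minimum in min_versions.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and meets_minimum(installed, minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        print(f"{name}: {installed or 'not installed'} -> {'OK' if ok else 'CHECK'}")
```

The CANN version is not a Python package, so it is not covered by this check.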

Directory Structure

sam-vit-base-ascend/
├── inference.py           # inference test script
├── test_mask.png          # sample segmentation mask output
├── log.txt                # precision test log
├── log_inference.txt      # inference test log
└── README.md              # this document

Deployment Steps

1. Enter the container

docker exec -it test-modelagent bash

2. Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

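3. Prepare the model files

The model files are located in /data/ysws/agentsp/5-15/sam-vit-base/:

model.safetensors - model weights (safetensors format)
config.json - model configuration
preprocessor_config.json - preprocessor configuration
pytorch_model.bin - model weights (PyTorch format)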
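
A quick pre-flight check can confirm that the model directory contains the files listed above. This is a sketch under one assumption: that either weight file alone is enough to load the model, so only one of the two is required. The function name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# Files that must always be present.
REQUIRED_FILES = ("config.json", "preprocessor_config.json")
# Assumption: one of these weight files is sufficient.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return a list of missing items; an empty list means the directory looks usable."""
    root = Path(model_dir)
    problems = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append(" or ".join(WEIGHT_FILES))
    return problems


if __name__ == "__main__":
    missing = missing_model_files("/data/ysws/agentsp/5-15/sam-vit-base")
    print("Model directory OK" if not missing else f"Missing: {missing}")
```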

Usage

Option 1: Standard inference mode

Run the inference script to segment an image:

cd /data/ysws/agentsp/5-15/sam-vit-base-ascend/

python3 inference.py --mode inference --device npu:0

Option 2: Precision test mode (CPU vs NPU)

Run the precision comparison test to verify that the NPU results are consistent with the CPU results:

cd /data/ysws/agentsp/5-15/sam-vit-base-ascend/

python3 inference.py --mode precision_test

Command-Line Arguments

Argument	Description	Default
`--mode`	Test mode: inference or precision_test	`inference`
`--device`	Device to run on	`npu:0` (auto-detected)
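
For readers who want to reproduce the interface, the two arguments in the table could be declared with `argparse` roughly as follows. This is an illustrative sketch, not the actual contents of inference.py. The function name `build_parser` is hypothetical, and the device auto-detection mentioned in the table is not modelled here.

```python
import argparse


def build_parser():
    """Build a parser exposing the two documented options."""
    parser = argparse.ArgumentParser(
        description="sam-vit-base inference / precision test (sketch)"
    )
    parser.add_argument(
        "--mode",
        choices=["inference", "precision_test"],
        default="inference",
        help="inference: run segmentation once; precision_test: compare CPU and NPU outputs",
    )
    parser.add_argument(
        "--device",
        default="npu:0",
        help="device string passed to PyTorch, for example npu:0 or cpu",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"mode={args.mode} device={args.device}")
```

Restricting `--mode` with `choices` makes a typo such as `--mode precision` fail immediately instead of silently falling back to a default.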

Testing and Validation

Precision test results

Metric	Measured	Threshold	Status
IoU max relative error	0.0152%	< 1.00%	PASS
IoU cosine similarity	1.000000	> 0.99	PASS
Pred Masks max relative error	0.1077%	< 1.00%	PASS
Pred Masks cosine similarity	1.000000	> 0.99	PASS
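
The two kinds of metric in the table can be sketched in plain Python. The exact formulas used by inference.py are not shown in this document, so the definitions below are assumptions: relative error is taken as the largest absolute difference divided by the largest reference magnitude, and cosine similarity is computed over the flattened values. The thresholds (1% and 0.99) come from the table.

```python
import math


def max_relative_error(ref, test):
    """Largest absolute difference, normalised by the largest reference magnitude."""
    max_abs_err = max(abs(r - t) for r, t in zip(ref, test))
    scale = max(abs(r) for r in ref)
    return max_abs_err / scale if scale else max_abs_err


def cosine_similarity(ref, test):
    """Cosine of the angle between two flattened outputs."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in test))
    return dot / norm if norm else 0.0


def passes(ref, test, rel_threshold=0.01, cos_threshold=0.99):
    """Apply both thresholds from the results table."""
    return (
        max_relative_error(ref, test) < rel_threshold
        and cosine_similarity(ref, test) > cos_threshold
    )


if __name__ == "__main__":
    cpu_out = [1.0, -2.0, 4.0]   # made-up reference values
    npu_out = [1.0, -2.0, 3.98]  # made-up values with a small deviation
    print(f"relative error: {max_relative_error(cpu_out, npu_out):.4%}")
    print(f"cosine similarity: {cosine_similarity(cpu_out, npu_out):.6f}")
    print(f"pass: {passes(cpu_out, npu_out)}")
```

With real tensors, the inputs would be the flattened CPU and NPU outputs (for example via `tensor.flatten().tolist()`).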

Performance

Operation	Time
CPU inference time	58.590s
NPU inference time	6.154s
NPU speedup	~9.5x
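
The speedup figure is simply the ratio of the two measured times. As a worked check, using the numbers from the table (the function name `speedup` is only illustrative):

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline time to accelerated time."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


cpu_time, npu_time = 58.590, 6.154  # values from the table above
print(f"NPU speedup: {speedup(cpu_time, npu_time):.2f}x")  # prints "NPU speedup: 9.52x"
```

Both times come from a single run each, so the ratio should be read as indicative rather than as a benchmark.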

Inference result example

Input	Output shape	Inference time
256x256 RGB image + point prompt	[1, 1, 3, 256, 256] masks	5.905s

Test Logs

Inference mode log (log_inference.txt)

2026-05-15 14:49:51,781 - INFO - ============================================================
2026-05-15 14:49:51,781 - INFO - sam-vit-base NPU 推理测试
2026-05-15 14:49:51,781 - INFO - ============================================================
2026-05-15 14:49:51,781 - INFO - Model dir: /data/ysws/agentsp/5-15/sam-vit-base
2026-05-15 14:49:51,781 - INFO - Output dir: /data/ysws/agentsp/5-15/sam-vit-base-ascend
2026-05-15 14:49:51,781 - INFO - NPU available: True
2026-05-15 14:49:51,781 - INFO - NPU device count: 8
2026-05-15 14:49:53,433 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:49:53,433 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:49:53,433 - INFO - ============================================================
2026-05-15 14:49:53,433 - INFO - Inference Test on npu:0
2026-05-15 14:49:53,433 - INFO - ============================================================
2026-05-15 14:49:58,822 - INFO - Device: npu:0
2026-05-15 14:49:58,823 - INFO - Loading processor...
2026-05-15 14:50:00,539 - INFO - Model loaded successfully
2026-05-15 14:50:00,541 - INFO - Test image size: (256, 256)
2026-05-15 14:50:00,613 - INFO - pixel_values shape: torch.Size([1, 3, 1024, 1024])
2026-05-15 14:50:00,613 - INFO - input_points shape: torch.Size([1, 1, 1, 2])
2026-05-15 14:50:06,518 - INFO - Inference time: 5.905s
2026-05-15 14:50:06,519 - INFO - Output type: <class 'transformers.models.sam.modeling_sam.SamImageSegmentationOutput'>
2026-05-15 14:50:06,519 - INFO - pred_masks shape: torch.Size([1, 1, 3, 256, 256])
2026-05-15 14:50:06,519 - INFO - iou_scores shape: torch.Size([1, 1, 3])
2026-05-15 14:50:07,792 - INFO - Post-processed masks: 1 mask(s)
2026-05-15 14:50:07,792 - INFO -   mask[0] shape: torch.Size([1, 3, 256, 256])
2026-05-15 14:50:07,805 - INFO - Saved mask to: /data/ysws/agentsp/5-15/sam-vit-base-ascend/test_mask.png
2026-05-15 14:50:07,808 - INFO - ============================================================
2026-05-15 14:50:07,808 - INFO - INFERENCE RESULT
2026-05-15 14:50:07,808 - INFO - ============================================================
2026-05-15 14:50:07,808 - INFO - Inference time: 5.905s
2026-05-15 14:50:07,808 - INFO - ============================================================
2026-05-15 14:50:07,808 - INFO - Test Complete!
2026-05-15 14:50:07,808 - INFO - ============================================================

Precision test mode log (log.txt)

2026-05-15 14:48:00,415 - INFO - ============================================================
2026-05-15 14:48:00,415 - INFO - sam-vit-base NPU 推理测试
2026-05-15 14:48:00,415 - INFO - ============================================================
2026-05-15 14:48:00,415 - INFO - Model dir: /data/ysws/agentsp/5-15/sam-vit-base
2026-05-15 14:48:00,415 - INFO - Output dir: /data/ysws/agentsp/5-15/sam-vit-base-ascend
2026-05-15 14:48:00,415 - INFO - NPU available: True
2026-05-15 14:48:00,416 - INFO - NPU device count: 8
2026-05-15 14:48:02,078 - INFO - NPU 0: Ascend910B3, total_memory=61.0GB
2026-05-15 14:48:02,079 - INFO - NPU 1: Ascend910B3, total_memory=61.0GB
2026-05-15 14:48:02,079 - INFO - ============================================================
2026-05-15 14:48:02,079 - INFO - Precision Test: CPU vs NPU (threshold: 1.0%)
2026-05-15 14:48:02,079 - INFO - ============================================================
2026-05-15 14:48:07,540 - INFO - Loading processor...
2026-05-15 14:48:07,550 - INFO - Loading model for CPU...
2026-05-15 14:48:07,846 - INFO - Loading model for NPU...
2026-05-15 14:48:09,272 - INFO - pixel_values shape: torch.Size([1, 3, 1024, 1024])
2026-05-15 14:48:09,274 - INFO - input_points shape: torch.Size([1, 1, 1, 2])
2026-05-15 14:48:09,274 - INFO - Running inference on CPU...
2026-05-15 14:49:07,892 - INFO - Running inference on NPU...
2026-05-15 14:49:15,263 - INFO - pred_masks CPU shape: (1, 1, 3, 256, 256)
2026-05-15 14:49:15,263 - INFO - pred_masks NPU shape: (1, 1, 3, 256, 256)
2026-05-15 14:49:15,263 - INFO - CPU inference time: 58.590s
2026-05-15 14:49:15,264 - INFO - NPU inference time: 6.154s
2026-05-15 14:49:15,268 - INFO - === IoU Scores Precision ===
2026-05-15 14:49:15,268 - INFO - IoU max relative error: 1.521992e-04 (0.0152%)
2026-05-15 14:49:15,268 - INFO - IoU cosine similarity: 1.000000
2026-05-15 14:49:15,268 - INFO - === Pred Masks Precision ===
2026-05-15 14:49:15,268 - INFO - Max absolute error: 1.959991e-02
2026-05-15 14:49:15,268 - INFO - Max relative error: 1.076624e-03 (0.1077%)
2026-05-15 14:49:15,268 - INFO - Mean relative error: 2.195446e-04 (0.0220%)
2026-05-15 14:49:15,269 - INFO - Cosine similarity: 1.000000 (-0.0000% angular error)
2026-05-15 14:49:15,269 - INFO - PASS: True (threshold: 1.0%)
2026-05-15 14:49:15,302 - INFO - ============================================================
2026-05-15 14:49:15,302 - INFO - PRECISION TEST RESULT
2026-05-15 14:49:15,302 - INFO - ============================================================
2026-05-15 14:49:15,302 - INFO - Relative error: 1.076624e-03
2026-05-15 14:49:15,302 - INFO - CPU time: 58.590s
2026-05-15 14:49:15,303 - INFO - NPU time: 6.154s
2026-05-15 14:49:15,303 - INFO - PASS: True
2026-05-15 14:49:15,303 - INFO - ============================================================
2026-05-15 14:49:15,303 - INFO - Test Complete!
2026-05-15 14:49:15,303 - INFO - ============================================================
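
The key numbers in log.txt can be pulled out programmatically, for example to track results across runs. The sketch below assumes the line format shown above (`timestamp - LEVEL - message`) and only recognises the `CPU inference time`, `NPU inference time`, and `PASS` messages. The function name `summarize_log` is hypothetical.

```python
import re

# "2026-05-15 14:49:15,263 - INFO - <message>"
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} - (?P<level>[A-Z]+) - (?P<msg>.*)$"
)
TIMING = re.compile(r"^(?P<device>CPU|NPU) inference time: (?P<seconds>\d+(?:\.\d+)?)s$")
VERDICT = re.compile(r"^PASS: (?P<value>True|False)\b")


def summarize_log(lines):
    """Collect CPU/NPU timings and the final PASS verdict from log lines."""
    summary = {}
    for line in lines:
        entry = LOG_LINE.match(line.strip())
        if entry is None:
            continue  # skip anything that is not a log record
        message = entry.group("msg").strip()
        timing = TIMING.match(message)
        if timing:
            key = timing.group("device").lower() + "_s"
            summary[key] = float(timing.group("seconds"))
        verdict = VERDICT.match(message)
        if verdict:
            summary["passed"] = verdict.group("value") == "True"
    return summary


if __name__ == "__main__":
    with open("log.txt", encoding="utf-8") as handle:
        print(summarize_log(handle))
```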

Python API Examples

Basic inference

import numpy as np
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from PIL import Image
from transformers import SamModel, SamProcessor

MODEL_DIR = "/data/ysws/agentsp/5-15/sam-vit-base"

# Load the processor and model, then move the model to the NPU.
processor = SamProcessor.from_pretrained(MODEL_DIR)
model = SamModel.from_pretrained(MODEL_DIR)
model = model.to("npu:0")
model.eval()

# Random 256x256 RGB test image and a single point prompt at its centre.
raw_image = Image.fromarray(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
input_points = [[[128, 128]]]

# Preprocess on the CPU, then move every tensor input to the NPU.
inputs = processor(raw_image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("npu:0") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted masks to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)
print(f"Masks shape: {masks[0].shape}")  # [1, 3, 256, 256]

Point-prompt segmentation

input_points = [[[450, 600]]]  # one (x, y) point in pixel coordinates of the input image
inputs = processor(raw_image, input_points=input_points, return_tensors="pt")

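Box-prompt segmentation

x1, y1, x2, y2 = 64, 64, 192, 192  # illustrative corners (top-left, bottom-right); replace with your own box
input_boxes = [[[x1, y1, x2, y2]]]  # one bounding box for one image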
inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt")

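Saving the segmentation mask

import numpy as np
from PIL import Image

# masks[0] has shape [1, 3, H, W]; take the first of the three candidate masks.
mask_output = masks[0][0, 0].cpu().numpy()
# Convert the mask to an 8-bit grayscale image (0 or 255).
mask_img = Image.fromarray((mask_output * 255).astype(np.uint8))
mask_img.save("output_mask.png")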

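Model Architecture

The SAM model consists of three main modules:

VisionEncoder: a ViT-based image encoder that uses attention to compute image embeddings
PromptEncoder: generates embeddings for points and bounding boxes
MaskDecoder: a two-way transformer that performs cross-attention between the image embeddings and the point embeddings

Component	Description
vision_encoder	ViT-Base, 12 layers, hidden size 768
prompt_encoder	hidden size 256, 4 point embeddings
mask_decoder	2-layer transformer, outputs 256x256 masks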

Inference Configuration

Key parameters extracted from config.json (nested keys shown in dotted form):

{
  "vision_config.hidden_size": 768,
  "vision_config.num_hidden_layers": 12,
  "vision_config.num_attention_heads": 12,
  "vision_config.image_size": 1024,
  "vision_config.patch_size": 16,
  "prompt_encoder_config.hidden_size": 256,
  "prompt_encoder_config.image_embedding_size": 64,
  "mask_decoder_config.hidden_size": 256,
  "mask_decoder_config.num_hidden_layers": 2
}
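
The dotted keys above are a flattened view of the nested sections in config.json, and some of the values are related: the image embedding grid is the image size divided by the patch size (1024 / 16 = 64). A small sketch of both points follows. The `flatten` helper is hypothetical, and the `nested` dictionary holds only a subset of the values listed above.

```python
def flatten(config, prefix=""):
    """Flatten nested dictionaries into a single dict with dotted keys."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Subset of the values listed above, in nested form.
nested = {
    "vision_config": {"image_size": 1024, "patch_size": 16},
    "prompt_encoder_config": {"image_embedding_size": 64},
}

flat = flatten(nested)
grid = flat["vision_config.image_size"] // flat["vision_config.patch_size"]
assert grid == flat["prompt_encoder_config.image_embedding_size"]
print(f"embedding grid: {grid}x{grid} = {grid * grid} patches")
# prints "embedding grid: 64x64 = 4096 patches"
```

To apply this to the real file, load it with `json.load` and pass the resulting dictionary to `flatten`.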

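FAQ

Q: How can I speed up inference?

A: In the test above, NPU inference is about 9.5x faster than CPU inference. Batching point prompts (the points_per_batch parameter of the transformers mask-generation pipeline) can further increase throughput.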

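Q: What is the format of the segmentation masks?

A: The output masks have shape [batch, 1, num_masks, height, width]. The second dimension is 1 here because a single prompt is used per image, and num_masks is usually 3 (three alternative mask predictions).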
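
Since the model returns several candidate masks together with a predicted IoU score for each (the `iou_scores` output with shape [1, 1, 3] in the logs above), a common follow-up step is to keep the highest-scoring candidate. The sketch below works on plain Python lists; the helper names and the score values are made up for illustration.

```python
def best_mask_index(iou_scores):
    """Index of the highest score in a flat list of per-mask IoU predictions."""
    if not iou_scores:
        raise ValueError("iou_scores must not be empty")
    return max(range(len(iou_scores)), key=lambda i: iou_scores[i])


def select_best(masks, iou_scores):
    """Return the candidate mask whose predicted IoU is highest."""
    return masks[best_mask_index(iou_scores)]


scores = [0.87, 0.95, 0.91]                  # made-up scores for three candidates
candidates = ["mask_0", "mask_1", "mask_2"]  # stand-ins for three H x W masks
print(select_best(candidates, scores))       # prints "mask_1"
```

With the tensors from the basic inference example, the scores for the single prompt could be obtained with something like `outputs.iou_scores[0, 0].tolist()`, and the candidates would be `masks[0][0]`.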

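Q: How are large images handled?

A: The SAM preprocessor automatically resizes each image so that its longest side is 1024 pixels. The target size can be changed through the size parameter in preprocessor_config.json. Note that vision_config.image_size in config.json is also 1024, so the two values should be kept consistent.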
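
The resize rule can be illustrated with a few lines of arithmetic. This sketch assumes the longest-side rule described above with rounding to the nearest pixel; it does not call the actual preprocessor. The function name `resized_shape` is hypothetical.

```python
def resized_shape(height, width, longest_edge=1024):
    """Scale (height, width) so the longer side equals longest_edge, keeping the aspect ratio."""
    scale = longest_edge / max(height, width)
    # Round to the nearest integer pixel count.
    return int(height * scale + 0.5), int(width * scale + 0.5)


for h, w in [(1080, 1920), (256, 256), (3000, 2000)]:
    print(f"{h}x{w} -> {resized_shape(h, w)}")
# 1080x1920 -> (576, 1024)
# 256x256 -> (1024, 1024)
# 3000x2000 -> (1024, 683)
```

The 256x256 case matches the logs above, where a 256x256 test image produces `pixel_values` of shape [1, 3, 1024, 1024]. For non-square images the shorter side ends up below 1024 after resizing.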

References

Original model: https://huggingface.co/facebook/sam-vit-base
SAM paper: https://arxiv.org/abs/2304.02643
Official code: https://github.com/facebookresearch/segment-anything
HuggingFace Transformers: https://huggingface.co/transformers

License

This project is released under the Apache-2.0 license.