UNet Model Migration and Adaptation Guide

1. Model Overview

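UNet is a classic deep learning model with an encoder-decoder structure. Originally designed for medical image segmentation, it offers strong pixel-level prediction and is widely used in intelligent driving for semantic segmentation, lane line detection, drivable area segmentation, and similar tasks.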
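To make the encoder-decoder structure concrete, the following sketch traces feature-map shapes through a UNet-style network. It is an illustration under stated assumptions, not the exact layout of any particular implementation: it assumes size-preserving convolutions, 2x downsampling per encoder level with channel doubling, and a decoder that returns to each encoder level's resolution. The function name `unet_shapes` and its default parameters are hypothetical.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Trace (channels, height, width) through a UNet-style encoder and decoder.

    Assumptions: convolutions preserve spatial size, each encoder level halves
    the resolution and doubles the channels, and each decoder level restores
    the resolution and channel count of the matching encoder level.
    """
    encoder = [(base_channels, height, width)]
    for _ in range(depth):
        channels, h, w = encoder[-1]
        encoder.append((channels * 2, h // 2, w // 2))

    # The decoder walks back up through the encoder levels, deepest first,
    # skipping the bottleneck itself.
    decoder = list(reversed(encoder[:-1]))
    return encoder, decoder


encoder, decoder = unet_shapes(256, 256)
```

For a 256x256 input, the encoder in this sketch ends at a 16x16 bottleneck and the decoder returns to 256x256, which is what allows a prediction for every input pixel.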

2. Preparing the Runtime Environment

2.1 Software Environment

Component	Version
Python	3.11
PyTorch	2.5.1
torch_npu	2.5.1.post1.dev20250722
CANN	cann_8.2.rc1
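Before running anything, it can help to confirm that the active environment matches the table above. The following is a minimal sketch, assuming the packages are installed under the distribution names `torch` and `torch_npu`. The helper names are hypothetical, and the CANN version is not checked here because it is not a Python package.

```python
import sys
from importlib import metadata

# Versions taken from the table above; the distribution names are assumptions.
EXPECTED = {"torch": "2.5.1", "torch_npu": "2.5.1.post1.dev20250722"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package whose version differs."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Ignore a local build suffix such as "+cpu" when comparing.
        normalized = found.split("+")[0] if found else None
        if normalized != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


def python_matches(major=3, minor=11):
    """Check the interpreter against the Python version listed in the table."""
    return sys.version_info[:2] == (major, minor)
```

Calling `find_mismatches(EXPECTED)` inside the container returns an empty dictionary when both packages match the table.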

2.2 Hardware Environment

Device model	NPU configuration
Atlas 800T A3	Single card / multiple cards (device IDs 0-15)

2.3 Preparing the Image

Image environment	Image address
Public network	swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_ascend:pytorch_2.5.1-cann_8.2.rc1-py_3.11-hce_2.0.2503-aarch64-snt9b23-20250729103313-3a25129

2.4 Starting the Container

    docker run -itd -u root \
    --privileged \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci8 \
    --device=/dev/davinci9 \
    --device=/dev/davinci10 \
    --device=/dev/davinci11 \
    --device=/dev/davinci12 \
    --device=/dev/davinci13 \
    --device=/dev/davinci14 \
    --device=/dev/davinci15 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
    -v /etc/hccn.conf:/etc/hccn.conf \
    --shm-size 1024g --net=host \
    -v <host_dir>:<container_dir> \
    --name <container_name> <image_id> /bin/bash
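The command above lists one `--device=/dev/davinciN` flag for each of the 16 cards. When only a subset of cards should be exposed to the container, the flags can be generated rather than typed by hand. The sketch below is a hypothetical helper (the function names are not part of any tool) that produces only the per-card flags. The management devices, volume mounts, and remaining options from the command above still have to be supplied.

```python
def davinci_device_flags(card_ids):
    """Build one --device flag per NPU card, e.g. --device=/dev/davinci0."""
    return [f"--device=/dev/davinci{card_id}" for card_id in card_ids]


def format_for_shell(flags, indent="    "):
    """Join flags with backslash line continuations for pasting into docker run."""
    return " \\\n".join(indent + flag for flag in flags)


# All 16 cards, matching the command above.
flags = davinci_device_flags(range(16))
```

For example, `format_for_shell(davinci_device_flags([0, 1]))` yields two continuation lines that expose only cards 0 and 1.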

3. Running Guide

3.1 Creating the Environment

docker exec -it <container_name> bash
conda create -n unet --clone PyTorch-2.5.1
conda activate unet

3.2 Configuring the pip Index (Strongly Recommended)

To avoid failed or slow dependency downloads, use the Huawei PyPI mirror for all installations:

pip config --user set global.index https://mirrors.huaweicloud.com/repository/pypi
pip config --user set global.index-url https://mirrors.huaweicloud.com/repository/pypi/simple
pip config --user set global.trusted-host mirrors.huaweicloud.com
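For reference, the three commands above store their values in the `[global]` section of the per-user pip configuration file (typically `~/.config/pip/pip.conf` on Linux, though the exact location depends on the system). The sketch below renders the equivalent file content so that it can be compared with the output of `pip config list`. The helper name `render_pip_conf` is hypothetical.

```python
import configparser
import io

# Values copied from the pip config commands above.
MIRROR_SETTINGS = {
    "index": "https://mirrors.huaweicloud.com/repository/pypi",
    "index-url": "https://mirrors.huaweicloud.com/repository/pypi/simple",
    "trusted-host": "mirrors.huaweicloud.com",
}


def render_pip_conf(settings):
    """Render settings as the [global] section of a pip configuration file."""
    config = configparser.ConfigParser()
    config["global"] = settings
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


pip_conf_text = render_pip_conf(MIRROR_SETTINGS)
```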

3.3 Downloading the Model Source Code

cd /home/ma-user/
git clone https://github.com/milesial/Pytorch-UNet.git

3.4 Installing Dependencies

cd /home/ma-user/Pytorch-UNet
pip install -r requirements.txt

3.5 Preparing the Dataset

Run the following script to download the dataset:

bash scripts/download_data.sh

3.6 Migration and Adaptation

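Add the automatic migration code. Insert the following lines into train.py so that the script is migrated to the Ascend NPU automatically: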

import torch_npu
from torch_npu.contrib import transfer_to_npu
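If the same train.py also has to run on machines without an Ascend NPU, the two imports can be guarded. The following is a minimal sketch under that assumption. `enable_npu_migration` is a hypothetical helper name, and the guard only checks whether the `torch_npu` package is installed, not whether a device is actually usable.

```python
import importlib.util


def enable_npu_migration():
    """Import torch_npu and its transfer shim when available.

    Returns True if the automatic migration imports were executed,
    False if torch_npu is not installed in this environment.
    """
    if importlib.util.find_spec("torch_npu") is None:
        return False
    import torch_npu  # noqa: F401  (imported for its side effects)
    from torch_npu.contrib import transfer_to_npu  # noqa: F401
    return True


migration_enabled = enable_npu_migration()
```

The call should sit near the top of train.py, before any device-specific code runs.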

Fix the urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443) error. Part of the error message is shown below:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/socc/lib/python3.9/site-packages/requests/adapters.py", line 589, in send
    resp = conn.urlopen(
  File "/home/ma-user/anaconda3/envs/socc/lib/python3.9/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
  File "/home/ma-user/anaconda3/envs/socc/lib/python3.9/site-packages/urllib3/util/retry.py", line 535, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ProxyError('Unable to connect to proxy', OSError('Tunnel connection failed: 407 Proxy Authentication Required')))

During handling of the above exception, another exception occurred:

Switch wandb to offline mode by adding the following code to train.py:

import os
# Switch wandb to offline mode; this must run before wandb.init() is called
os.environ["WANDB_MODE"] = "offline"  # or "disabled"
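A variation on the setting above is to apply the offline mode only as a default, so that a mode exported in the shell still takes precedence. This is a sketch of one possible design, not part of the original project. `configure_wandb_mode` is a hypothetical helper name.

```python
import os


def configure_wandb_mode(default="offline", environ=os.environ):
    """Set WANDB_MODE only if no mode has been chosen yet; return the effective mode."""
    return environ.setdefault("WANDB_MODE", default)
```

With this helper, running `export WANDB_MODE=disabled` before `python train.py` keeps wandb fully disabled, while an unset variable falls back to offline logging.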

Fix the RuntimeError: Only c10::MemoryFormat::Contiguous is supported for creating a npu tensor error. Part of the error message is shown below:

  File "/home/ma-user/Pytorch-UNet/train.py", line 100, in train_model
    images = images.to(device=device, dtype=torch.float32, memory_format=torch.channels_last)
  File "/home/ma-user/anaconda3/envs/socc/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 151, in decorated
    return fn(*args, **kwargs)
RuntimeError: Only c10::MemoryFormat::Contiguous is supported for creating a npu tensor
[ERROR] 2026-01-28-10:37:09 (PID:95357, Device:0, RankID:-1) ERR01007 OPS feature not supported

The Huawei Ascend NPU currently supports only contiguous_format and preserve_format, so the following two files need to be modified.

evaluate.py, before the change:

image = image.to(device=device, dtype=torch.float32, memory_format=torch.channels_last)

After the change:

image = image.to(device=device, dtype=torch.float32)

train.py, before the change:

images = images.to(device=device, dtype=torch.float32, memory_format=torch.channels_last)

After the change:

images = images.to(device=device, dtype=torch.float32)
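The two edits above remove the same keyword argument, so they can also be applied with a small script. The sketch below assumes the argument appears in the source exactly as `, memory_format=torch.channels_last` (as in the lines shown above). The function names are hypothetical.

```python
from pathlib import Path

# The exact text removed by the manual edits above.
CHANNELS_LAST_ARG = ", memory_format=torch.channels_last"


def strip_channels_last(source):
    """Return source with the channels_last keyword argument removed from .to() calls."""
    return source.replace(CHANNELS_LAST_ARG, "")


def patch_file(path):
    """Rewrite one file in place; return True if anything changed."""
    file_path = Path(path)
    original = file_path.read_text()
    patched = strip_channels_last(original)
    if patched != original:
        file_path.write_text(patched)
    return patched != original
```

Running `patch_file("train.py")` and `patch_file("evaluate.py")` from the repository root would apply both changes. Reviewing the result with `git diff` afterwards is advisable.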

3.7 Starting Training

# Train on the NPU card with device ID 4
export ASCEND_RT_VISIBLE_DEVICES=4
python train.py
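The same device selection can be done from Python when training is launched by another script. The sketch below builds an environment with `ASCEND_RT_VISIBLE_DEVICES` set to a comma-separated list of device IDs. `training_env` is a hypothetical helper. Listing several IDs only makes those cards visible; whether train.py actually uses more than one card depends on the script itself.

```python
import os


def training_env(device_ids, base_env=None):
    """Return a copy of the environment with ASCEND_RT_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["ASCEND_RT_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env
```

A launcher could then call `subprocess.run([sys.executable, "train.py"], env=training_env([4]), check=True)` to reproduce the shell commands above.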

3.8 Performance

Hardware	Number of cards	Performance
910C	1	12.63 img/s
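The img/s figure is a throughput: images processed divided by elapsed wall-clock time. The source does not describe how the number above was measured, so the following is only a generic sketch of such a measurement. The function names are hypothetical, and on an accelerator `step_fn` would need to block until the device has finished its work for the timing to be meaningful.

```python
import time


def images_per_second(num_images, elapsed_seconds):
    """Throughput in images per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_images / elapsed_seconds


def measure_throughput(step_fn, batches, clock=time.perf_counter):
    """Run step_fn on each batch and return images processed per second."""
    start = clock()
    total_images = 0
    for batch in batches:
        step_fn(batch)
        total_images += len(batch)
    return images_per_second(total_images, clock() - start)
```

As a sanity check on the units, 1263 images processed in 100 seconds corresponds to 12.63 img/s.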