YOLOv13 is a deep-learning model for object detection that uses an advanced attention mechanism and an efficient feature-extraction network; it was originally developed to run on GPU.
The model performs object detection over the 80 classes of the COCO dataset. The task is to migrate both training and inference from GPU to Ascend NPU.
| Item | Version / Spec |
|---|---|
| OS / Architecture | openEuler 24.03 LTS / aarch64 |
| Driver / Firmware | TODO (to be added) |
| CANN | 8.3.RC2 |
| Python | 3.11.6 |
| torch / torch_npu | 2.1.0 / 2.1.0.post13.dev20250722 |
| Inference stack | ais_bench |
Version query screenshot:

Downloads and tools:
- YOLOv13l weights: `https://github.com/iMoonLab/yolov13/releases/download/yolov13/yolov13l.pt`
- COCO val2017 images: `http://images.cocodataset.org/zips/val2017.zip`
- COCO annotations: `http://images.cocodataset.org/annotations/annotations_trainval2017.zip`
- ModelZoo repository: `https://gitcode.com/Ascend/modelzoo-GPL.git` (built-in/ACL_Pytorch/Yolov13_for_PyTorch)
- Container image: train-images:v1
- Tools: atc, npu-smi, ais_bench
Fetch the ModelZoo code and the YOLOv13 source:

```bash
yum install git
git clone https://gitcode.com/Ascend/modelzoo-GPL.git
cd modelzoo-GPL/built-in/ACL_Pytorch/Yolov13_for_PyTorch
git clone https://github.com/iMoonLab/yolov13.git
cd yolov13
git reset --hard 73289949533efac82bb5f72ec19b746618656bd2
git apply ../diff.patch
```

Download the model weights:

```bash
wget https://github.com/iMoonLab/yolov13/releases/download/yolov13/yolov13l.pt
```

Pull the image and create the container environment:
```bash
docker run -it -u root -d --net=host \
    --privileged \
    --ipc=host \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    --name train_test \
    train-images:v1
```

Install the Python dependencies and the dataset-conversion library:
```bash
pip3 install -r ../requirements.txt
yum install mesa-libGL mesa-libGL-devel
pip3 install ultralytics
```

Error reported when the environment is missing a required library:

Download the COCO dataset and convert it to YOLO format:

```bash
mkdir dataset
cd dataset
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
yum install python3-pip
pip3 install ultralytics
python3 ../dataset_convert.py --data_path=dataset/annotations/
```

After conversion, move the image files:

```bash
mv dataset/val2017 coco_converted/images/
```

Final dataset layout:
```
yolov13
└── coco_converted
    ├── images
    │   └── val2017
    │       └── ***.jpg
    └── labels
        └── val2017
            └── ***.txt
```

Set up the training-path file:
Edit the coco.yaml config file: set `path` to the absolute path of the coco_converted directory, and set `val` to `images/val2017`.
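Each `***.txt` label file in the layout above stores one line per object: a class id followed by a box normalized to the image size. A minimal sketch of the per-box COCO-to-YOLO arithmetic (a hypothetical helper for illustration; the actual conversion is done by dataset_convert.py):

```python
def coco_box_to_yolo_line(cls_id, bbox, img_w, img_h):
    # COCO stores [x_min, y_min, width, height] in pixels; YOLO labels are
    # "cls cx cy w h" with the box center and size normalized to [0, 1].
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"
```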
Key code modules must be modified to support NPU training. The main change is to the AAttn class in block.py, adding NPU compatibility handling:
```python
def forward(self, x):
    try:
        device = next(self.parameters()).device
    except StopIteration:
        device = x.device
    B, C, H, W = x.shape
    N = H * W
    # Save the original shape
    B_orig, H_orig, W_orig = B, H, W
    N_orig = N
    qk = self.qk(x).flatten(2).transpose(1, 2)
    v = self.v(x)
    pp = self.pe(v)
    v = v.flatten(2).transpose(1, 2)
    if self.area > 1:
        qk = qk.reshape(B * self.area, N // self.area, C * 2)
        v = v.reshape(B * self.area, N // self.area, C)
        B, N, _ = qk.shape
    q, k = qk.split([C, C], dim=2)
    use_standard_attention = True
    if device.type == 'npu' and _HAS_NPU_FLASH_ATTN:
        try:
            import torch_npu
            # Cast to half explicitly to satisfy the operator's dtype requirement
            q = q.half()
            k = k.half()
            v = v.half()
            x = torch_npu.npu_prompt_flash_attention(
                q, k, v,
                input_layout='BSH',
                num_heads=self.num_heads,
                scale_value=1 / math.sqrt(self.head_dim)
            )
            use_standard_attention = False
        except (AttributeError, RuntimeError):
            pass
    if use_standard_attention:
        q_reshaped = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k_reshaped = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v_reshaped = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q_reshaped @ k_reshaped.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
        attn = attn.softmax(dim=-1)
        x = attn @ v_reshaped
        x = x.transpose(1, 2).reshape(B, N, C)
    # Restore the original shape (after the area branch)
    if self.area > 1:
        x = x.reshape(B_orig, N_orig, C).contiguous()
        B, N, H, W = B_orig, N_orig, H_orig, W_orig
    x = x.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
    return self.proj(x + pp)
```

Multi-card training requires setting environment variables:
```bash
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export HCCL_WHITELIST_DISABLE=1
export HCCL_DETERMINISTIC=false
export HCCL_CONNECT_TIMEOUT=600
```

Create the training script train_multi.py and run single-card training:
```bash
python train_multi.py --devices 0
```

For multi-card training, pass a device list instead, e.g. `--devices 0,1,2,3`. The training script:
```python
import torch
import os
import argparse
import torch_npu
from torch_npu.contrib import transfer_to_npu
from ultralytics import YOLO

def parse_args():
    parser = argparse.ArgumentParser(description='YOLOv13 Training Script')
    parser.add_argument('--devices', type=str, default='0',
                        help='NPU devices to use, e.g., "0" for single card, "0,1,2,3" for multi cards')
    parser.add_argument('--data', type=str, default='coco.yaml',
                        help='path to dataset config file')
    parser.add_argument('--model', type=str, default='yolov13l.pt',
                        help='path to model file')
    parser.add_argument('--epochs', type=int, default=1,
                        help='number of epochs')
    parser.add_argument('--imgsz', type=int, default=640,
                        help='image size')
    parser.add_argument('--batch', type=int, default=16,
                        help='batch size')
    parser.add_argument('--amp', action='store_true', default=True,
                        help='use AMP (Auto Mixed Precision)')
    return parser.parse_args()

def main():
    args = parse_args()
    model = YOLO(args.model)
    train_params = {
        'data': args.data,
        'epochs': args.epochs,
        'imgsz': args.imgsz,
        'batch': args.batch,
        'amp': args.amp,
        'device': args.devices  # was the undefined name `devices`
    }
    try:
        results = model.train(**train_params)
        print("Training completed successfully!")
    except Exception as e:
        print(f"Training failed with error: {e}")
        raise

if __name__ == "__main__":
    main()
```

Training success screenshot:


After the fix, the training load is balanced across cards:


Script for plotting the training loss curves:
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results.csv')
plt.figure(figsize=(15, 10))

# Plot the box-regression loss
plt.subplot(2, 2, 1)
plt.plot(df['epoch'], df['train/box_loss'], label='Train Box Loss', marker='o', markersize=3)
plt.plot(df['epoch'], df['val/box_loss'], label='Val Box Loss', marker='s', markersize=3)
plt.title('Box Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot the classification loss
plt.subplot(2, 2, 2)
plt.plot(df['epoch'], df['train/cls_loss'], label='Train Class Loss', marker='o', markersize=3)
plt.plot(df['epoch'], df['val/cls_loss'], label='Val Class Loss', marker='s', markersize=3)
plt.title('Classification Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_curves.png', dpi=300, bbox_inches='tight')
plt.show()
```

Training loss curves:

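The plotting script reads the column names that Ultralytics writes to results.csv (`epoch`, `train/box_loss`, `val/box_loss`, and so on). A stdlib-only sketch of extracting one loss series, using a hypothetical two-epoch excerpt of such a file:

```python
import csv
import io

# Hypothetical two-epoch excerpt of an Ultralytics results.csv
sample = """epoch,train/box_loss,train/cls_loss,val/box_loss,val/cls_loss
1,1.20,0.90,1.30,1.00
2,1.05,0.80,1.18,0.92
"""

rows = list(csv.DictReader(io.StringIO(sample)))
train_box_loss = [float(r["train/box_loss"]) for r in rows]
```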
Export the best.pt produced by training to ONNX format:
```python
import torch

checkpoint = torch.load('best.pt', map_location='cpu')
if isinstance(checkpoint, dict):
    if 'model' in checkpoint:
        model = checkpoint['model']
    elif 'state_dict' in checkpoint:
        model = checkpoint['state_dict']
    else:
        model = checkpoint
else:
    model = checkpoint
model = model.float()
model.eval()

dummy_input = torch.randn(1, 3, 640, 640).float()
torch.onnx.export(
    model,
    dummy_input,
    'best.onnx',
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=['images'],
    output_names=['output']
)
print("Export finished!")
```

Validate the ONNX model:

```bash
python -c "import onnx; model = onnx.load('best.onnx'); onnx.checker.check_model(model); print('ONNX model check passed!')"
```

ONNX validation screenshot:

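For a 640×640 input, the exported detector's raw output has shape (1, 84, 8400): 84 channels per prediction (4 box coordinates plus 80 class scores, matching the slicing used in the inference script below), and 8400 predictions contributed by the three stride-8/16/32 feature maps. A quick stdlib check of that arithmetic (this assumes the standard anchor-free YOLO head layout; verify against your exported model's actual output shape):

```python
def yolo_output_dims(img_size=640, num_classes=80, strides=(8, 16, 32)):
    # Each stride-s feature map contributes (img_size // s) ** 2 predictions.
    num_preds = sum((img_size // s) ** 2 for s in strides)
    channels = 4 + num_classes  # box (cx, cy, w, h) + per-class scores
    return channels, num_preds
```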
Use the ATC tool to convert the ONNX model to OM format:

```bash
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh

ONNX_MODEL="yolov13.onnx"
OM_MODEL="yolov13"
CHIP_TYPE="Ascend910B3"
INPUT_SHAPE="images:1,3,640,640"
PRECISION="allow_mix_precision"

echo "Starting ONNX-to-OM conversion..."
atc --model=$ONNX_MODEL \
    --framework=5 \
    --output=$OM_MODEL \
    --input_format=NCHW \
    --input_shape=$INPUT_SHAPE \
    --precision_mode=$PRECISION \
    --soc_version=$CHIP_TYPE

if [ $? -eq 0 ]; then
    echo "Model conversion succeeded! OM file saved as: $OM_MODEL"
else
    echo "Model conversion failed; check the error messages."
    exit 1
fi
```

Note: use `--soc_version=Ascend910B3` rather than `--chip_version`, and `--precision_mode=allow_mix_precision` rather than fp16.
Conversion success screenshot:

Inference-environment version query:

Create the image-to-binary conversion script jpg_bin.py:
```python
import cv2
import numpy as np
import glob
import os

os.makedirs("bin", exist_ok=True)
for f in glob.glob("coco10/images/*.jpg"):
    img = cv2.imread(f)
    img = cv2.resize(img, (640, 640))
    # HWC -> NCHW, float32, one .bin file per image
    img = img.transpose(2, 0, 1)[None, ...].astype(np.float32)
    bin_name = "bin/" + os.path.basename(f)[:-4] + ".bin"
    img.tofile(bin_name)
```

Run inference with ais_bench:
```bash
python jpg_bin.py
python3 -m ais_bench --model yolov13.om --input /home/modelzoo-GPL/built-in/ACL_Pytorch/Yolov13_for_PyTorch/yolov13/bin --output ascend_out/ --loop 100 --batchsize 5
```

Inference script:
```python
import cv2
import numpy as np
import argparse
import os
import random
from ais_bench.infer.interface import InferSession

COCO_CLASSES = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
    'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
    'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
    'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
    'hair drier', 'toothbrush'
]

def get_color(idx):
    # Deterministic per-class color
    idx = idx * 3
    color = ((37 * idx) % 255, (17 * idx) % 255, (29 * idx) % 255)
    return tuple(map(int, color))

def preprocess_image(image_path, input_size=640):
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Failed to read image: {image_path}")
    orig_h, orig_w = img.shape[:2]
    # Plain stretch resize (no letterbox); postprocessing must undo the same transform
    img_resized = cv2.resize(img, (input_size, input_size))
    img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
    img_normalized = img_rgb.astype(np.float32) / 255.0
    img_chw = np.transpose(img_normalized, (2, 0, 1))
    img_input = np.expand_dims(img_chw, axis=0)
    return img_input, img, (orig_h, orig_w)

def scale_coords_fixed(img1_shape, coords, img0_shape):
    # Preprocessing stretches the image with cv2.resize (no letterbox padding),
    # so x and y are rescaled independently back to the original resolution.
    gain_x = img1_shape[1] / img0_shape[1]
    gain_y = img1_shape[0] / img0_shape[0]
    coords[:, [0, 2]] /= gain_x
    coords[:, [1, 3]] /= gain_y
    coords[:, [0, 2]] = np.clip(coords[:, [0, 2]], 0, img0_shape[1])
    coords[:, [1, 3]] = np.clip(coords[:, [1, 3]], 0, img0_shape[0])
    return coords

def postprocess(outputs, conf_thres=0.5, iou_thres=0.5, orig_shape=None):
    try:
        # Raw output is (1, 84, 8400); transpose (not reshape) to (8400, 84)
        # so that each row remains one prediction.
        output = np.squeeze(outputs[0]).transpose()
        boxes = output[:, :4]
        scores = output[:, 4:]
        class_scores = np.max(scores, axis=1)
        class_ids = np.argmax(scores, axis=1)
        valid_indices = class_scores > conf_thres
        boxes = boxes[valid_indices]
        class_scores = class_scores[valid_indices]
        class_ids = class_ids[valid_indices]
        if len(boxes) == 0:
            return np.array([]), np.array([]), np.array([])
        # Convert (cx, cy, w, h) to (x1, y1, x2, y2)
        boxes_xyxy = np.copy(boxes)
        boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
        boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
        boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
        boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
        indices = cv2.dnn.NMSBoxes(
            boxes_xyxy.tolist(),
            class_scores.tolist(),
            conf_thres,
            iou_thres
        )
        if len(indices) > 0:
            if isinstance(indices[0], (list, np.ndarray)):
                indices = [idx[0] for idx in indices]
            else:
                indices = indices.tolist()
        else:
            indices = []
        final_boxes = boxes_xyxy[indices]
        final_scores = class_scores[indices]
        final_class_ids = class_ids[indices]
        if orig_shape and len(final_boxes) > 0:
            final_boxes = scale_coords_fixed((640, 640), final_boxes, orig_shape)
        return final_boxes, final_scores, final_class_ids
    except Exception:
        # Swallow postprocess errors and return empty detections
        return np.array([]), np.array([]), np.array([])

def draw_detections(image, boxes, scores, class_ids):
    if len(boxes) == 0:
        return image
    h, w = image.shape[:2]
    for i in range(len(boxes)):
        box = boxes[i]
        score = scores[i]
        class_id = int(class_ids[i])
        x1, y1, x2, y2 = map(int, box)
        x1 = max(0, min(x1, w))
        y1 = max(0, min(y1, h))
        x2 = max(0, min(x2, w))
        y2 = max(0, min(y2, h))
        if x2 <= x1 or y2 <= y1:
            continue
        class_name = COCO_CLASSES[class_id] if class_id < len(COCO_CLASSES) else f"Class {class_id}"
        color = get_color(class_id)
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
        label = f"{class_name}: {score:.2f}"
        label_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 2)[0]
        cv2.rectangle(image, (x1, y1 - label_size[1] - 10), (x1 + label_size[0], y1), color, -1)
        cv2.putText(image, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
    return image

def main():
    parser = argparse.ArgumentParser(description='YOLOv13 inference with OM model')
    parser.add_argument('-m', '--model', required=True, help='OM model file path')
    parser.add_argument('-i', '--input', required=True, help='Input image path or directory')
    parser.add_argument('--input-size', type=int, default=640, help='Model input size')
    parser.add_argument('--threshold', type=float, default=0.5, help='Confidence threshold')
    parser.add_argument('--output', default='output.jpg', help='Output image path')
    args = parser.parse_args()
    session = InferSession(0, args.model)
    if os.path.isdir(args.input):
        image_files = [f for f in os.listdir(args.input) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        if not image_files:
            raise ValueError(f"No image files found in directory: {args.input}")
        image_path = os.path.join(args.input, random.choice(image_files))
    else:
        image_path = args.input
    img_input, orig_img, (orig_h, orig_w) = preprocess_image(image_path, args.input_size)
    outputs = session.infer([img_input])
    boxes, scores, class_ids = postprocess(
        outputs,
        conf_thres=args.threshold,
        iou_thres=0.5,
        orig_shape=(orig_h, orig_w)
    )
    if len(boxes) > 0:
        result_img = draw_detections(orig_img, boxes, scores, class_ids)
    else:
        result_img = orig_img
    cv2.imwrite(args.output, result_img)

if __name__ == "__main__":
    main()
```

Inference success screenshot:


ais_bench evaluation completion screenshot:

Inference result image:

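The postprocessing in the inference script delegates non-maximum suppression to cv2.dnn.NMSBoxes. Conceptually, greedy NMS keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds the threshold; a pure-Python sketch (illustration only, not the cv2 implementation):

```python
def iou(a, b):
    # a, b: [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thres=0.5):
    # Greedy NMS: take the highest-scoring box, drop boxes overlapping it.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thres]
    return keep
```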
After successful inference, the model correctly detects the 80 COCO classes. The OM model's detections match those of the original PyTorch model, validating the correctness of the migration.
Symptom: training fails with NotImplementedError: Could not run 'npu::npu_prompt_flash_attention' with arguments from the 'CPU' backend — the operator only exists on the NPU backend, but the tensors are on CPU.
Likely cause: the Flash Attention operator in the model is not properly adapted to the NPU environment, and tensors were mistakenly placed on the CPU device.
Fix: modify the forward method of the AAttn class in block.py to add device detection and fallback logic, so the standard attention implementation is used when NPU Flash Attention is unavailable.
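That fix follows a simple try-the-fast-path pattern; a stdlib-only sketch of its structure (hypothetical stand-in functions, no torch dependency — the real code calls torch_npu.npu_prompt_flash_attention and a standard attention implementation):

```python
def fused_attention(q):
    # Stand-in for the fused NPU kernel on a build where it is missing.
    raise AttributeError("operator not available in this build")

def standard_attention(q):
    # Stand-in for the reference (standard) attention path.
    return [x * 2 for x in q]

def forward(q, on_npu=True):
    # Mirrors AAttn.forward: try the fused NPU kernel first, and fall back
    # to the standard implementation on AttributeError/RuntimeError.
    if on_npu:
        try:
            return fused_attention(q)
        except (AttributeError, RuntimeError):
            pass
    return standard_attention(q)
```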
Symptom: with AMP enabled, training fails with torch._dynamo.exc.Unsupported: hasattr: PythonModuleVariable().
Cause: TorchDynamo cannot trace this hasattr call.
Short-term fix: disable TorchDynamo dynamic compilation via an environment variable.
Long-term fix: use try-except with Dynamo's specific exceptions to control which parts are excluded from compilation.
Verification command:

```bash
TORCHDYNAMO_DISABLE=1 python train.py
```
Symptom: Out of Memory errors during training.
Likely cause: a single card's memory cannot support the current batch size and model size.
Fix: switch to an NPU card with more memory.
Verification command: `npu-smi info`
Symptom: with multi-card training configured, the processes concentrate on a single card while the other cards stay nearly idle.
Likely cause: incorrect device assignment and environment-variable configuration.
Fix: set the environment variables before importing torch, and map LOCAL_RANK to physical devices.

```bash
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export HCCL_WHITELIST_DISABLE=1
export HCCL_DETERMINISTIC=false
export HCCL_CONNECT_TIMEOUT=600
```
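The LOCAL_RANK-to-physical-device mapping can be sketched as follows (stdlib-only, hypothetical helper): with ASCEND_RT_VISIBLE_DEVICES=4,5,6,7, the process only sees logical devices 0–3, so LOCAL_RANK indexes into the visible list rather than naming a physical card directly.

```python
import os

def local_rank_to_physical_npu(visible="4,5,6,7"):
    # Logical device i inside the process corresponds to the i-th entry
    # of the ASCEND_RT_VISIBLE_DEVICES list; LOCAL_RANK selects that entry.
    physical = [int(d) for d in visible.split(",")]
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return physical[local_rank]
```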
Symptom: multi-card training fails with RuntimeError: HCCL error: uncorrect code=0x0000000b.
Likely cause: HCCL whitelist restrictions or network misconfiguration.
Fix: disable the HCCL whitelist, set non-deterministic mode, and configure the network environment variables HCCL_IF_IP and HCCL_IF_NAME.
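When setting these from inside a launcher script, the variables must be in the environment before torch/torch_npu initializes HCCL. A minimal sketch (the IP address and interface name below are hypothetical placeholders; substitute your host NIC's actual values):

```python
import os

# Hypothetical values: replace with the host NIC's real IP and name.
# Must run before any torch/torch_npu import that initializes HCCL.
os.environ["HCCL_WHITELIST_DISABLE"] = "1"
os.environ["HCCL_DETERMINISTIC"] = "false"
os.environ["HCCL_IF_IP"] = "192.168.1.10"
os.environ["HCCL_IF_NAME"] = "eth0"
```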