YOLOv13 is a deep-learning model for object detection that uses an advanced attention mechanism and an efficient feature-extraction network; it was originally developed to run on GPU.
The model performs object detection over the 80 classes of the COCO dataset. The task is to migrate both training and inference from GPU to Ascend NPU.
| Item | Version / Spec |
|---|---|
| OS / Architecture | openEuler 24.03 LTS / aarch64 |
| Driver / Firmware | TODO (to be added) |
| CANN | 8.3.RC2 |
| Python | 3.11.6 |
| torch / torch_npu | 2.1.0 / 2.1.0.post13.dev20250722 |
| Inference stack | ais_bench |
Version query screenshot:

Downloads and tools:
- YOLOv13l weights: `https://github.com/iMoonLab/yolov13/releases/download/yolov13/yolov13l.pt`
- COCO val2017 images: `http://images.cocodataset.org/zips/val2017.zip`
- COCO annotations: `http://images.cocodataset.org/annotations/annotations_trainval2017.zip`
- ModelZoo repository: `https://gitcode.com/Ascend/modelzoo-GPL.git` (built-in/ACL_Pytorch/Yolov13_for_PyTorch)
- Container image: train-images:v1
- Tools: atc, npu-smi, ais_bench
Fetch the ModelZoo code and the YOLOv13 source:

```bash
yum install git
git clone https://gitcode.com/Ascend/modelzoo-GPL.git
cd modelzoo-GPL/built-in/ACL_Pytorch/Yolov13_for_PyTorch
git clone https://github.com/iMoonLab/yolov13.git
cd yolov13
git reset --hard 73289949533efac82bb5f72ec19b746618656bd2
git apply ../diff.patch
```

Download the model weights:

```bash
wget https://github.com/iMoonLab/yolov13/releases/download/yolov13/yolov13l.pt
```

Pull the image and create the container environment:
```bash
docker run -it -u root -d --net=host \
    --privileged \
    --ipc=host \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    --name train_test \
    train-images:v1
```

Install the Python dependencies and the dataset-conversion library:
```bash
pip3 install -r ../requirements.txt
yum install mesa-libGL mesa-libGL-devel
pip3 install ultralytics
```

Error reported when the environment is missing a required library:

Download the COCO dataset and convert it to YOLO format:

```bash
mkdir dataset
cd dataset
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
yum install python3-pip
pip3 install ultralytics
python3 ../dataset_convert.py --data_path=dataset/annotations/
```

After conversion, move the image files:

```bash
mv dataset/val2017 coco_converted/images/
```

Final dataset layout:
```
yolov13
└── coco_converted
    ├── images
    │   └── val2017
    │       └── ***.jpg
    └── labels
        └── val2017
            └── ***.txt
```

Set up the training-path file:
Edit the coco.yaml config file: set `path` to the absolute path of the coco_converted directory, and set `val` to `images/val2017`.
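Each `***.txt` label file in the layout above stores one line per object: a class id followed by a box normalized to the image size. A minimal sketch of the per-box COCO-to-YOLO arithmetic (a hypothetical helper for illustration; the actual conversion is done by dataset_convert.py):

```python
def coco_box_to_yolo_line(cls_id, bbox, img_w, img_h):
    # COCO stores [x_min, y_min, width, height] in pixels; YOLO labels are
    # "cls cx cy w h" with the box center and size normalized to [0, 1].
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"
```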
Key code modules must be modified to support NPU training. The main change is to the AAttn class in block.py, adding NPU compatibility handling:
```python
def forward(self, x):
    try:
        device = next(self.parameters()).device
    except StopIteration:
        device = x.device
    B, C, H, W = x.shape
    N = H * W
    # Save the original shape
    B_orig, H_orig, W_orig = B, H, W
    N_orig = N
    qk = self.qk(x).flatten(2).transpose(1, 2)
    v = self.v(x)
    pp = self.pe(v)
    v = v.flatten(2).transpose(1, 2)
    if self.area > 1:
        qk = qk.reshape(B * self.area, N // self.area, C * 2)
        v = v.reshape(B * self.area, N // self.area, C)
        B, N, _ = qk.shape
    q, k = qk.split([C, C], dim=2)
    use_standard_attention = True
    if device.type == 'npu' and _HAS_NPU_FLASH_ATTN:
        try:
            import torch_npu
            # Cast to half explicitly to satisfy the operator's dtype requirement
            q = q.half()
            k = k.half()
            v = v.half()
            x = torch_npu.npu_prompt_flash_attention(
                q, k, v,
                input_layout='BSH',
                num_heads=self.num_heads,
                scale_value=1 / math.sqrt(self.head_dim)
            )
            use_standard_attention = False
        except (AttributeError, RuntimeError):
            pass
    if use_standard_attention:
        q_reshaped = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k_reshaped = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v_reshaped = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q_reshaped @ k_reshaped.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
        attn = attn.softmax(dim=-1)
        x = attn @ v_reshaped
        x = x.transpose(1, 2).reshape(B, N, C)
    # Restore the original shape (after the area branch)
    if self.area > 1:
        x = x.reshape(B_orig, N_orig, C).contiguous()
        B, N, H, W = B_orig, N_orig, H_orig, W_orig
    x = x.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
    return self.proj(x + pp)
```

Multi-card training requires setting environment variables:
```bash
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export HCCL_WHITELIST_DISABLE=1
export HCCL_DETERMINISTIC=false
export HCCL_CONNECT_TIMEOUT=600
```

Create the training script train_multi.py and run single-card training:
```bash
python train_multi.py --devices 0
```

For multi-card training, pass a device list instead, e.g. `--devices 0,1,2,3`. The training script:
```python
import torch
import os
import argparse
import torch_npu
from torch_npu.contrib import transfer_to_npu
from ultralytics import YOLO

def parse_args():
    parser = argparse.ArgumentParser(description='YOLOv13 Training Script')
    parser.add_argument('--devices', type=str, default='0',
                        help='NPU devices to use, e.g., "0" for single card, "0,1,2,3" for multi cards')
    parser.add_argument('--data', type=str, default='coco.yaml',
                        help='path to dataset config file')
    parser.add_argument('--model', type=str, default='yolov13l.pt',
                        help='path to model file')
    parser.add_argument('--epochs', type=int, default=1,
                        help='number of epochs')
    parser.add_argument('--imgsz', type=int, default=640,
                        help='image size')
    parser.add_argument('--batch', type=int, default=16,
                        help='batch size')
    parser.add_argument('--amp', action='store_true', default=True,
                        help='use AMP (Auto Mixed Precision)')
    return parser.parse_args()

def main():
    args = parse_args()
    model = YOLO(args.model)
    train_params = {
        'data': args.data,
        'epochs': args.epochs,
        'imgsz': args.imgsz,
        'batch': args.batch,
        'amp': args.amp,
        'device': args.devices  # was the undefined name `devices`
    }
    try:
        results = model.train(**train_params)
        print("Training completed successfully!")
    except Exception as e:
        print(f"Training failed with error: {e}")
        raise

if __name__ == "__main__":
    main()
```

Training success screenshot:


After the fix, the training load is balanced across cards:


Script for plotting the training loss curves:
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results.csv')
plt.figure(figsize=(15, 10))

# Plot the box-regression loss
plt.subplot(2, 2, 1)
plt.plot(df['epoch'], df['train/box_loss'], label='Train Box Loss', marker='o', markersize=3)
plt.plot(df['epoch'], df['val/box_loss'], label='Val Box Loss', marker='s', markersize=3)
plt.title('Box Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot the classification loss
plt.subplot(2, 2, 2)
plt.plot(df['epoch'], df['train/cls_loss'], label='Train Class Loss', marker='o', markersize=3)
plt.plot(df['epoch'], df['val/cls_loss'], label='Val Class Loss', marker='s', markersize=3)
plt.title('Classification Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_curves.png', dpi=300, bbox_inches='tight')
plt.show()
```

Training loss curves:

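The plotting script reads the column names that Ultralytics writes to results.csv (`epoch`, `train/box_loss`, `val/box_loss`, and so on). A stdlib-only sketch of extracting one loss series, using a hypothetical two-epoch excerpt of such a file:

```python
import csv
import io

# Hypothetical two-epoch excerpt of an Ultralytics results.csv
sample = """epoch,train/box_loss,train/cls_loss,val/box_loss,val/cls_loss
1,1.20,0.90,1.30,1.00
2,1.05,0.80,1.18,0.92
"""

rows = list(csv.DictReader(io.StringIO(sample)))
train_box_loss = [float(r["train/box_loss"]) for r in rows]
```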
Export the best.pt produced by training to ONNX format:
```python
import torch

checkpoint = torch.load('best.pt', map_location='cpu')
if isinstance(checkpoint, dict):
    if 'model' in checkpoint:
        model = checkpoint['model']
    elif 'state_dict' in checkpoint:
        model = checkpoint['state_dict']
    else:
        model = checkpoint
else:
    model = checkpoint
model = model.float()
model.eval()

dummy_input = torch.randn(1, 3, 640, 640).float()
torch.onnx.export(
    model,
    dummy_input,
    'best.onnx',
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=['images'],
    output_names=['output']
)
print("Export finished!")
```

Validate the ONNX model:

```bash
python -c "import onnx; model = onnx.load('best.onnx'); onnx.checker.check_model(model); print('ONNX model check passed!')"
```

ONNX validation screenshot:

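For a 640×640 input, the exported detector's raw output has shape (1, 84, 8400): 84 channels per prediction (4 box coordinates plus 80 class scores, matching the slicing used in the inference script below), and 8400 predictions contributed by the three stride-8/16/32 feature maps. A quick stdlib check of that arithmetic (this assumes the standard anchor-free YOLO head layout; verify against your exported model's actual output shape):

```python
def yolo_output_dims(img_size=640, num_classes=80, strides=(8, 16, 32)):
    # Each stride-s feature map contributes (img_size // s) ** 2 predictions.
    num_preds = sum((img_size // s) ** 2 for s in strides)
    channels = 4 + num_classes  # box (cx, cy, w, h) + per-class scores
    return channels, num_preds
```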
Use the ATC tool to convert the ONNX model to OM format:

```bash
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh

ONNX_MODEL="yolov13.onnx"
OM_MODEL="yolov13"
CHIP_TYPE="Ascend910B3"
INPUT_SHAPE="images:1,3,640,640"
PRECISION="allow_mix_precision"

echo "Starting ONNX-to-OM conversion..."
atc --model=$ONNX_MODEL \
    --framework=5 \
    --output=$OM_MODEL \
    --input_format=NCHW \
    --input_shape=$INPUT_SHAPE \
    --precision_mode=$PRECISION \
    --soc_version=$CHIP_TYPE

if [ $? -eq 0 ]; then
    echo "Model conversion succeeded! OM file saved as: $OM_MODEL"
else
    echo "Model conversion failed; check the error messages."
    exit 1
fi
```

Note: use `--soc_version=Ascend910B3` rather than `--chip_version`, and `--precision_mode=allow_mix_precision` rather than fp16.
Conversion success screenshot:

Inference-environment version query:

Create the image-to-binary conversion script jpg_bin.py:
```python
import cv2
import numpy as np
import glob
import os

os.makedirs("bin", exist_ok=True)
for f in glob.glob("coco10/images/*.jpg"):
    img = cv2.imread(f)
    img = cv2.resize(img, (640, 640))
    # HWC -> NCHW, float32, one .bin file per image
    img = img.transpose(2, 0, 1)[None, ...].astype(np.float32)
    bin_name = "bin/" + os.path.basename(f)[:-4] + ".bin"
    img.tofile(bin_name)
```

Run inference with ais_bench:
```bash
python jpg_bin.py
python3 -m ais_bench --model yolov13.om --input /home/modelzoo-GPL/built-in/ACL_Pytorch/Yolov13_for_PyTorch/yolov13/bin --output ascend_out/ --loop 100 --batchsize 5
```

Inference script:
```python
import cv2
import numpy as np
import argparse
import os
import random
from ais_bench.infer.interface import InferSession

COCO_CLASSES = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
    'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
    'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
    'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
    'hair drier', 'toothbrush'
]

def get_color(idx):
    # Deterministic per-class color
    idx = idx * 3
    color = ((37 * idx) % 255, (17 * idx) % 255, (29 * idx) % 255)
    return tuple(map(int, color))

def preprocess_image(image_path, input_size=640):
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Failed to read image: {image_path}")
    orig_h, orig_w = img.shape[:2]
    # Plain stretch resize (no letterbox); postprocessing must undo the same transform
    img_resized = cv2.resize(img, (input_size, input_size))
    img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
    img_normalized = img_rgb.astype(np.float32) / 255.0
    img_chw = np.transpose(img_normalized, (2, 0, 1))
    img_input = np.expand_dims(img_chw, axis=0)
    return img_input, img, (orig_h, orig_w)

def scale_coords_fixed(img1_shape, coords, img0_shape):
    # Preprocessing stretches the image with cv2.resize (no letterbox padding),
    # so x and y are rescaled independently back to the original resolution.
    gain_x = img1_shape[1] / img0_shape[1]
    gain_y = img1_shape[0] / img0_shape[0]
    coords[:, [0, 2]] /= gain_x
    coords[:, [1, 3]] /= gain_y
    coords[:, [0, 2]] = np.clip(coords[:, [0, 2]], 0, img0_shape[1])
    coords[:, [1, 3]] = np.clip(coords[:, [1, 3]], 0, img0_shape[0])
    return coords

def postprocess(outputs, conf_thres=0.5, iou_thres=0.5, orig_shape=None):
    try:
        # Raw output is (1, 84, 8400); transpose (not reshape) to (8400, 84)
        # so that each row remains one prediction.
        output = np.squeeze(outputs[0]).transpose()
        boxes = output[:, :4]
        scores = output[:, 4:]
        class_scores = np.max(scores, axis=1)
        class_ids = np.argmax(scores, axis=1)
        valid_indices = class_scores > conf_thres
        boxes = boxes[valid_indices]
        class_scores = class_scores[valid_indices]
        class_ids = class_ids[valid_indices]
        if len(boxes) == 0:
            return np.array([]), np.array([]), np.array([])
        # Convert (cx, cy, w, h) to (x1, y1, x2, y2)
        boxes_xyxy = np.copy(boxes)
        boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
        boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
        boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
        boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
        indices = cv2.dnn.NMSBoxes(
            boxes_xyxy.tolist(),
            class_scores.tolist(),
            conf_thres,
            iou_thres
        )
        if len(indices) > 0:
            if isinstance(indices[0], (list, np.ndarray)):
                indices = [idx[0] for idx in indices]
            else:
                indices = indices.tolist()
        else:
            indices = []
        final_boxes = boxes_xyxy[indices]
        final_scores = class_scores[indices]
        final_class_ids = class_ids[indices]
        if orig_shape and len(final_boxes) > 0:
            final_boxes = scale_coords_fixed((640, 640), final_boxes, orig_shape)
        return final_boxes, final_scores, final_class_ids
    except Exception:
        # Swallow postprocess errors and return empty detections
        return np.array([]), np.array([]), np.array([])

def draw_detections(image, boxes, scores, class_ids):
    if len(boxes) == 0:
        return image
    h, w = image.shape[:2]
    for i in range(len(boxes)):
        box = boxes[i]
        score = scores[i]
        class_id = int(class_ids[i])
        x1, y1, x2, y2 = map(int, box)
        x1 = max(0, min(x1, w))
        y1 = max(0, min(y1, h))
        x2 = max(0, min(x2, w))
        y2 = max(0, min(y2, h))
        if x2 <= x1 or y2 <= y1:
            continue
        class_name = COCO_CLASSES[class_id] if class_id < len(COCO_CLASSES) else f"Class {class_id}"
        color = get_color(class_id)
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
        label = f"{class_name}: {score:.2f}"
        label_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 2)[0]
        cv2.rectangle(image, (x1, y1 - label_size[1] - 10), (x1 + label_size[0], y1), color, -1)
        cv2.putText(image, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
    return image

def main():
    parser = argparse.ArgumentParser(description='YOLOv13 inference with OM model')
    parser.add_argument('-m', '--model', required=True, help='OM model file path')
    parser.add_argument('-i', '--input', required=True, help='Input image path or directory')
    parser.add_argument('--input-size', type=int, default=640, help='Model input size')
    parser.add_argument('--threshold', type=float, default=0.5, help='Confidence threshold')
    parser.add_argument('--output', default='output.jpg', help='Output image path')
    args = parser.parse_args()
    session = InferSession(0, args.model)
    if os.path.isdir(args.input):
        image_files = [f for f in os.listdir(args.input) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        if not image_files:
            raise ValueError(f"No image files found in directory: {args.input}")
        image_path = os.path.join(args.input, random.choice(image_files))
    else:
        image_path = args.input
    img_input, orig_img, (orig_h, orig_w) = preprocess_image(image_path, args.input_size)
    outputs = session.infer([img_input])
    boxes, scores, class_ids = postprocess(
        outputs,
        conf_thres=args.threshold,
        iou_thres=0.5,
        orig_shape=(orig_h, orig_w)
    )
    if len(boxes) > 0:
        result_img = draw_detections(orig_img, boxes, scores, class_ids)
    else:
        result_img = orig_img
    cv2.imwrite(args.output, result_img)

if __name__ == "__main__":
    main()
```

Inference success screenshot:


ais_bench evaluation completion screenshot:

Inference result image:

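The postprocessing in the inference script delegates non-maximum suppression to cv2.dnn.NMSBoxes. Conceptually, greedy NMS keeps the highest-scoring box and discards any remaining box whose IoU with it exceeds the threshold; a pure-Python sketch (illustration only, not the cv2 implementation):

```python
def iou(a, b):
    # a, b: [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thres=0.5):
    # Greedy NMS: take the highest-scoring box, drop boxes overlapping it.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thres]
    return keep
```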
After successful inference, the model correctly detects the 80 COCO classes. The OM model's detections match those of the original PyTorch model, validating the correctness of the migration.
Symptom: training fails with NotImplementedError: Could not run 'npu::npu_prompt_flash_attention' with arguments from the 'CPU' backend — the operator only exists on the NPU backend, but the tensors are on CPU.
Likely cause: the Flash Attention operator in the model is not properly adapted to the NPU environment, and tensors were mistakenly placed on the CPU device.
Fix: modify the forward method of the AAttn class in block.py to add device detection and fallback logic, so the standard attention implementation is used when NPU Flash Attention is unavailable.
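That fix follows a simple try-the-fast-path pattern; a stdlib-only sketch of its structure (hypothetical stand-in functions, no torch dependency — the real code calls torch_npu.npu_prompt_flash_attention and a standard attention implementation):

```python
def fused_attention(q):
    # Stand-in for the fused NPU kernel on a build where it is missing.
    raise AttributeError("operator not available in this build")

def standard_attention(q):
    # Stand-in for the reference (standard) attention path.
    return [x * 2 for x in q]

def forward(q, on_npu=True):
    # Mirrors AAttn.forward: try the fused NPU kernel first, and fall back
    # to the standard implementation on AttributeError/RuntimeError.
    if on_npu:
        try:
            return fused_attention(q)
        except (AttributeError, RuntimeError):
            pass
    return standard_attention(q)
```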
Symptom: with AMP enabled, training fails with torch._dynamo.exc.Unsupported: hasattr: PythonModuleVariable().
Cause: TorchDynamo cannot trace this hasattr call.
Short-term fix: disable TorchDynamo dynamic compilation via an environment variable.
Long-term fix: use try-except with Dynamo's specific exceptions to control which parts are excluded from compilation.
Verification command:

```bash
TORCHDYNAMO_DISABLE=1 python train.py
```
Symptom: Out of Memory errors during training.
Likely cause: a single card's memory cannot support the current batch size and model size.
Fix: switch to an NPU card with more memory.
Verification command: `npu-smi info`
Symptom: with multi-card training configured, the processes concentrate on a single card while the other cards stay nearly idle.
Likely cause: incorrect device assignment and environment-variable configuration.
Fix: set the environment variables before importing torch, and map LOCAL_RANK to physical devices.

```bash
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export HCCL_WHITELIST_DISABLE=1
export HCCL_DETERMINISTIC=false
export HCCL_CONNECT_TIMEOUT=600
```
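The LOCAL_RANK-to-physical-device mapping can be sketched as follows (stdlib-only, hypothetical helper): with ASCEND_RT_VISIBLE_DEVICES=4,5,6,7, the process only sees logical devices 0–3, so LOCAL_RANK indexes into the visible list rather than naming a physical card directly.

```python
import os

def local_rank_to_physical_npu(visible="4,5,6,7"):
    # Logical device i inside the process corresponds to the i-th entry
    # of the ASCEND_RT_VISIBLE_DEVICES list; LOCAL_RANK selects that entry.
    physical = [int(d) for d in visible.split(",")]
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return physical[local_rank]
```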
Symptom: multi-card training fails with RuntimeError: HCCL error: uncorrect code=0x0000000b.
Likely cause: HCCL whitelist restrictions or network misconfiguration.
Fix: disable the HCCL whitelist, set non-deterministic mode, and configure the network environment variables HCCL_IF_IP and HCCL_IF_NAME.
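When setting these from inside a launcher script, the variables must be in the environment before torch/torch_npu initializes HCCL. A minimal sketch (the IP address and interface name below are hypothetical placeholders; substitute your host NIC's actual values):

```python
import os

# Hypothetical values: replace with the host NIC's real IP and name.
# Must run before any torch/torch_npu import that initializes HCCL.
os.environ["HCCL_WHITELIST_DISABLE"] = "1"
os.environ["HCCL_DETERMINISTIC"] = "false"
os.environ["HCCL_IF_IP"] = "192.168.1.10"
os.environ["HCCL_IF_NAME"] = "eth0"
```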