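SparseTrack is a simple yet strong multi-object tracker. It proposes a new approach to multi-object tracking that decomposes dense scenes through pseudo-depth estimation and a depth cascade matching strategy. The method performs well on the MOT17 and MOT20 benchmarks, reaching performance comparable to more complex algorithms while using IoU matching alone. SparseTrack offers a new way to tackle multi-object tracking in crowded scenes and shows how far a simple method can go on a complex task.

This document describes how to migrate SparseTrack to the Ascend platform and how to tune its training performance.

| Component | Version |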
|---|---|
| Python | 3.9 |
| torch | 2.1.0 |
| torch_npu | 2.1.0 |
| torchvision | 0.16.0 |
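
Once the environment is set up, the versions in the table above can be checked programmatically. The following is a minimal sketch rather than part of the original guide: the helper name `version_mismatches` is hypothetical, and it assumes a prefix match is acceptable so that builds with a local suffix (for example a `.post` or `+` tag) still pass.

```python
from importlib import metadata

# Versions copied from the table above.
EXPECTED = {
    "torch": "2.1.0",
    "torch_npu": "2.1.0",
    "torchvision": "0.16.0",
}


def version_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for packages that do not match.

    `found` is None when the package is not installed. A prefix match is
    used (an assumption) so that suffixed builds are still accepted.
    """
    problems = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None or not found.startswith(want):
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in version_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

An empty result means every listed package matches; the `lookup` parameter exists only so the check can be exercised without the packages installed.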

| Device model | NPU configuration |
|---|---|
| Atlas 800T A2 | 2 cards |

Image address: base images matched to each Ascend Cloud release.

| Model | Image name | Image address |
|---|---|---|
| 910B | PyTorch 2.1 container image | Internal registry: registry-cbu.huawei.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.0.rc3-py_3.9-hce_2.0.2409-aarch64-snt9b-20241213131522-aafe527; external registry: swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.0.rc3-py_3.9-hce_2.0.2409-aarch64-snt9b-20241213131522-aafe527 |

    # Pull the base image
    docker pull registry-cbu.huawei.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.0.rc3-py_3.9-hce_2.0.2409-aarch64-snt9b-20241213131522-aafe527

    # Start the container with the NPU devices and driver files mounted
    docker run -itd \
      -u root \
      --privileged \
      --device=/dev/davinci0 \
      --device=/dev/davinci1 \
      --device=/dev/davinci2 \
      --device=/dev/davinci3 \
      --device=/dev/davinci4 \
      --device=/dev/davinci5 \
      --device=/dev/davinci6 \
      --device=/dev/davinci7 \
      --device=/dev/davinci_manager \
      --device=/dev/devmm_svm \
      --device=/dev/hisi_hdc \
      -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
      -v /usr/local/dcmi:/usr/local/dcmi \
      -v /etc/ascend_install.info:/etc/ascend_install.info \
      -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
      -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
      -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
      -v /etc/hccn.conf:/etc/hccn.conf \
      --shm-size 1024g \
      --network=host \
      --name SparseTrack \
      1813d558bbe2 \
      /bin/bash

    # Open a shell in the running container
    docker exec -it SparseTrack bash
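
The `--device` flags in the command above follow a regular pattern, so they can be generated and checked against the host before starting the container. This is a sketch under stated assumptions: `device_flags` and `missing_devices` are hypothetical helper names, and the list of management nodes is copied from the command above rather than taken from any Ascend API.

```python
from pathlib import Path

# Management device nodes passed to `docker run` in the command above.
MANAGEMENT_NODES = ("davinci_manager", "devmm_svm", "hisi_hdc")


def device_flags(npu_count, dev_root="/dev"):
    """Build the `--device=` flags for `npu_count` NPUs plus management nodes."""
    names = [f"davinci{i}" for i in range(npu_count)] + list(MANAGEMENT_NODES)
    return [f"--device={dev_root}/{name}" for name in names]


def missing_devices(flags):
    """Return the device paths named in `flags` that do not exist on this host."""
    paths = [flag.split("=", 1)[1] for flag in flags]
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    flags = device_flags(8)
    print(" \\\n  ".join(flags))
    for path in missing_devices(flags):
        print(f"warning: {path} not found on this host")
```

Running it on the host prints the flags for eight NPUs and warns about any device node that is absent, which is usually a sign that the driver is not installed or that the card count differs.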
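
    # Clone the PyTorch environment shipped with the image and activate it
    conda create -n SparseTrack --clone PyTorch-2.1.0
    conda activate SparseTrack

The selected image already contains the other dependencies, so they do not need to be reinstalled. Only the detectron2 framework has to be installed:

    pip install 'git+https://github.com/facebookresearch/detectron2.git'

    # Clone the SparseTrack repository
    cd /home/HwHiAiUser
    git clone https://github.com/hustvl/SparseTrack.git

    # Download the YOLOX-X pretrained weights
    cd SparseTrack
    mkdir pretrain
    cd pretrain
    wget https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_x.pth

    # Download and unpack the MOT17 dataset
    cd /data/datasets
    wget https://motchallenge.net/data/MOT17.zip
    unzip MOT17.zip

    # Convert the MOT17 annotations to COCO format
    cd /home/HwHiAiUser/SparseTrack
    python tools/convert_mot17_to_coco.py

    # Edit mot17_train_config.py
    data_dir=/data/datasets/MOT17
    init_checkpoint=/home/HwHiAiUser/SparseTrack/pretrain/yolox_x.pth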
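
Before launching training it can help to confirm that the two paths set in `mot17_train_config.py` actually exist, since an interrupted `wget` or `unzip` otherwise only shows up once the job starts. The sketch below is illustrative: `check_training_inputs` is a hypothetical helper, not part of SparseTrack, and the paths are the ones used in this guide.

```python
from pathlib import Path


def check_training_inputs(data_dir, init_checkpoint):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not Path(data_dir).is_dir():
        problems.append(f"dataset directory not found: {data_dir}")
    ckpt = Path(init_checkpoint)
    if not ckpt.is_file():
        problems.append(f"checkpoint not found: {init_checkpoint}")
    elif ckpt.stat().st_size == 0:
        # A zero-byte file usually points to an interrupted download.
        problems.append(f"checkpoint is empty: {init_checkpoint}")
    return problems


if __name__ == "__main__":
    for problem in check_training_inputs(
        "/data/datasets/MOT17",
        "/home/HwHiAiUser/SparseTrack/pretrain/yolox_x.pth",
    ):
        print(problem)
```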

    # Edit register_data.py
    VAL_JSON="/data/datasets/MOT17/annotations/val_half.json"
    VAL_PATH="/data/datasets/MOT17/train"

    # Edit train.py: insert the automatic NPU adaptation imports at the top
    import torch_npu
    from torch_npu.contrib import transfer_to_npu

Open detectron2's `launch.py`:

    vi /home/ma-user/anaconda3/envs/SparseTrack/lib/python3.9/site-packages/detectron2/engine/launch.py

Change the distributed backend from NCCL to HCCL:

    dist.init_process_group(
        backend="HCCL" if has_gpu else "GLOO",  # changed from NCCL to HCCL
        init_method=dist_url,
        world_size=world_size,
        rank=global_rank,
        timeout=timeout,
    )

Open detectron2's `collect_env.py`:

    vi /home/ma-user/anaconda3/envs/SparseTrack/lib/python3.9/site-packages/detectron2/utils/collect_env.py

Guard against a missing device capability:

    # Before
    for k in range(torch.cuda.device_count()):
        cap = ".".join((str(x) for x in torch.cuda.get_device_capability(k)))

    # After
    for k in range(torch.cuda.device_count()):
        cap_info = torch.cuda.get_device_capability(k)
        if cap_info is not None:
            cap = ".".join((str(x) for x in cap_info))
        else:
            cap = "NPU Device"
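
The `collect_env.py` change reduces to one small rule: format the capability tuple when there is one, otherwise fall back to a fixed label. The guard suggests that `torch.cuda.get_device_capability` can return `None` once calls are redirected to the NPU; that is inferred from the patch, not confirmed here. A standalone sketch of the same logic, with the hypothetical name `format_capability`:

```python
def format_capability(cap_info):
    """Mirror the patched collect_env.py logic for a single device.

    `cap_info` is a (major, minor) tuple as on CUDA devices, or None when
    the backend reports no compute capability (assumed to be the NPU case).
    """
    if cap_info is not None:
        return ".".join(str(x) for x in cap_info)
    return "NPU Device"


if __name__ == "__main__":
    print(format_capability((8, 0)))  # CUDA-style capability tuple
    print(format_capability(None))    # no capability reported
```

Keeping the fallback a plain string means the rest of detectron2's environment report, which only prints the value, needs no further changes.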

    pip uninstall datasets

Open detectron2's `train_loop.py`:

    vi /home/ma-user/anaconda3/envs/SparseTrack/lib/python3.9/site-packages/detectron2/engine/train_loop.py

Set the autocast dtype to bfloat16:

    if self.zero_grad_before_forward:
        self.optimizer.zero_grad()
    with autocast(dtype=torch.bfloat16):  # originally dtype=self.precision
        loss_dict = self.model(data)

Start training on two NPUs:

    ASCEND_RT_VISIBLE_DEVICES=2,3 python train.py --num-gpus 2 --config-file mot17_train_config.py

Training performance:

| Device | Training speed |
|---|---|
| NPU (2 dies) | 0.81 s/iteration |
| GPU (2 cards) | 0.35 s/iteration |

Performance ratio: NPU = 0.86 × H100.
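
    # Install gperftools
    wget https://github.com/gperftools/gperftools/releases/download/gperftools-2.16/gperftools-2.16.tar.gz
    tar -xf gperftools-2.16.tar.gz && cd gperftools-2.16
    ./configure --prefix=/usr/local/lib --with-tcmalloc-pagesize=64
    make
    make install
    # Locate the installed library
    find /usr -name 'libtcmalloc.so*'
    # Preload tcmalloc to enable the high-performance allocator
    export LD_PRELOAD="$LD_PRELOAD:/usr/local/lib/lib/libtcmalloc.so"

    # Enable the two-level task queue pipeline
    export TASK_QUEUE_ENABLE=2
    # Bind processes to CPU cores
    export CPU_AFFINITY_CONF=1
    # Enable HCCL FFTS+ mode
    export ASCEND_ENHANCE_ENABLE=1

Open `utils/get_optimizer.py`:

    vi utils/get_optimizer.py

Replace the SGD optimizer with the NPU fused version:

    # Before
    optimizer = torch.optim.SGD(
    # After
    optimizer = torch_npu.optim.NpuFusedSGD(

Open `train.py`:

    vi train.py

Add the following at the training script entry point:

    torch_npu.npu.set_compile_mode(jit_compile=False)

Rerun training:

    ASCEND_RT_VISIBLE_DEVICES=2,3 python train.py --num-gpus 2 --config-file mot17_train_config.py

Training performance after optimization:

| Device | Training speed |
|---|---|
| NPU (2 dies) | 0.70 s/iteration |

Performance ratio: NPU = 1 × H100, a 14% improvement.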
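
The reported ratios can be reproduced from the iteration times. The guide does not state the formula; the numbers match if the two NPU dies are counted as one card and GPU throughput is assumed to scale linearly with card count, so one H100 would take twice the two-card time per iteration. Both points are assumptions, and `relative_to_h100` is a hypothetical helper name.

```python
def relative_to_h100(npu_s_per_iter, gpu_s_per_iter, gpu_cards=2):
    """NPU throughput relative to a single GPU card.

    Assumptions (not stated in the guide): the NPU figure is for one card
    (2 dies), and GPU time per iteration scales linearly with card count.
    """
    single_gpu_s_per_iter = gpu_s_per_iter * gpu_cards
    return single_gpu_s_per_iter / npu_s_per_iter


baseline = relative_to_h100(0.81, 0.35)   # before optimization
optimized = relative_to_h100(0.70, 0.35)  # after optimization
time_saved = (0.81 - 0.70) / 0.81         # relative cut in iteration time

print(round(baseline, 2), round(optimized, 2), round(time_saved * 100))
# prints: 0.86 1.0 14
```

Under these assumptions the 14% figure corresponds to the reduction in time per iteration from 0.81 s to 0.70 s.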