Environment information:
npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 25.3.rc1.b030 Version: 25.3.rc1.b030 |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page) |
| Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) |
+===============================+=================+======================================================+
| 2 310P3 | OK | NA 64 0 / 0 |
| 0 0 | 0000:01:00.0 | 0 1556 / 44280 |
+-------------------------------+-----------------+------------------------------------------------------+
| 2 310P3 | OK | NA 63 0 / 0 |
| 1 1 | 0000:01:00.0 | 0 1413 / 43693 |
+===============================+=================+======================================================+
| 5 310P3 | OK | NA 73 0 / 0 |
| 0 2 | 0000:81:00.0 | 0 1483 / 44280 |
+-------------------------------+-----------------+------------------------------------------------------+
| 5 310P3 | OK | NA 72 0 / 0 |
| 1 3 | 0000:81:00.0 | 0 1481 / 43693 |
+===============================+=================+======================================================+
| 6 310P3 | OK | NA 74 0 / 0 |
| 0 4 | 0000:82:00.0 | 0 1466 / 44280 |
+-------------------------------+-----------------+------------------------------------------------------+
| 6 310P3 | OK | NA 71 0 / 0 |
| 1 5 | 0000:82:00.0 | 0 1500 / 43693 |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===============================+=================+======================================================+
lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
Stepping: 0x1
Frequency boost: disabled
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
L1d: 6 MiB (96 instances)
L1i: 6 MiB (96 instances)
L2: 48 MiB (96 instances)
L3: 96 MiB (4 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
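CANN version: 8.2.RC1
Image used: mindie:2.1.RC1.B152-300I-Duo-py3.11-openeuler24.03-lts-aarch64
## Run npu-smi info; if card information is listed, NPU cards are present
npu-smi info
## Check network access by testing whether curl can reach Baidu
curl www.baidu.com
## If curl cannot connect, configure a proxy; for cntlm proxy configuration see https://3ms.huawei.com/km/blogs/details/21430554
## For example: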
CCW_HOST_IP=90.254.50.56
export http_proxy="http://${CCW_HOST_IP}:3128"
export https_proxy=${http_proxy}
export ftp_proxy=${http_proxy}
export GIT_SSL_NO_VERIFY=true
## Replace CCW_HOST_IP with the address of your VPN
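The proxy exports above are repeated later in these notes with a different address, so wrapping them in a small function avoids retyping them. The following is a minimal sketch, assuming a POSIX shell; `set_proxy` and `unset_proxy` are hypothetical helper names, and the default port 3128 is taken from the example above. It only covers the three proxy variables, not GIT_SSL_NO_VERIFY.

```shell
# Hypothetical helper: export the proxy variables for a given host and port.
# Usage: set_proxy <host_ip> [port]   (port defaults to 3128, as in the example)
set_proxy() {
    if [ -z "${1:-}" ]; then
        echo "usage: set_proxy <host_ip> [port]" >&2
        return 1
    fi
    http_proxy="http://$1:${2:-3128}"
    https_proxy="$http_proxy"
    ftp_proxy="$http_proxy"
    export http_proxy https_proxy ftp_proxy
}

# Hypothetical helper: remove the proxy variables again.
unset_proxy() {
    unset http_proxy https_proxy ftp_proxy
}
```

For example, `set_proxy 90.254.50.56` reproduces the three exports shown above, and `unset_proxy` clears them when direct access is needed.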
## Run docker -h to check whether docker is installed
docker -h
## If docker is not installed, install it
yum install docker
## Taking the A3 openEuler image as an example
docker pull mindie:2.1.RC1.B152-300I-Duo-py3.11-openeuler24.03-lts-aarch64
## Use docker images to list the images that have been pulled
####
[root@localhost ~]# docker images
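The manual checks so far (npu-smi info, docker -h) can be folded into one scripted check before a container is started. This is a minimal sketch, assuming a POSIX shell; `require_cmds` is a hypothetical helper name and not part of docker or the Ascend tooling.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `require_cmds docker npu-smi curl || echo "install the missing tools first"` lists everything that is absent in one pass instead of failing on the first missing tool.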
export IMAGE=mindie:dev-2.2.RC1.B110-300I-Duo-py311-ubuntu22.04-aarch64
docker run --privileged -u root --name mindie-ljh-1014 \
--device /dev/davinci4 --device /dev/davinci5 --device /dev/davinci6 --device /dev/davinci7 \
--device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 18050:8080 -it $IMAGE bash
yum install -y git
yum install -y patch
yum install -y unzip
yum install -y mesa-libGL
yum install gcc-c++
## Create a working directory of your own, e.g. /home/ljh, and cd /home/ljh
git clone https://gitee.com/ascend/ModelZoo-PyTorch.git
cd ModelZoo-PyTorch/ACL_PyTorch/built-in/cv/ViTDet_for_Pytorch
### If you run into git access permission problems, the following commands can resolve them
git config --global --unset http.proxy
git config --global --unset https.proxy
git clone https://github.com/open-mmlab/mmdetection.git
git clone https://github.com/open-mmlab/mmengine.git
cd mmdetection
git reset --hard cfd5d3a985b0249de009b67d04f37263e11cdf3d
pip3 install -r requirements.txt
cd ../mmengine
git reset --hard 390ba2fbb272816adfd2883642326d0fd0ca6049
pip3 install -r requirements.txt
cd ..
## Install mmcv from source
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v2.2.0
pip install -r requirements.txt
MMCV_WITH_OPS=1 pip install -e .
If you hit the error AttributeError: type object 'Callable' has no attribute '_abc_registry', the following commands resolve it:
rm -rf /usr/local/lib/python3.11/site-packages/typing.py
rm -rf /usr/local/lib/python3.11/site-packages/typing-*
# Make sure to run these with the system's python3.11
python3.11 -m pip uninstall -y typing
python3.11 -m pip uninstall -y typing_extensions
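If installation stalls at "Using cached fastmcp-2.9.0-py3-none-any.whl.metadata (17 kB)" or fails with errors such as aim<=3.17.5 failing to install, first comment out the offending lines as shown below,
then install those packages separately with pip install: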
vim requirements/tests.txt
#aim<=3.17.5;sys_platform!='win32'
bitsandbytes
clearml
coverage
dadaptation
dvclive
lion-pytorch
lmdb
#mlflow
parameterized
pydantic==1.10.9
pytest
transformers
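The same edit can be made without opening vim. The following is a minimal sketch, assuming a POSIX shell with sed; `comment_out_req` is a hypothetical helper name, and the package name is used as a regular-expression prefix, so it also matches any other line that begins with the same characters.

```shell
# Hypothetical helper: comment out every line of a requirements file that
# starts with the given package name, leaving all other lines untouched.
# Usage: comment_out_req <file> <package>
comment_out_req() {
    req_file="$1"
    pkg="$2"
    tmp_file="${req_file}.tmp.$$"
    # Prefix matching lines with '#'; lines that are already commented out
    # start with '#' and therefore no longer match.
    sed "s/^${pkg}/#&/" "$req_file" > "$tmp_file" && mv "$tmp_file" "$req_file"
}
```

For the listing above this would be `comment_out_req requirements/tests.txt aim` followed by `comment_out_req requirements/tests.txt mlflow`. Running it twice is harmless because commented lines are skipped.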
cp mmengine.patch mmengine/mmengine/
cp mmdet.patch mmdetection/
cp infer.py mmdetection/
cd mmengine/mmengine/
patch -p2 < mmengine.patch
pip install -v -e ..
cd ../../mmdetection
patch -p1 < mmdet.patch
## Fix when pip install -v -e .. fails with ImportError: cannot import name 'Logger' from partially initialized module 'logging' (most likely due to a circular import): run the install from the mmengine repository root instead
cd ViTDet_for_Pytorch/mmengine
pip install -v -e .
### Download the weight file and place it under mmdetection/ckpt
cd mmdetection
mkdir ckpt
cd ckpt
wget --no-check-certificate https://download.openmmlab.com/mmdetection/v3.0/vitdet/vitdet_mask-rcnn_vit-b-mae_lsj-100e/vitdet_mask-rcnn_vit-b-mae_lsj-100e_20230328_153519-e15fe294.pth
### Download the dataset
cd ..
mkdir data
cd data
mkdir coco
cd coco
# Download the validation images
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip
rm val2017.zip
# Download the annotations
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip
rm annotations_trainval2017.zip
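Because the archives are deleted right after extraction, it is worth confirming that the extraction succeeded before moving on. This is a minimal sketch, assuming a POSIX shell; `check_coco_layout` is a hypothetical helper name. It assumes inference needs the val2017 image directory and annotations/instances_val2017.json, which the two archives above unpack to; the exact files read by the ViTDet config are not stated in these notes.

```shell
# Hypothetical helper: verify that a COCO directory contains the pieces
# produced by the two unzip commands above.
# Usage: check_coco_layout [coco_dir]   (defaults to data/coco)
check_coco_layout() {
    coco_dir="${1:-data/coco}"
    status=0
    if [ ! -d "$coco_dir/val2017" ]; then
        echo "missing directory: $coco_dir/val2017" >&2
        status=1
    fi
    if [ ! -f "$coco_dir/annotations/instances_val2017.json" ]; then
        echo "missing file: $coco_dir/annotations/instances_val2017.json" >&2
        status=1
    fi
    return "$status"
}
```

Run from the mmdetection directory, `check_coco_layout data/coco` returns non-zero and names each missing piece if a download or unzip was interrupted.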
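11 Model inference
Before running inference on a 310P environment, make the following changes:
The installed numpy version is too new and inference fails with it, so numpy must be downgraded: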
pip uninstall numpy
pip install numpy==1.24.4
pip install future tensorboard
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python infer.py --cfg projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py --ckpt "ckpt/vitdet_mask-rcnn_vit-b-mae_lsj-100e_20230328_153519-e15fe294.pth" --warm_up_times 1
## First uninstall torch and related packages
pip uninstall torch torchvision torch_npu
## Then reinstall torch and related packages
pip install torch==2.4.0 torch_npu==2.4.0post4 torchvision==0.19.0 torchaudio==2.4.0
## Reference: https://www.hiascend.com/document/detail/zh/Pytorch/710/releasenote/releasenote_0003.html
CCW_HOST_IP=141.4.102.160 # 141.4.124.251
export http_proxy="http://${CCW_HOST_IP}:3128"
export https_proxy=${http_proxy}
export ftp_proxy=${http_proxy}
export GIT_SSL_NO_VERIFY=true
Reference tutorial: gitee.com
## Contents of start-docker.sh:
if [ $# -ne 2 ]; then
echo "error: need two arguments: the image ID and the container name."
echo "usage: bash start-docker.sh <IMAGES_ID> <NAME>"
exit 1
fi
IMAGES_ID=$1
NAME=$2
docker run --name ${NAME} -it -d --net=host --shm-size=500g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
${IMAGES_ID}
bash start-docker.sh 5718854ec902 dev2_rc1_b010
Here 5718854ec902 is IMAGES_ID and dev2_rc1_b010 is NAME
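start-docker.sh relies on --privileged=true and does not list individual NPU devices, whereas the earlier single-line docker run command passes each /dev/davinciN explicitly. When only a subset of cards should be exposed, the --device flags can be generated from a list of indices. This is a minimal sketch, assuming a POSIX shell; `davinci_device_flags` is a hypothetical helper name.

```shell
# Hypothetical helper: print one "--device /dev/davinciN" flag for each
# device index given on the command line, separated by single spaces.
davinci_device_flags() {
    flags=""
    for idx in "$@"; do
        # Add a separating space only when flags is already non-empty.
        flags="${flags:+$flags }--device /dev/davinci$idx"
    done
    printf '%s\n' "$flags"
}
```

For example, `docker run $(davinci_device_flags 4 5 6 7) ...` expands to the four --device flags used in the earlier command; the substitution is left unquoted on purpose so that the shell splits it into separate arguments.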
pip3 install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r requirements.txt
Or: pip3 install *** -i https://pypi.tuna.tsinghua.edu.cn/simple
Fix for the error: downgrade the numpy package
mkdir -p ~/.pip
vim ~/.pip/pip.conf
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
timeout = 6000
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn
Results on 910B with pse set to None: (a 30% drop compared with the above)