| Item | Version / Spec |
|---|---|
| OS / Architecture | openEuler 24.03 (LTS) / aarch64 |
| Driver / Firmware | 25.2.0 / 7.7.0.3.220 |
| CANN | 8.5.0 |
| Python | 3.11.6 |
| torch / torch_npu | 2.1.0 / 2.1.0.post18 |
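
Download the MindIE image from AscendHub, choosing the version that matches your hardware and system architecture. This document uses 2.3.0-800I-A2-py311-openeuler24.03-lts.
Create the Docker container, changing the image and container names as needed: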
IMAGE=swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.3.0-800I-A2-py311-openeuler24.03-lts
NAME=mindie-2.3.0
docker run -itd --net=host \
--privileged \
--ipc=host \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-w /workspace \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf:ro \
--name $NAME \
$IMAGE /bin/bash

Enter the Docker container:
docker exec -it mindie-2.3.0 bash

Install the Python dependencies:
pip install -U openmim
# Pin numpy below 2; otherwise version incompatibilities occur
cd /workspace
echo "numpy<2" > constraints.txt
mim install mmengine -c constraints.txt
# Building and installing this package takes a while
mim install mmpretrain -c constraints.txt
# Install the remaining dependencies
pip uninstall -y opencv-python opencv-python-headless
pip install -U opencv-python-headless -c constraints.txt
pip install aenum
pip install onnxruntime

Download the MMPreTrain and MMDeploy source code, which is used in the later steps:
cd /workspace
git clone https://github.com/open-mmlab/mmpretrain
git clone https://github.com/open-mmlab/mmdeploy.git

Download the CIFAR-10 dataset and the DINOv2 base model weights:
cd /workspace
mkdir dataset
cd dataset
# Download the dataset
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
# Extract it
tar xzf cifar-10-python.tar.gz
# Download the base model
cd /workspace
mkdir models && cd models
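wget https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth

The key names in the original DINOv2 checkpoint differ from the names used inside the MMPreTrain framework, so a conversion script is needed. Create the script file convert_dinov2_ckpt.py and run the following command, where --src_ckpt is the original checkpoint and --dst_ckpt is the converted checkpoint: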
python convert_dinov2_ckpt.py \
--src_ckpt /workspace/models/dinov2_vits14_pretrain.pth \
--dst_ckpt /workspace/models/dinov2_vits14_pretrain_mmpretrain.pth

Write the training config file dinov2_train.py. The main settings involved are listed below; see the file itself for details:
mkdir /workspace/train
vim /workspace/train/dinov2_train.py

Run the following command to start training on the NPU in the background. ASCEND_RT_VISIBLE_DEVICES selects the card ID(s) used for training:
nohup bash -c '
ASCEND_RT_VISIBLE_DEVICES=1 \
python /workspace/mmpretrain/tools/train.py /workspace/train/dinov2_train.py' \
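> /workspace/train/train.log 2>&1 &

Check the single-card training status:
npu-smi info

Log output:

The following command trains on 4 cards (0, 1, 2, 3) using port 29501. The trailing argument 4 is the number of cards used for training: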
nohup bash -c '
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
PORT=29501 \
bash /workspace/mmpretrain/tools/dist_train.sh /workspace/train/dinov2_train.py 4' \
> /workspace/train/train.log 2>&1 &

Check the multi-card training status:

View the checkpoint files saved after training:
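
The file listing itself is not reproduced here. Because the checkpoint name is needed again in the ONNX conversion step, a small helper can locate it instead of copying it by hand. The sketch below assumes the best checkpoint follows the `best_<metric>_epoch_<N>.pth` naming seen in this guide (for example `best_accuracy_top1_epoch_99.pth`); `find_best_checkpoint` is a hypothetical helper name, not an MMPreTrain API.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the example file name used in this guide.
BEST_CKPT_RE = re.compile(r"^best_.*_epoch_(\d+)\.pth$")


def find_best_checkpoint(work_dir):
    """Return the best_*.pth file with the highest epoch number, or None."""
    candidates = []
    for path in Path(work_dir).glob("best_*.pth"):
        match = BEST_CKPT_RE.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare on the epoch number only.
    return max(candidates, key=lambda item: item[0])[1]
```

Calling `find_best_checkpoint("/workspace/train/work_dirs/dinov2_cifar10_pretrain_mmpretrain")` would then return the path to pass to the deployment tool, provided the work directory matches the one your config uses.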
Plot the loss values: the loss converges normally and accuracy rises steadily:
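
The plot itself is not reproduced here. As a minimal sketch of how such a curve could be produced, the snippet below extracts the mean training loss per epoch from train.log. The line layout matched by the regex (`Epoch(train) [epoch][iter/total] ... loss: value`) is an assumption about the default MMEngine text log, and `parse_epoch_losses` and `plot_losses` are hypothetical helper names. Plotting assumes matplotlib is installed, which this guide does not do.

```python
import re
from collections import defaultdict

# Assumed log layout: "... Epoch(train) [3][100/391] ... loss: 1.2345"
TRAIN_LINE_RE = re.compile(
    r"Epoch\(train\)\s*\[(\d+)\]\[\s*\d+/\d+\].*?\bloss:\s*"
    r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
)


def parse_epoch_losses(lines):
    """Map epoch number -> mean of the logged training loss values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        match = TRAIN_LINE_RE.search(line)
        if match:
            epoch = int(match.group(1))
            sums[epoch] += float(match.group(2))
            counts[epoch] += 1
    return {epoch: sums[epoch] / counts[epoch] for epoch in sorted(sums)}


def plot_losses(epoch_losses, out_png):
    """Save a loss-vs-epoch curve as a PNG (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # no display is available inside the container
    import matplotlib.pyplot as plt

    epochs = list(epoch_losses)
    plt.figure()
    plt.plot(epochs, [epoch_losses[e] for e in epochs], marker="o")
    plt.xlabel("epoch")
    plt.ylabel("mean training loss")
    plt.savefig(out_png)
    plt.close()
```

A typical use would be to open `/workspace/train/train.log`, pass the file object to `parse_epoch_losses`, and hand the result to `plot_losses`. If your log format differs, only the regex needs adjusting.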

Create a deployment config file based on the original one and add the shape information:
cp /workspace/mmdeploy/configs/mmpretrain/classification_onnxruntime_static.py /workspace/mmdeploy/configs/mmpretrain/classification_onnxruntime_static_224-224.py
vim /workspace/mmdeploy/configs/mmpretrain/classification_onnxruntime_static_224-224.py

The modified file looks like this:
_base_ = ['./classification_static.py', '../_base_/backends/onnxruntime.py']
onnx_config = dict(input_shape=[224, 224])

Use mmdeploy to convert the .pth weights file to an ONNX model. Note that best_accuracy_top1_epoch_99.pth below must be replaced with the file name actually produced by your training run:
export PYTHONPATH=/workspace/mmdeploy:$PYTHONPATH
python /workspace/mmdeploy/tools/deploy.py \
/workspace/mmdeploy/configs/mmpretrain/classification_onnxruntime_static_224-224.py \
/workspace/train/dinov2_train.py \
/workspace/train/work_dirs/dinov2_cifar10_pretrain_mmpretrain/best_accuracy_top1_epoch_99.pth \
/workspace/mmpretrain/demo/bird.JPEG \
--work-dir /workspace/convert_onnx \
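--show

Use the atc tool to convert the ONNX model to an OM file. Note that --input_shape must match the shape configured for the ONNX conversion above, and --soc_version must be changed to the model of your NPU card: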
atc --model=/workspace/convert_onnx/end2end.onnx \
--framework=5 \
--output=/workspace/convert_om \
--input_format=NCHW \
--input_shape="input:1,3,224,224" \
--soc_version=Ascend910B3 \
--precision_mode=allow_mix_precision

Check the conversion result:

Install ais_bench:
yum install -y zlib-devel
cd /workspace
pip install -v 'git+https://gitee.com/ascend/tools.git#egg=aclruntime&subdirectory=ais-bench_workload/tool/ais_bench/backend' -c constraints.txt
pip install -v 'git+https://gitee.com/ascend/tools.git#egg=ais_bench&subdirectory=ais-bench_workload/tool/ais_bench' -c constraints.txt

Create the inference script infer_by_om_model.py and run it:
mkdir /workspace/infer
vim /workspace/infer/infer_by_om_model.py
python /workspace/infer/infer_by_om_model.py \
-m /workspace/convert_om.om \
-i /workspace/mmpretrain/demo/dog.jpg \
-o /workspace/om_infer_output

Check the inference result:

Symptom: the following key-mismatch warnings appear during training:
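
The output of the script is not reproduced here. As a rough sketch of how raw model output can be read, the snippet below turns ten class scores into CIFAR-10 labels. It assumes the scores are ordered by the standard CIFAR-10 class indices. Whether the exported model emits logits or already-normalized probabilities depends on the export, so the softmax is optional. `top_k` is a hypothetical helper, not part of ais_bench or MMPreTrain.

```python
import math

# Standard CIFAR-10 class order (label index 0..9).
CIFAR10_CLASSES = (
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
)


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k=3, apply_softmax=True):
    """Return the k highest-scoring (class_name, score) pairs.

    Set apply_softmax=False if the model output is already probabilities.
    """
    if len(scores) != len(CIFAR10_CLASSES):
        raise ValueError("expected one score per CIFAR-10 class")
    values = softmax(scores) if apply_softmax else list(scores)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return [(CIFAR10_CLASSES[i], values[i]) for i in order[:k]]
```

For the dog.jpg demo image, a well-trained model would be expected to rank `dog` first, although the actual scores depend on the training run.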
04/07 09:03:30 - mmengine - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: backbone.cls_token, backbone.pos_embed, backbone.mask_token, backbone.patch_embed.projection.weight, backbone.patch_embed.projection.bias, backbone.layers.0.ln1.weight, backbone.layers.0.ln1.bias, backbone.layers.0.attn.qkv.weight, backbone.layers.0.attn.qkv.bias, backbone.layers.0.attn.proj.weight, backbone.layers.0.attn.proj.bias, backbone.layers.0.attn.gamma1.weight, backbone.layers.0.ln2.weight, backbone.layers.0.ln2.bias, backbone.layers.0.ffn.layers.0.0.weight, backbone.layers.0.ffn.layers.0.0.bias, backbone.layers.0.ffn.layers.1.weight...
...
missing keys in source state_dict: cls_token, pos_embed, patch_embed.projection.weight, patch_embed.projection.bias, layers.0.ln1.weight, layers.0.ln1.bias, layers.0.attn.qkv.weight, layers.0.attn.qkv.bias, layers.0.attn.proj.weight, layers.0.attn.proj.bias, layers.0.attn.gamma1.weight, layers.0.ln2.weight, layers.0.ln2.bias, layers.0.ffn.layers.0.0.weight, layers.0.ffn.layers.0.0.bias, layers.0.ffn.layers.1.weight, layers.0.ffn.layers.1.bias, layers.0.ffn.gamma2.weight, layers.1.ln1.weight, layers.1.ln1.bias...

Possible cause: the key names in the original DINOv2 checkpoint differ from the names used inside the MMPreTrain framework.
Solution: write a script that converts the keys of the pretrained model and saves the result (see step 5.2, "Convert the pretrained model key names", above), then retrain from the converted model.
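
The conversion script itself is not shown in this guide. As a minimal sketch of what the key renaming could look like, the snippet below maps DINOv2-style names to the MMPreTrain-style names that appear in the warning log above (`layers.N.ln1`, `attn.gamma1.weight`, `ffn.layers.0.0`, and so on). The DINOv2-side names (`blocks.N`, `norm1`, `ls1.gamma`, `mlp.fc1`, ...) and the rule table are assumptions, not the contents of convert_dinov2_ckpt.py. Whether a `backbone.` prefix is needed depends on how the training config loads the checkpoint, so it is left as a parameter.

```python
import re

# Assumed (pattern, replacement) rules, applied in order.
RENAME_RULES = [
    (r"^patch_embed\.proj\.", "patch_embed.projection."),
    (r"^blocks\.(\d+)\.", r"layers.\1."),
    (r"\.norm1\.", ".ln1."),
    (r"\.norm2\.", ".ln2."),
    (r"\.ls1\.gamma$", ".attn.gamma1.weight"),
    (r"\.ls2\.gamma$", ".ffn.gamma2.weight"),
    (r"\.mlp\.fc1\.", ".ffn.layers.0.0."),
    (r"\.mlp\.fc2\.", ".ffn.layers.1."),
    (r"^norm\.", "ln1."),  # final norm; block keys start with "layers." by now
]


def rename_key(key, prefix="backbone."):
    """Translate one DINOv2-style key into an MMPreTrain-style key."""
    for pattern, replacement in RENAME_RULES:
        key = re.sub(pattern, replacement, key)
    return prefix + key


def convert_state_dict(state_dict, prefix="backbone."):
    """Return a new dict with every key renamed; values are untouched."""
    return {rename_key(k, prefix): v for k, v in state_dict.items()}
```

In a real script, the input dict would typically come from `torch.load(src_ckpt, map_location="cpu")` and the result would be written with `torch.save({"state_dict": converted}, dst_ckpt)`. After conversion, it is worth re-running training briefly and confirming that the warning above no longer lists backbone keys.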