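PocketXMol: Migration and Adaptation to Ascend

I. Model Overview and Use Cases

Open-source repository: https://github.com/pengxingang/PocketXMol

Paper (published by Tsinghua University in Cell in February): https://hub.baai.ac.cn/view/52822

PocketXMol is an atom-level generative foundation model for protein pocket-molecule interactions. Its core is an E(3)-equivariant geometric graph neural network denoiser that supports several equivariant GNN implementations (including variants based on Geometric Vector Perceptrons, GVP) and can optionally integrate frame-based, Invariant Point Attention (IPA) style geometric attention modules in the pocket encoder and the denoising network. Generation uses a diffusion-like multi-step denoising framework, but unlike standard diffusion models it does not rely on a predefined noise distribution or diffusion time steps. Instead, it identifies and removes noise automatically through unified task prompts, which lets a single model cover multiple pocket-related tasks such as small-molecule docking and design and peptide design.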

All code changes described here are included in PXM_all.patch; apply them by running: git apply PXM_all.patch

II. Environment Setup

Use the official CANN 8.5 image:

docker pull quay.io/ascend/cann:8.5.0-a3-ubuntu22.04-py3.11

export IMAGE=quay.io/ascend/cann:8.5.0-a3-ubuntu22.04-py3.11
docker run -it -d --net=host \
    --name PXM_test \
    --shm-size=1g \
    --privileged \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/share_space/XXX/:/mnt/share_space/XXX/ \
    -it $IMAGE bash

Enter the container, choose a working directory, and clone the repository:

git clone https://github.com/pengxingang/PocketXMol.git

Enter the PocketXMol directory, then create and activate a Python virtual environment named PXM:

python -m venv PXM
source PXM/bin/activate

Install the dependencies:

pip install torch==2.7.1
pip install torch-npu==2.7.1
pip install pytorch-lightning==2.6.1
pip install torch_geometric==2.7.0

# Bio/Chem informatics
pip install biopython==1.83 rdkit==2023.9.3 peptidebuilder==1.1.0
pip install openbabel-wheel==3.1.1.23  # or: conda install -c conda-forge openbabel -y

# Utilities
pip install lmdb==1.7.5 easydict==1.9 numpy==1.24 pandas==1.5.2 scipy==1.10.1
pip install tensorboard==2.20.0  # for training only
pip install decorator
pip install pyyaml
# PyG extensions (must match torch version + CUDA tag); on an NPU host, setup.py builds the CPU version by default
# pip install torch_scatter torch_sparse torch_cluster

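Install the PyG extensions

All PyG extension libraries must be built and installed from source; do not install prebuilt wheels (for example, pip install torch_scatter -f https://data.pyg.org/whl/...). The prebuilt PyPI / PyG packages are built for CUDA or x86 CPUs and may have ABI incompatibilities or missing operators in an aarch64 + NPU environment, so build them from source: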

source /usr/local/Ascend/ascend-toolkit/set_env.sh

git clone https://github.com/rusty1s/pytorch_scatter.git
git clone https://github.com/rusty1s/pytorch_cluster.git
git clone https://github.com/rusty1s/pytorch_sparse.git

cd pytorch_sparse
git submodule update --init --recursive
cd ..

# Build and install each package from its own directory
cd pytorch_scatter && python setup.py bdist_wheel && pip install dist/*.whl --force-reinstall && cd ..
cd pytorch_cluster && python setup.py bdist_wheel && pip install dist/*.whl --force-reinstall && cd ..
cd pytorch_sparse  && python setup.py bdist_wheel && pip install dist/*.whl --force-reinstall && cd ..

After building from source, these extensions support the aarch64 CPU but not the NPU, so the related operators must run on the CPU first.
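
Before running inference, it can help to confirm that every package installed above is visible in the active virtual environment. The sketch below uses only the standard library; the module list is taken from the install steps above, and the helper name check_packages is hypothetical.

```python
import importlib.util

# Top-level module names of the packages installed in the steps above.
REQUIRED = ["torch", "torch_npu", "torch_geometric",
            "torch_scatter", "torch_cluster", "torch_sparse"]


def check_packages(names):
    """Map each module name to True if it can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:16s} {'ok' if found else 'MISSING'}")
```

This only checks that the modules can be found; it does not verify that the compiled extensions load or that an NPU is reachable.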

Download the inference weights

wget -c https://zenodo.org/records/17801271/files/model_weights.tar.gz
tar -zxvf model_weights.tar.gz

Download the benchmark test sets

wget -c https://zenodo.org/records/17801271/files/data_test.tar.gz
tar -zxvf data_test.tar.gz

III. Running Inference

In scripts/sample_use.py, import torch_npu to enable the migration:

import torch_npu
from torch_npu.contrib import transfer_to_npu

Change the torch_scatter and torch_cluster calls involved so that they run on the CPU, then move the results back to the NPU.

For example:

diff --git a/models/graph_context.py b/models/graph_context.py
index ef40f69..14e5b31 100644
--- a/models/graph_context.py
+++ b/models/graph_context.py
@@ -378,8 +378,15 @@ class ContextNodeEdgeNet(Module):
                 pos_ctx_noised = pos_ctx + torch.randn_like(pos_ctx) * 5  # works like masked position information
             else:
                 pos_ctx_noised = pos_ctx
-            ctx_knn_edge_index = knn(y=pos, x=pos_ctx_noised, k=self.knn,
-                                    batch_x=batch_ctx, batch_y=batch_node)
+            # ctx_knn_edge_index = knn(y=pos, x=pos_ctx_noised, k=self.knn,
+            #                         batch_x=batch_ctx, batch_y=batch_node)
+            ctx_knn_edge_index = knn(
+    y=pos.cpu(), 
+    x=pos_ctx_noised.cpu(), 
+    k=self.knn,
+    batch_x=batch_ctx.cpu(),
+    batch_y=batch_node.cpu()
+).to(pos.device)
         else: # fully connected x-yf
             device = pos.device
             ctx_knn_edge_index = []
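
The pattern in this diff (move the inputs to the CPU, run the operator, move the result back) repeats for every torch_cluster and torch_scatter call. As a sketch only, it could be factored into a small wrapper. on_cpu is a hypothetical helper that is not part of PocketXMol or of PXM_all.patch, and it is duck-typed: any object with .cpu() and .device is treated as a tensor.

```python
import functools


def on_cpu(fn):
    """Wrap fn so tensor-like arguments are moved to the CPU before the call
    and the result is moved back to the device of the first tensor-like argument.

    Hypothetical helper: an object counts as tensor-like if it has .cpu() and .device.
    """
    def is_tensor(x):
        return hasattr(x, "cpu") and hasattr(x, "device")

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tensors = [a for a in (*args, *kwargs.values()) if is_tensor(a)]
        if not tensors:
            return fn(*args, **kwargs)
        device = tensors[0].device  # device the result is returned to
        cpu_args = [a.cpu() if is_tensor(a) else a for a in args]
        cpu_kwargs = {k: v.cpu() if is_tensor(v) else v for k, v in kwargs.items()}
        out = fn(*cpu_args, **cpu_kwargs)
        return out.to(device) if hasattr(out, "to") else out

    return wrapper
```

With torch_cluster installed, knn_cpu = on_cpu(knn) could then be called with the original keyword arguments (y=pos, x=pos_ctx_noised, k=self.knn, batch_x=batch_ctx, batch_y=batch_node). This assumes all tensor inputs live on the same device, as they do in the diff.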

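Common arguments of the unified generation script scripts/sample_use.py (from the Quick Start documentation):

--config_task <path> (required): task configuration YAML; determines the task type (docking / SBDD / fragment linking / peptide design / optimization, etc.).
--config_model <path> (optional, default configs/sample/pxm.yml): model configuration (weights path, network structure); usually does not need to be changed.
--outdir <dir> (optional, default ./outputs_use): output root directory.
--device <str> (optional, default cuda:0).
--batch_size <int> (optional): overrides sample.batch_size from the YAML, to avoid out-of-memory (OOM) errors.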
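
To make the defaults concrete, the arguments listed above could be declared with argparse roughly as follows. This is a sketch based only on that list; the real parser in scripts/sample_use.py may define more options or different types, and the None default for --batch_size is an assumption.

```python
import argparse


def build_parser():
    """Sketch of a parser mirroring the documented sample_use.py arguments."""
    p = argparse.ArgumentParser(description="Sketch of sample_use.py arguments")
    p.add_argument("--config_task", required=True, help="task config YAML")
    p.add_argument("--config_model", default="configs/sample/pxm.yml",
                   help="model config (weights path, network structure)")
    p.add_argument("--outdir", default="./outputs_use", help="output root directory")
    p.add_argument("--device", default="cuda:0", help="use npu:0 on Ascend")
    # Assumed default: None means "keep sample.batch_size from the YAML".
    p.add_argument("--batch_size", type=int, default=None,
                   help="overrides sample.batch_size from the YAML when set")
    return p


args = build_parser().parse_args(
    ["--config_task", "configs/sample/examples/dock_smallmol.yml",
     "--device", "npu:0"]
)
print(args.config_model, args.outdir, args.device, args.batch_size)
```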

All example configurations are located in the configs/sample/examples/ directory.
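
Since every example is launched with the same script and differs only in its task config, the commands in the following subsections can be generated in a loop. The sketch below only builds the argument lists and does not launch anything; build_commands is a hypothetical helper, and the config names are the ones used in this document.

```python
EXAMPLES = [
    "dock_smallmol", "dock_pep", "sbdd", "linking_fixed_frags",
    "growing_fixed_frag", "opt_mol", "pepdesign_denovo", "pepdesign_invfold",
]


def build_commands(names, outdir="outputs_examples", device="npu:0"):
    """Return one argv list per example config."""
    return [
        ["python", "scripts/sample_use.py",
         "--config_task", f"configs/sample/examples/{name}.yml",
         "--config_model", "configs/sample/pxm.yml",
         "--outdir", outdir, "--device", device]
        for name in names
    ]


for cmd in build_commands(EXAMPLES):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root to run the examples one after another.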

1. Small-molecule docking

python scripts/sample_use.py --config_task configs/sample/examples/dock_smallmol.yml --outdir outputs_examples --device npu:0

2. Peptide docking

python scripts/sample_use.py --config_task configs/sample/examples/dock_pep.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

3. Small-molecule design

3.1 Structure-based drug design (SBDD):

python scripts/sample_use.py --config_task configs/sample/examples/sbdd.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

The radius call in utils/sample_noise.py is moved to the CPU in the same way:

diff --git a/utils/sample_noise.py b/utils/sample_noise.py
index 870af54..73726ea 100644
--- a/utils/sample_noise.py
+++ b/utils/sample_noise.py
@@ -1420,8 +1420,15 @@ class MaskfillSampleNoiser(BaseSampleNoiser):
             center_pos = node_pos[is_center_p2]
             batch_center = batch_node[is_center_p2]
             # select neighbor
-            assign_index = radius(x=node_pos, y=center_pos, r=r,
-                                  batch_x=batch_node, batch_y=batch_center)
+            # assign_index = radius(x=node_pos, y=center_pos, r=r,
+            #                       batch_x=batch_node, batch_y=batch_center)
+            assign_index = radius(
+    x=node_pos.cpu(), 
+    y=center_pos.cpu(), 
+    r=r,
+    batch_x=batch_node.cpu(),
+    batch_y=batch_center.cpu()
+).to(node_pos.device)

3.2 Fragment linking

python scripts/sample_use.py --config_task configs/sample/examples/linking_fixed_frags.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

3.3 Fragment growing

python scripts/sample_use.py --config_task configs/sample/examples/growing_fixed_frag.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

3.4 Molecular optimization

python scripts/sample_use.py --config_task configs/sample/examples/opt_mol.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

4. Peptide design

De novo peptide generation:

python scripts/sample_use.py --config_task configs/sample/examples/pepdesign_denovo.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

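Constrained peptide design (some residues or residue types fixed):

pepdesign_fix_pos_and_type.yml: some residues are fully fixed; others have only their type fixed while their positions remain free

Inverse folding: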

python scripts/sample_use.py --config_task configs/sample/examples/pepdesign_invfold.yml --config_model configs/sample/pxm.yml --outdir outputs_examples --device npu:0

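5. Confidence scoring

Register EasyDict as a safe global before the checkpoint is loaded, so that torch.load does not fail when unpickling it:

import torch
from easydict import EasyDict

# Add safe globals for easydict.EasyDict to avoid unpickling error
torch.serialization.add_safe_globals([EasyDict])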
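
The allowlist idea behind add_safe_globals can be illustrated with the standard library alone. The sketch below is an analogy, not PyTorch's implementation: a pickle.Unpickler subclass refuses any class that has not been explicitly allowed, which is roughly why a checkpoint containing an EasyDict fails to load until that class is registered. collections.OrderedDict stands in for EasyDict here, and AllowlistUnpickler and safe_loads are hypothetical names.

```python
import io
import pickle
from collections import OrderedDict


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves the classes listed in `allowed`."""

    def __init__(self, file, allowed=()):
        super().__init__(file)
        self.allowed = {(c.__module__, c.__qualname__): c for c in allowed}

    def find_class(self, module, name):
        try:
            return self.allowed[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(f"{module}.{name} is not allowlisted")


def safe_loads(data, allowed=()):
    """Unpickle data, permitting only the given classes."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(OrderedDict(lr=1e-3))  # stand-in for a pickled config

try:
    safe_loads(payload)  # OrderedDict has not been allowed yet
except pickle.UnpicklingError as err:
    print("rejected:", err)

print(safe_loads(payload, allowed=[OrderedDict]))  # loads once allowlisted
```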

python scripts/believe_use_pdb.py --exp_name pepdesign_denovo_pxm --result_root outputs_examples --config configs/sample/confidence/tuned_cfd.yml --device npu:0

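--exp_name: a substring of the experiment directory name (for example, pepdesign_denovo_pxm matches pepdesign_denovo_pxm_xxx).
--result_root: root directory of the generated outputs (corresponds to --outdir above).
--config: the ranking configuration; the documentation provides two:
- configs/sample/confidence/tuned_cfd.yml (tuned ranking model);
- configs/sample/confidence/flex_cfd.yml (scores with flexible-docking noise).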
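
Because --exp_name is matched as a substring of the experiment directory names under --result_root, it is easy to check in advance which directories a given value would select. A minimal sketch, assuming a flat layout with one directory per experiment (find_experiments is a hypothetical helper; the actual script may resolve names differently):

```python
from pathlib import Path


def find_experiments(result_root, exp_name):
    """Return sorted names of sub-directories of result_root that contain exp_name."""
    root = Path(result_root)
    return sorted(p.name for p in root.iterdir() if p.is_dir() and exp_name in p.name)
```

For example, find_experiments("outputs_examples", "pepdesign_denovo_pxm") lists every output directory whose name contains that string.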

IV. Benchmark Reproduction and Performance Testing

https://github.com/pengxingang/PocketXMol/blob/master/docs/sample_test_sets.md

In scripts/sample_drug3d.py, add the imports:

import torch_npu
from torch_npu.contrib import transfer_to_npu

1. Small-molecule docking (PoseBusters)

Dataset: 428 protein-ligand pairs.

1.1 Sampling:

python scripts/sample_drug3d.py --config_task configs/sample/test/dock_poseboff/base.yml --outdir outputs_test/dock_posebusters_npu --device npu:0

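After the first pass finishes, a DataLoader worker may segfault during the second pass or partway through a run. This is a typical conflict between fork-based multiprocessing and the NPU driver or the native LMDB library.

spawn mode cannot be used here, because it requires the Dataset object to be pickled and sent to the child processes, and the LMDB Environment object does not support pickling. To run without multiprocess workers instead, find the num_workers argument where test_loader (the DataLoader) is created in sample_drug3d.py and set it to 0.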
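
The reason spawn fails can be reproduced without LMDB. In the sketch below a threading.Lock stands in for the open LMDB Environment handle (both are native resources that pickle cannot serialize); HandleDataset and can_pickle are hypothetical names, not PocketXMol code. Under spawn, the dataset would have to be pickled in the same way before being sent to a worker.

```python
import pickle
import threading


class HandleDataset:
    """Toy dataset that holds an unpicklable native handle."""

    def __init__(self):
        self.handle = threading.Lock()  # stand-in for an open LMDB Environment


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


print(can_pickle(HandleDataset()))
```

Setting num_workers to 0 sidesteps the problem because the dataset never leaves the main process.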

Measured runtime averaged 3 min 16 s per round.

1.2 Confidence scoring:

Add weights_only=False to the torch.load() call.

Set the DataLoader num_workers argument to 0.

python scripts/believe.py --exp_name base_pxm --result_root outputs_test/dock_posebusters_npu --config configs/sample/confidence/tuned_cfd.yml --device npu:0


1.3 Ranking (rank_pose.py):

python scripts/rank_pose.py --exp_name base_pxm --result_root outputs_test/dock_posebusters_npu --db poseboff


1.4 RMSD evaluation:

python evaluate/evaluate_dock.py --exp_name base_pxm --result_root outputs_test/dock_posebusters_npu --db poseboff --use_repeats 50

With the seed changed to 2026, the test results are 0.838785/0.841121/0.962617. [Figure: comparison with the results reported in the paper]
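
If the three reported values are success fractions over the 428 PoseBusters complexes (an assumption; the text does not name the metrics), each one should correspond to a whole number of complexes. That gives a quick sanity check on the evaluation output:

```python
N_COMPLEXES = 428  # size of the PoseBusters set used above
reported = [0.838785, 0.841121, 0.962617]

# Convert each fraction back to a count and check that it round-trips.
counts = [round(r * N_COMPLEXES) for r in reported]
for r, c in zip(reported, counts):
    assert abs(c / N_COMPLEXES - r) < 1e-6
print(counts)  # [359, 360, 412]
```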

2. Peptide docking (PepBDB, 79 linear peptides)

In scripts/sample_pdb.py, add the imports:

import torch_npu
from torch_npu.contrib import transfer_to_npu

Set num_workers to 0.

python scripts/sample_pdb.py --config_task configs/sample/test/dock_pepbdb/base.yml --outdir outputs_test/dock_pepbdb --device npu:0

Each round takes 6 min 04 s.

V. Training

Download the training data

wget -c https://zenodo.org/records/17801271/files/data_train_processed.tar.gz
tar -zxvf data_train_processed.tar.gz

Lightning calls torch.cuda.get_device_capability; after migration, this call returns None when running on the Ascend NPU platform.
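
A None return is a problem for any caller that unpacks the result as a (major, minor) tuple. The sketch below only illustrates that failure mode and one generic guard; it is not the accelerator registration used for this adaptation. with_fallback and npu_capability are hypothetical names, and the (0, 0) fallback is an assumed placeholder value.

```python
import functools


def with_fallback(fn, fallback=(0, 0)):
    """Wrap fn so that a None result is replaced by `fallback`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        return fallback if result is None else result
    return wrapper


def npu_capability(device=None):
    """Stand-in for the migrated call, which returns None on Ascend."""
    return None


safe_capability = with_fallback(npu_capability)
major, minor = safe_capability()  # unpacking None directly would raise TypeError
print(major, minor)
```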

For the Lightning adaptation, see the section on registering an npu accelerator for PyTorch Lightning in https://gitcode.com/AI4Science/AscendSkills/blob/main/models/boltz2/SKILL.md; here it was implemented with an agent.

Command after adaptation:

python scripts/train_pl.py --config configs/train/train_pxm_reduced.yml --num_gpus 1 --accelerator npu


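Summary of the train_pl.py adaptation changes:

1. New file

models/npu_accelerator.py: registers the Lightning NPU accelerator

2. Changes to scripts/train_pl.py

Location	Change
Imports	Add the torch_npu and transfer_to_npu imports
Command-line arguments	Add the --accelerator argument (gpu/cpu/npu)
Trainer creation	Register the NPU accelerator, add an NPU strategy adaptation (DDPStrategy), and add a mixed-precision plugin
DataModule.setup	In NPU mode, set num_workers=0 automatically and ensure effective_num_workers>=1 to avoid division by zero
DataLoader	In NPU mode, disable persistent_workers to avoid segmentation faults
on_before_optimizer_step	Fix the device of the returned gradient norms (move norms to the NPU device before taking the maximum)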
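
The DataModule.setup row guards against a division by zero: with num_workers=0, any per-worker split still has to divide by at least one. How train_pl.py actually uses the value is not shown here, so the following is only a sketch; per_worker_share is a hypothetical helper.

```python
def effective_num_workers(num_workers):
    """Number of workers to divide work by; 0 means loading in the main process."""
    return max(1, num_workers)


def per_worker_share(total_items, num_workers):
    """Items each worker handles (ceiling division), safe for num_workers=0."""
    n = effective_num_workers(num_workers)
    return -(-total_items // n)


print(per_worker_share(1000, 0), per_worker_share(1000, 4))
```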

3. Usage

python scripts/train_pl.py --accelerator npu --num_gpus 1 --config configs/train/train_pxm_reduced.yml

4. Notes

In NPU mode, use num_workers=0 to avoid DataLoader multiprocessing segmentation faults (segfault).
