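Kunpeng Acceleration and Optimization Guide for GROMACS, Molecular Dynamics Software for AI4S
GROMACS is an open-source, cross-platform, high-performance molecular dynamics simulation package designed for biomolecular systems such as proteins, nucleic acids, and lipids. By simulating the trajectories of atoms or molecules over time, it reveals their structural, dynamic, and thermodynamic properties. It is widely used in biochemistry and drug discovery, as well as in materials science and chemical engineering.
The image already bundles GROMACS builds and their runtime environment for the Kunpeng 920B and Kunpeng 920F processors, so they can be run directly inside the container. The component versions are as follows:
Hardware: Kunpeng 920B, Kunpeng 920F
Software stack: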
| Component | Version |
|---|---|
| openEuler | 22.03-lts-sp4 |
| GROMACS | 2023.03 |
| HPCKit | 25.1.0 |
| FFTW | 3.3.10 |
| OpenBLAS | 0.3.24 |
| BiSheng | 3.2.0 |
After downloading the image, load it:
docker load -i kunpeng_gromacs.tar
Create a container from the image:
docker run -dit \
--network host \
--shm-size=1g \
--privileged \
--name gromacs \
kunpeng_gromacs:v1
GROMACS is located under /workspace in the container. Enter the container and change to that directory:
docker exec -it gromacs /bin/bash
cd /workspace
The GROMACS directory layout in the container is as follows:
├── /workspace
│ ├── gromacs_case # test cases: data02, data03
│ ├── gromacs_output_920b # prebuilt GROMACS for the 920B
│ └── gromacs_output_920f # prebuilt GROMACS for the 920F
Taking a Kunpeng 920B server as an example, change to the corresponding GROMACS directory:
cd gromacs_output_920b
A reference command is shown below. Choose the actual arguments according to the atom count and characteristics of the test case:
sh run_gromacs.sh 1 data03 320 80 1 1000 202604 && sleep 10
The run ends with output similar to the following. Provided the accuracy requirements are met, a higher ns/day value means better performance:
Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 13.5%.
The balanceable part of the MD step is 78%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 10.5%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 1.017
Part of the total run time spent waiting due to PP/PME imbalance: 1.3 %
NOTE: 10.5 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
Core t (s) Wall t (s) (%)
Time: 152881.360 477.769 31999.0
(ns/day) (hour/ns)
Performance: 361.681 0.066
The GROMACS launch command is as follows:
mpirun -np ${MPI_NUM} --allow-run-as-root -x UCX_TLS=sm --bind-to cpulist:ordered --mca coll ^ucg -mca pml ucx -mca btl ^vader,tcp,openib,uct -mca io romio321 --mca opal_common_ucx_tls any --mca coll_tuned_use_dynamic_rules true gmx_mpi mdrun -v -deffnm md -noconfout -pin on -nsteps ${STEP_NUM} -npme ${PME_NUM} -ntomp ${OMP_NUM} -s ${entry} -g ${log_file}
The command is tuned around the Kunpeng CPU's many-core architecture, NUMA topology, ARM-native communication libraries, and thread binding. The optimizations fall into four categories.
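Kunpeng CPUs are many-core, multi-threaded ARM processors with multiple NUMA nodes (for example, the Kunpeng 920 with 64 or 128 cores spread across several NUMA nodes). The command uses hybrid MPI + OpenMP parallelism to match this architecture, which is the most important of the four optimizations.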
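As a rough sanity check on the hybrid layout, the arithmetic behind `-np`, `-npme`, and `-ntomp` in the launch command above can be sketched in a few lines of shell. The function name `rank_layout` is hypothetical and not part of the image. The example values (80 ranks, 20 PME ranks, 1 thread per rank) mirror the 60 PP + 20 PME rank split reported in the 920F sample log in this guide. The sketch assumes an explicit, non-negative `-npme` value.

```shell
#!/bin/sh
# rank_layout is a hypothetical helper for illustration only.
# Usage: rank_layout MPI_NUM PME_NUM OMP_NUM
# Prints the particle-particle (PP) rank count and the total thread count.
rank_layout() {
    mpi_num=$1   # value passed to mpirun -np
    pme_num=$2   # value passed to mdrun -npme (dedicated PME ranks)
    omp_num=$3   # value passed to mdrun -ntomp (OpenMP threads per rank)

    # This sketch does not handle automatic PME rank selection.
    if [ "$pme_num" -lt 0 ]; then
        echo "error: expects an explicit -npme value (>= 0)" >&2
        return 1
    fi

    # Dedicated PME ranks come out of the MPI ranks; the rest do PP work.
    if [ "$pme_num" -ge "$mpi_num" ]; then
        echo "error: PME ranks must be fewer than MPI ranks" >&2
        return 1
    fi

    pp_num=$((mpi_num - pme_num))
    total=$((mpi_num * omp_num))
    echo "pp_ranks=$pp_num pme_ranks=$pme_num total_threads=$total"
}

rank_layout 80 20 1
```

Keeping `total_threads` at or below the number of physical cores on the node avoids oversubscribing cores once processes are pinned.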
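-np ${MPI_NUM}: number of MPI processes (maps to the number of Kunpeng NUMA nodes or physical core groups)
-ntomp ${OMP_NUM}: number of OpenMP threads per MPI process (maps to the physical cores assigned to each process)
Total cores = MPI processes × OpenMP threads
The Kunpeng ARM64 architecture does not depend on x86 communication libraries. The command forces the UCX high-performance communication framework, which is the MPI communication optimization used on Kunpeng.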
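-x UCX_TLS=sm: restricts UCX to shared-memory transports
-mca pml ucx: forces Open MPI to use UCX as its point-to-point messaging layer
--mca opal_common_ucx_tls any: opens up all UCX transports (any available transport qualifies), accommodating Kunpeng NICs and memory
UCX_TLS=sm turns off network transports and uses only shared memory, which suits single-node multi-core runs. In that setting, shared-memory communication latency is about an order of magnitude lower than TCP.
-mca btl ^vader,tcp,openib,uct disables the older BTL transport components (vader, tcp, openib, uct) so that traffic goes through UCX, avoiding compatibility overhead on ARM.
Kunpeng CPUs have a multi-cluster, multi-NUMA layout with sequentially ordered physical cores. The command pins processes to cores to prevent thread migration, which is key to getting the full compute throughput out of Kunpeng.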
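--bind-to cpulist:ordered: binds MPI processes, in order, to the list of physical CPU cores
ordered binds strictly in physical order, matching the Kunpeng core layout.
The last category is application-level tuning for Kunpeng ARM64 floating-point capability and I/O architecture, aimed at maximizing molecular dynamics throughput.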
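-mca io romio321: selects the romio321 MPI-IO component, used here as the MPI-IO optimization on Kunpeng
--mca coll_tuned_use_dynamic_rules true: enables dynamic rules for collective communication
gmx_mpi: the ARM-native GROMACS binary built for Kunpeng
gmx_mpi is compiled for the Kunpeng ARM64 NEON and SVE vector instruction sets. Kunpeng's vector units deliver 15% to 20% higher performance than x86 at the same clock frequency.
romio321 works with both distributed and local storage on Kunpeng and addresses the parallel MPI-IO write bottleneck on ARM.
To run the benchmark, execute the script: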
sh test_cases.sh
A sample of the key compute and timing log output from a Kunpeng 920F:
...
DD step 79 vol min/aver 0.828 load imb.: force 20.9% pme mesh/force 1.159
Step Time
100 0.20000
Energies (kJ/mol)
Bond Angle Proper Dih. Per. Imp. Dih. LJ-14
3.67509e+03 1.01173e+04 1.47569e+04 5.04886e+02 4.48732e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
5.05150e+04 9.33396e+04 -5.39806e+03 -9.04123e+05 3.70303e+03
Potential Kinetic En. Total Energy Conserved En. Temperature
-7.28422e+05 1.38581e+05 -5.89841e+05 -5.91019e+05 3.01208e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-1.65681e+02 -4.57648e+01 3.45385e-06
Energy conservation over simulation part #1 of length 0.2 ps, time 0 to 0.2 ps
Conserved energy drift: -1.97e-03 kJ/mol/ps per atom
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 101 steps using 2 frames
Energies (kJ/mol)
Bond Angle Proper Dih. Per. Imp. Dih. LJ-14
3.63777e+03 1.01396e+04 1.47194e+04 5.19234e+02 4.48868e+03
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
5.04255e+04 9.33731e+04 -5.39682e+03 -9.04528e+05 3.70537e+03
Potential Kinetic En. Total Energy Conserved En. Temperature
-7.28916e+05 1.38481e+05 -5.90436e+05 -5.91008e+05 3.00990e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-1.65605e+02 -6.59635e+01 0.00000e+00
Box-X Box-Y Box-Z
9.14969e+00 9.14969e+00 6.46981e+00
Total Virial (kJ/mol)
4.85213e+04 2.57473e+03 3.28731e+02
2.57434e+03 4.76508e+04 -1.35160e+03
3.30048e+02 -1.35016e+03 4.55362e+04
Pressure (bar)
-1.38811e+02 -1.57895e+02 -1.01795e+01
-1.57871e+02 -1.00869e+02 8.36721e+01
-1.02602e+01 8.35837e+01 4.17899e+01
T-Protein_MOL T-NA_CL_SOL
3.02141e+02 3.00856e+02
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 95.615368 860.538 0.3
NxN Ewald Elec. + LJ [F] 2395.619024 158110.856 51.1
NxN Ewald Elec. + LJ [V&F] 48.475232 5186.850 1.7
NxN Ewald Elec. [F] 1990.680944 121431.538 39.2
NxN Ewald Elec. [V&F] 40.309696 3386.014 1.1
1,4 nonbonded interactions 1.228261 110.543 0.0
Calc Weights 16.402602 590.494 0.2
Spread Q Bspline 349.922176 699.844 0.2
Gather F Bspline 349.922176 2099.533 0.7
3D-FFT 1961.517162 15692.137 5.1
Solve PME 2.585600 165.478 0.1
Reset In Box 0.108268 0.325 0.0
CG-CoM 0.162402 0.487 0.0
Bonds 0.238259 14.057 0.0
Angles 0.855167 143.668 0.0
Propers 1.511465 346.125 0.1
Impropers 0.093627 19.474 0.0
Virial 0.625174 11.253 0.0
Stop-CM 0.108268 1.083 0.0
Calc-Ekin 1.190948 32.156 0.0
Lincs 0.270345 16.221 0.0
Lincs-Mat 1.500504 6.002 0.0
Constraint-V 6.225897 56.033 0.0
Constraint-Vir 0.648756 15.570 0.0
Settle 1.895069 701.176 0.2
-----------------------------------------------------------------------------
Total 309697.456 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 169058.0
av. #atoms communicated per step for LINCS: 2 x 7701.7
Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 31.5%.
The balanceable part of the MD step is 51%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 16.1%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 1.055
Part of the total run time spent waiting due to PP/PME imbalance: 2.6 %
NOTE: 16.1 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 60 MPI ranks doing PP, and
on 20 MPI ranks doing PME
Activity: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
--------------------------------------------------------------------------------
Domain decomp. 60 1 2 0.008 0.032 1.3
DD comm. load 60 1 2 0.000 0.000 0.0
DD comm. bounds 60 1 2 0.000 0.002 0.1
Send X to PME 60 1 101 0.011 0.046 1.9
Neighbor search 60 1 3 0.015 0.062 2.5
Comm. coord. 60 1 98 0.011 0.047 1.9
Force 60 1 101 0.199 0.841 34.3
Wait + Comm. F 60 1 101 0.033 0.141 5.8
PME mesh * 20 1 101 0.293 0.412 16.8
PME wait for PP * 0.142 0.199 8.1
Wait + Recv. PME F 60 1 101 0.075 0.317 12.9
NB X/F buffer ops. 60 1 297 0.006 0.023 1.0
Write traj. 60 1 1 0.001 0.004 0.2
Update 60 1 101 0.013 0.056 2.3
Constraints 60 1 101 0.031 0.131 5.4
Comm. energies 60 1 11 0.021 0.088 3.6
Rest 0.011 0.048 1.9
--------------------------------------------------------------------------------
Total 0.436 2.450 100.0
--------------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
--------------------------------------------------------------------------------
Breakdown of PME mesh activities
--------------------------------------------------------------------------------
PME redist. X/F 20 1 202 0.030 0.042 1.7
PME spread 20 1 101 0.044 0.062 2.5
PME gather 20 1 101 0.056 0.079 3.2
PME 3D-FFT 20 1 202 0.137 0.193 7.9
PME 3D-FFT Comm. 20 1 404 0.021 0.029 1.2
PME solve Elec 20 1 101 0.004 0.005 0.2
--------------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 24.316 0.436 5581.5
(ns/day) (hour/ns)
Performance: 40.062 0.599
Finished mdrun on rank 0 Tue Apr 21 17:58:35 2026
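To compare runs without reading each log by hand, the two figures this guide focuses on, ns/day and average load imbalance, can be pulled out of an mdrun log with awk. This is a minimal sketch: the function names are assumptions for illustration, and the patterns rely on the `Performance:` and `Average load imbalance:` lines having the layout shown in the samples above.

```shell
#!/bin/sh
# Hypothetical helpers; each reads a log file and prints one number.

# Print the ns/day value: second field of the line starting with "Performance:".
ns_per_day() {
    awk '$1 == "Performance:" { print $2 }' "$1"
}

# Print the average load imbalance percentage without the trailing "%.".
load_imbalance() {
    awk '/Average load imbalance:/ { v = $NF; sub(/%\.?$/, "", v); print v }' "$1"
}

# Demo on a scratch file with lines copied from the 920F sample log.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
 Average load imbalance: 31.5%.
                 (ns/day)    (hour/ns)
Performance:       40.062        0.599
EOF
echo "ns/day: $(ns_per_day "$demo_log")"
echo "load imbalance (%): $(load_imbalance "$demo_log")"
rm -f "$demo_log"
```

On a real run, point the helpers at the log file passed to `-g` in the launch command.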