ai4s_kunpeng_gramacs

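Kunpeng acceleration and optimization guide for GROMACS, a molecular dynamics package for the AI4S domain

1. GROMACS overview and use cases

GROMACS is an open-source, cross-platform, high-performance molecular dynamics simulation package designed for biomolecular systems such as proteins, nucleic acids, and lipids. It simulates the trajectories of atoms or molecules over time to reveal their structural, dynamic, and thermodynamic properties, and is widely used in biochemistry and drug discovery as well as in materials science and chemical engineering.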

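2. Version baseline

The image already integrates the GROMACS builds and runtime environment for the Kunpeng 920B and Kunpeng 920F processors, so GROMACS can be run directly in the container. The component versions are as follows:

Hardware: Kunpeng 920B, Kunpeng 920F

Software stack:

Component	Version
openEuler	22.03-lts-sp4
GROMACS	2023.03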
HPCKit	25.1.0
fftw	3.3.10
openblas	0.3.24
bisheng	3.2.0

3. Usage guide

3.1 Load the image

After downloading the image, load it:

docker load -i kunpeng_gromacs.tar

Create a container from the image:

docker run -dit \
 --network host \
 --shm-size=1g \
 --privileged  \
 --name gromacs \
 kunpeng_gromacs:v1

3.2 Run GROMACS

GROMACS is located under /workspace inside the container.

docker exec -it gromacs /bin/bash

cd /workspace

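The GROMACS directory layout in the container is as follows:

├── /workspace
│   ├── gromacs_case  # test cases: data02, data03
│   ├── gromacs_output_920b # prebuilt GROMACS for the 920B
│   └── gromacs_output_920f # prebuilt GROMACS for the 920F

Taking a Kunpeng 920B server as an example, enter the corresponding GROMACS directory: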

cd gromacs_output_920b

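A reference command is shown below; choose the actual parameters according to the atom count and characteristics of the test case: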

sh run_gromacs.sh 1 data03 320 80 1 1000 202604 && sleep 10

The run produces output similar to the following. Provided the accuracy requirement is met, a higher ns/day value means better performance:

Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 13.5%.
 The balanceable part of the MD step is 78%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 10.5%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
 Average PME mesh/force load: 1.017
 Part of the total run time spent waiting due to PP/PME imbalance: 1.3 %

NOTE: 10.5 % of the available CPU time was lost due to load imbalance
      in the domain decomposition.
      You can consider manually changing the decomposition (option -dd);
      e.g. by using fewer domains along the box dimension in which there is
      considerable inhomogeneity in the simulated system.

               Core t (s)   Wall t (s)        (%)
       Time:   152881.360      477.769    31999.0
                 (ns/day)    (hour/ns)
Performance:      361.681        0.066
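
The headline figures can be pulled out of a log mechanically. The following is a minimal sketch, assuming the log ends with the Time: and Performance: lines shown above; the function names gmx_ns_per_day and gmx_busy_cores are illustrative and are not part of GROMACS or of run_gromacs.sh.

```shell
#!/bin/sh
# Print the ns/day figure from a GROMACS log (file argument or stdin).
# The line has the form: Performance: <ns/day> <hour/ns>
gmx_ns_per_day() {
    awk '$1 == "Performance:" { print $2 }' "$@"
}

# Print core time divided by wall time, i.e. roughly how many cores were busy.
# The line has the form: Time: <core t> <wall t> <percent>
gmx_busy_cores() {
    awk '$1 == "Time:" { printf "%.0f\n", $2 / $3 }' "$@"
}

# Sample copied from the log excerpt above; a real run would pass the log file path.
sample_log='               Core t (s)   Wall t (s)        (%)
       Time:   152881.360      477.769    31999.0
                 (ns/day)    (hour/ns)
Performance:      361.681        0.066'

printf '%s\n' "$sample_log" | gmx_ns_per_day
printf '%s\n' "$sample_log" | gmx_busy_cores
```

The two columns of the Performance line are reciprocal: 24 / 361.681 is about 0.066 hour/ns, so either one can be used to compare runs.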

4. Performance optimization

The reference launch command for GROMACS is:

mpirun -np ${MPI_NUM} --allow-run-as-root -x UCX_TLS=sm --bind-to cpulist:ordered --mca coll ^ucg -mca pml ucx -mca btl ^vader,tcp,openib,uct -mca io romio321 --mca opal_common_ucx_tls any --mca coll_tuned_use_dynamic_rules true gmx_mpi mdrun -v -deffnm md -noconfout -pin on -nsteps ${STEP_NUM} -npme ${PME_NUM} -ntomp ${OMP_NUM} -s ${entry} -g ${log_file}
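
The command relies on several shell variables that must be set beforehand. The sketch below shows one way to define them and to dry-run the mdrun part. The values are chosen to be consistent with the appendix log (60 PP ranks plus 20 PME ranks, one thread per rank, steps 0 to 100), although the source does not state that the run set them this way; the file names and the function name mdrun_args are hypothetical.

```shell
#!/bin/sh
MPI_NUM=80             # total MPI ranks for -np (PP ranks + PME ranks)
PME_NUM=20             # ranks dedicated to PME, for -npme
OMP_NUM=1              # OpenMP threads per rank, for -ntomp
STEP_NUM=100           # number of MD steps, for -nsteps
entry=md.tpr           # hypothetical run input file, for -s
log_file=md_920f.log   # hypothetical log file name, for -g

# Print the mdrun arguments instead of launching, so the values can be reviewed.
mdrun_args() {
    # -npme must leave at least one rank for particle-particle (PP) work.
    if [ "$PME_NUM" -ge "$MPI_NUM" ]; then
        echo "error: PME_NUM ($PME_NUM) must be smaller than MPI_NUM ($MPI_NUM)" >&2
        return 1
    fi
    echo "gmx_mpi mdrun -v -deffnm md -noconfout -pin on -nsteps ${STEP_NUM} -npme ${PME_NUM} -ntomp ${OMP_NUM} -s ${entry} -g ${log_file}"
}

mdrun_args
```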

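The tuning centers on the characteristics of Kunpeng CPUs: the many-core architecture, the NUMA topology, ARM-native communication libraries, and thread binding. The optimizations fall into four categories.

4.1 Kunpeng CPU core architecture optimization: hybrid parallelism (MPI + OpenMP)

Kunpeng CPUs are many-core, multi-threaded, multi-NUMA-node ARM processors (for example the Kunpeng 920 with 64 or 128 cores spread over several NUMA nodes). The command uses hybrid MPI + OpenMP parallelism to match this architecture, and this is the most important of the optimizations.

4.1.1 Key parameters

-np ${MPI_NUM}: sets the number of MPI processes (matched to the number of Kunpeng NUMA nodes or physical core groups)
-ntomp ${OMP_NUM}: sets the number of OpenMP threads per MPI process (matched to the number of physical cores in each group)
Combination rule: total cores = MPI processes × OpenMP threads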
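
The combination rule is plain integer arithmetic and can be checked before launching. A minimal sketch follows; the helper names hybrid_total_cores and ranks_for are illustrative only.

```shell
#!/bin/sh
# Total cores requested by a hybrid layout: MPI ranks x OpenMP threads per rank.
hybrid_total_cores() {
    echo $(( $1 * $2 ))
}

# Number of MPI ranks needed to fill a core budget at a given thread count.
# Fails when the budget is not an exact multiple of the thread count.
ranks_for() {
    total=$1
    omp=$2
    if [ $(( total % omp )) -ne 0 ]; then
        echo "error: $total cores is not a multiple of $omp threads" >&2
        return 1
    fi
    echo $(( total / omp ))
}

hybrid_total_cores 8 16   # 8 ranks x 16 threads = 128 cores
ranks_for 128 32          # 128 cores at 32 threads per rank = 4 ranks
```

The first result feeds a sanity check against the machine's physical core count; the second gives the value for -np once -ntomp has been chosen.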

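4.1.2 Kunpeng optimization rationale

Hyper-threading is not recommended on Kunpeng CPUs (it lowers floating-point efficiency on the ARM architecture); hybrid parallelism binds strictly to physical cores so that no resources are wasted on hyper-threads.
Under the Kunpeng NUMA architecture, each MPI process is bound to a single NUMA node and its OpenMP threads are bound to physical cores inside that node, which avoids cross-NUMA memory access (cross-NUMA latency on Kunpeng is much higher than same-NUMA latency).
For GROMACS on Kunpeng, an MPI-to-OpenMP ratio of 1:16 or 1:32 (one MPI process per 16 or 32 OpenMP threads) is the recommended configuration and matches the core density of 64-core and 128-core Kunpeng parts.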

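4.2 Kunpeng CPU communication optimization: ARM-native UCX communication library (core optimization)

The Kunpeng ARM64 platform does not depend on x86 communication libraries; the command forces MPI traffic through the UCX high-performance communication framework, which is the communication path tuned for Kunpeng.

4.2.1 Key parameters

-x UCX_TLS=sm: restricts UCX to shared-memory transports
-mca pml ucx: forces Open MPI to use UCX as its point-to-point messaging layer (PML)
--mca opal_common_ucx_tls any: accepts any available UCX transport when Open MPI decides whether UCX can be used, so the same command works with whichever NIC and memory transports the Kunpeng host provides

4.2.2 Kunpeng optimization rationale

ARM-native optimization: UCX is the MPI communication library officially recommended for Kunpeng and carries assembly-level optimizations for the ARM64 instruction set and the Kunpeng memory architecture, with a performance gain of 30% or more over Open MPI's BTL-based communication path.
Shared-memory optimization: UCX_TLS=sm turns off network transports and uses only shared memory, which suits single-node multi-core runs (on a single Kunpeng node, shared-memory latency is an order of magnitude lower than TCP).
Disabling unused communication components: -mca btl ^vader,tcp,openib,uct excludes the listed BTL (Byte Transfer Layer) components, so that these older communication modules add no compatibility overhead on ARM.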
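
Before forcing UCX_TLS=sm it can be useful to confirm which transports the installed UCX actually offers. The sketch below filters the output of ucx_info -d, assuming that each transport is reported on a line containing "Transport: <name>"; the function name ucx_transports is illustrative.

```shell
#!/bin/sh
# List the distinct transport names found in `ucx_info -d` output read from stdin.
# Assumption: every transport appears on a line of the form "... Transport: <name>".
ucx_transports() {
    awk '/Transport:/ { print $NF }' | sort -u
}

if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | ucx_transports
else
    echo "ucx_info not found; run this inside the container environment" >&2
fi
```

If no shared-memory transport (such as posix or sysv) is listed, UCX_TLS=sm would leave UCX without a usable transport between ranks.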

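4.3 Kunpeng CPU thread binding optimization: precise core binding (avoiding performance jitter)

Kunpeng CPUs have a multi-cluster, multi-NUMA layout with sequentially numbered physical cores. The command hard-binds processes to cores to prevent thread migration, which is key to getting the full compute capacity out of the processor.

4.3.1 Key parameters

--bind-to cpulist:ordered: binds MPI processes, in order, to the list of physical CPU cores

4.3.2 Kunpeng optimization rationale

Matching the Kunpeng core topology: physical cores on the Kunpeng 920 are grouped into clusters and numbered sequentially (for example, 0-63 in NUMA0 and 64-127 in NUMA1); ordered binds strictly in that physical order and therefore follows the core layout.
Preventing thread migration: the Linux scheduler on ARM migrates threads more aggressively than on x86, so without binding, threads switch between cores frequently; on Kunpeng the performance loss can reach 20% to 50%.
Cache affinity: once bound, a thread keeps its working set in the L1/L2/L3 caches local to its core instead of losing it on migration (the Kunpeng cache hierarchy is NUMA-local, so core binding maximizes the cache hit rate).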
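
The core numbering of a given machine can be read from sysfs rather than assumed. The following sketch, for Linux only, prints the CPU list of each NUMA node and counts its cores; the helper name cpulist_count is illustrative.

```shell
#!/bin/sh
# Count the CPUs named by a Linux cpulist string such as "0-63" or "0-31,64-95".
cpulist_count() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { n += 1 }             # single CPU id
        NF == 2 { n += $2 - $1 + 1 }   # inclusive range
        END     { print n + 0 }'
}

# Report the cores of every NUMA node exposed by sysfs.
for f in /sys/devices/system/node/node*/cpulist; do
    [ -r "$f" ] || continue
    node=${f%/cpulist}
    node=${node##*/}
    list=$(cat "$f")
    echo "$node: cpus $list ($(cpulist_count "$list") cores)"
done

cpulist_count "0-63"   # the NUMA0 range quoted above
```

Open MPI's --report-bindings option to mpirun can then be used to confirm that each rank landed on the intended cores.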

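4.4 Kunpeng CPU compute and I/O optimization: GROMACS-specific tuning

Application-level tuning aimed at the floating-point capability and I/O architecture of Kunpeng ARM64, to maximize molecular dynamics throughput.

4.4.1 Key parameters

-mca io romio321: selects Open MPI's ROMIO-based MPI-IO component, the MPI-IO path used on Kunpeng
--mca coll_tuned_use_dynamic_rules true: enables dynamic rules for collective communication
gmx_mpi: the ARM-native GROMACS binary built for Kunpeng

4.4.2 Kunpeng optimization rationale

ARM floating-point optimization: gmx_mpi is compiled for the Kunpeng ARM64 NEON and SVE vector instruction sets; the Kunpeng vector units deliver 15% to 20% higher performance than x86 at the same clock frequency.
I/O optimization: romio321 works with distributed or local storage on Kunpeng and relieves the parallel-write bottleneck of MPI-IO on ARM.
Dynamic communication: in many-core Kunpeng runs, the dynamic rules select the collective algorithm according to the core count, which avoids stalls at high core counts.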
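
Whether a given binary was really built with ARM vector support can be checked from its build information. The sketch below assumes that the output of gmx_mpi --version contains a "SIMD instructions:" line, as GROMACS build summaries do; the function name gmx_simd is illustrative.

```shell
#!/bin/sh
# Extract the SIMD flavour from GROMACS build information read on stdin.
gmx_simd() {
    awk -F': *' '/SIMD instructions:/ { print $2 }'
}

if command -v gmx_mpi >/dev/null 2>&1; then
    gmx_mpi --version 2>&1 | gmx_simd
else
    echo "gmx_mpi not found; load the GROMACS environment first" >&2
fi
```

A value such as ARM_NEON_ASIMD or ARM_SVE indicates an ARM-vectorized build; a value of None would mean the kernels run without SIMD.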

5. Benchmarking

To run the benchmark, execute the script:

sh test_cases.sh

Appendix

Sample excerpt of key log output (compute and timing) from a Kunpeng 920F run:

...
DD  step 79  vol min/aver 0.828  load imb.: force 20.9%  pme mesh/force 1.159
           Step           Time
            100        0.20000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih. Per. Imp. Dih.          LJ-14
    3.67509e+03    1.01173e+04    1.47569e+04    5.04886e+02    4.48732e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    5.05150e+04    9.33396e+04   -5.39806e+03   -9.04123e+05    3.70303e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.28422e+05    1.38581e+05   -5.89841e+05   -5.91019e+05    3.01208e+02
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.65681e+02   -4.57648e+01    3.45385e-06


Energy conservation over simulation part #1 of length 0.2 ps, time 0 to 0.2 ps
  Conserved energy drift: -1.97e-03 kJ/mol/ps per atom


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 101 steps using 2 frames

   Energies (kJ/mol)
           Bond          Angle    Proper Dih. Per. Imp. Dih.          LJ-14
    3.63777e+03    1.01396e+04    1.47194e+04    5.19234e+02    4.48868e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    5.04255e+04    9.33731e+04   -5.39682e+03   -9.04528e+05    3.70537e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.28916e+05    1.38481e+05   -5.90436e+05   -5.91008e+05    3.00990e+02
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.65605e+02   -6.59635e+01    0.00000e+00

          Box-X          Box-Y          Box-Z
    9.14969e+00    9.14969e+00    6.46981e+00

   Total Virial (kJ/mol)
    4.85213e+04    2.57473e+03    3.28731e+02
    2.57434e+03    4.76508e+04   -1.35160e+03
    3.30048e+02   -1.35016e+03    4.55362e+04

   Pressure (bar)
   -1.38811e+02   -1.57895e+02   -1.01795e+01
   -1.57871e+02   -1.00869e+02    8.36721e+01
   -1.02602e+01    8.35837e+01    4.17899e+01

  T-Protein_MOL    T-NA_CL_SOL
    3.02141e+02    3.00856e+02


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check              95.615368         860.538     0.3
 NxN Ewald Elec. + LJ [F]              2395.619024      158110.856    51.1
 NxN Ewald Elec. + LJ [V&F]              48.475232        5186.850     1.7
 NxN Ewald Elec. [F]                   1990.680944      121431.538    39.2
 NxN Ewald Elec. [V&F]                   40.309696        3386.014     1.1
 1,4 nonbonded interactions               1.228261         110.543     0.0
 Calc Weights                            16.402602         590.494     0.2
 Spread Q Bspline                       349.922176         699.844     0.2
 Gather F Bspline                       349.922176        2099.533     0.7
 3D-FFT                                1961.517162       15692.137     5.1
 Solve PME                                2.585600         165.478     0.1
 Reset In Box                             0.108268           0.325     0.0
 CG-CoM                                   0.162402           0.487     0.0
 Bonds                                    0.238259          14.057     0.0
 Angles                                   0.855167         143.668     0.0
 Propers                                  1.511465         346.125     0.1
 Impropers                                0.093627          19.474     0.0
 Virial                                   0.625174          11.253     0.0
 Stop-CM                                  0.108268           1.083     0.0
 Calc-Ekin                                1.190948          32.156     0.0
 Lincs                                    0.270345          16.221     0.0
 Lincs-Mat                                1.500504           6.002     0.0
 Constraint-V                             6.225897          56.033     0.0
 Constraint-Vir                           0.648756          15.570     0.0
 Settle                                   1.895069         701.176     0.2
-----------------------------------------------------------------------------
 Total                                                  309697.456   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 169058.0
 av. #atoms communicated per step for LINCS:  2 x 7701.7


Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 31.5%.
 The balanceable part of the MD step is 51%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 16.1%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
 Average PME mesh/force load: 1.055
 Part of the total run time spent waiting due to PP/PME imbalance: 2.6 %

NOTE: 16.1 % of the available CPU time was lost due to load imbalance
      in the domain decomposition.
      You can consider manually changing the decomposition (option -dd);
      e.g. by using fewer domains along the box dimension in which there is
      considerable inhomogeneity in the simulated system.

      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 60 MPI ranks doing PP, and
on 20 MPI ranks doing PME

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Domain decomp.           60    1          2       0.008          0.032   1.3
 DD comm. load            60    1          2       0.000          0.000   0.0
 DD comm. bounds          60    1          2       0.000          0.002   0.1
 Send X to PME            60    1        101       0.011          0.046   1.9
 Neighbor search          60    1          3       0.015          0.062   2.5
 Comm. coord.             60    1         98       0.011          0.047   1.9
 Force                    60    1        101       0.199          0.841  34.3
 Wait + Comm. F           60    1        101       0.033          0.141   5.8
 PME mesh *               20    1        101       0.293          0.412  16.8
 PME wait for PP *                                 0.142          0.199   8.1
 Wait + Recv. PME F       60    1        101       0.075          0.317  12.9
 NB X/F buffer ops.       60    1        297       0.006          0.023   1.0
 Write traj.              60    1          1       0.001          0.004   0.2
 Update                   60    1        101       0.013          0.056   2.3
 Constraints              60    1        101       0.031          0.131   5.4
 Comm. energies           60    1         11       0.021          0.088   3.6
 Rest                                              0.011          0.048   1.9
--------------------------------------------------------------------------------
 Total                                             0.436          2.450 100.0
--------------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 PME redist. X/F          20    1        202       0.030          0.042   1.7
 PME spread               20    1        101       0.044          0.062   2.5
 PME gather               20    1        101       0.056          0.079   3.2
 PME 3D-FFT               20    1        202       0.137          0.193   7.9
 PME 3D-FFT Comm.         20    1        404       0.021          0.029   1.2
 PME solve Elec           20    1        101       0.004          0.005   0.2
--------------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:       24.316        0.436     5581.5
                 (ns/day)    (hour/ns)
Performance:       40.062        0.599
Finished mdrun on rank 0 Tue Apr 21 17:58:35 2026
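
When tuning -npme it helps to read the PP/PME rank split back from the log. A minimal sketch, assuming the log contains the "MPI ranks doing PP" and "MPI ranks doing PME" lines shown above; the function name gmx_rank_split is illustrative.

```shell
#!/bin/sh
# Summarise the PP/PME rank split reported in a GROMACS log (file argument or stdin).
gmx_rank_split() {
    awk '
        /MPI ranks doing PP/  { pp  = $2 }
        /MPI ranks doing PME/ { pme = $2 }
        END {
            if (pp + pme == 0) { print "no rank information found"; exit 1 }
            printf "%d PP + %d PME = %d ranks, PME share %.0f%%\n",
                   pp, pme, pp + pme, 100 * pme / (pp + pme)
        }' "$@"
}

# Lines copied from the log above; a real run would pass the log file path.
printf '%s\n' 'On 60 MPI ranks doing PP, and' 'on 20 MPI ranks doing PME' | gmx_rank_split
```

Read together with the "Average PME mesh/force load" figure (1.055 in this log), the split shows whether the PME ranks or the PP ranks are the ones waiting.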