Ascend-SACT/ai4s_kunpeng_gromacs

ai4s_kunpeng_gromacs

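Kunpeng acceleration and optimization scheme for GROMACS, a molecular dynamics package in the AI4S (AI for Science) domain

1. GROMACS overview and use cases

GROMACS is an open-source, cross-platform, high-performance molecular dynamics simulation package designed for biomolecular systems such as proteins, nucleic acids, and lipids. By simulating how the atoms or molecules of a system move over time, it reveals the system's structural, dynamic, and thermodynamic properties. It is widely used in research fields such as biochemistry and drug discovery, materials science, and chemical engineering.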

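2. Version baseline

The current image already bundles the GROMACS builds and runtime environment for Kunpeng 920B and Kunpeng 920F processors, so GROMACS can be run directly inside a container. The component versions are as follows:

Hardware: Kunpeng 920B, Kunpeng 920F

Software stack: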

Component    Version
openEuler    22.03-lts-sp4
gromacs      2023.03
HPCKit       25.1.0
fftw         3.3.10
openblas     0.3.24
bisheng      3.2.0

3. Running guide

3.1 Load the image

After downloading the image, load it:

docker load -i kunpeng_gromacs.tar

Create a container from the image:

docker run -dit \
 --network host \
 --shm-size=1g \
 --privileged  \
 --name gromacs \
 kunpeng_gromacs:v1

3.2 Run GROMACS

Inside the container, GROMACS is located under /workspace:

docker exec -it gromacs /bin/bash

cd /workspace

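The GROMACS directory layout in the container is as follows:

├── /workspace
│   ├── gromacs_case  # test cases: data02, data03
│   ├── gromacs_output_920b # prebuilt GROMACS for the 920B
│   └── gromacs_output_920f # prebuilt GROMACS for the 920F

Taking a Kunpeng 920B server as an example, enter the corresponding GROMACS directory: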

cd gromacs_output_920b

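A reference command is shown below. Choose the actual arguments according to the atom count and the characteristics of the test case: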

sh run_gromacs.sh 1 data03 320 80 1 1000 202604 && sleep 10

The run ends with output similar to the following. Provided the accuracy requirement is met, a larger (ns/day) value means better performance:

Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 13.5%.
 The balanceable part of the MD step is 78%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 10.5%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
 Average PME mesh/force load: 1.017
 Part of the total run time spent waiting due to PP/PME imbalance: 1.3 %

NOTE: 10.5 % of the available CPU time was lost due to load imbalance
      in the domain decomposition.
      You can consider manually changing the decomposition (option -dd);
      e.g. by using fewer domains along the box dimension in which there is
      considerable inhomogeneity in the simulated system.

               Core t (s)   Wall t (s)        (%)
       Time:   152881.360      477.769    31999.0
                 (ns/day)    (hour/ns)
Performance:      361.681        0.066
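When comparing several runs, it can help to pull the performance figure out of the log automatically. The following is a minimal sketch, assuming the log ends with a "Performance:" line in the format shown above; the helper name extract_ns_per_day and the temporary sample file are illustrative and not part of the image.

```shell
#!/usr/bin/env bash
# Sketch: read the ns/day figure from a GROMACS mdrun log.
# extract_ns_per_day is a hypothetical helper name, not shipped with the image.

extract_ns_per_day() {
    # The line has the form: Performance: <ns/day> <hour/ns>
    awk '$1 == "Performance:" { print $2 }' "$1"
}

# Small sample log built from the lines shown above, for illustration only.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
               Core t (s)   Wall t (s)        (%)
       Time:   152881.360      477.769    31999.0
                 (ns/day)    (hour/ns)
Performance:      361.681        0.066
EOF

ns_per_day="$(extract_ns_per_day "$sample_log")"
# hour/ns is 24 / (ns/day); recompute it as a sanity check on the parsed value.
hours_per_ns="$(awk -v p="$ns_per_day" 'BEGIN { printf "%.3f", 24 / p }')"
echo "ns/day=${ns_per_day} hour/ns=${hours_per_ns}"
rm -f "$sample_log"
```

For a real run, pass the log written by mdrun (the file given to -g) instead of the sample file.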

4. Performance optimization

The reference launch command for GROMACS is:

mpirun -np ${MPI_NUM} --allow-run-as-root -x UCX_TLS=sm --bind-to cpulist:ordered --mca coll ^ucg -mca pml ucx -mca btl ^vader,tcp,openib,uct -mca io romio321 --mca opal_common_ucx_tls any --mca coll_tuned_use_dynamic_rules true gmx_mpi mdrun -v -deffnm md -noconfout -pin on -nsteps ${STEP_NUM} -npme ${PME_NUM} -ntomp ${OMP_NUM} -s ${entry} -g ${log_file}
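Before launching, it may be worth checking that the rank counts are consistent, because the ranks given to -npme are taken out of the total given to -np. The sketch below assumes that relationship and uses illustrative values; check_rank_layout is a hypothetical helper, and the automatic setting -npme -1 is deliberately not handled.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check MPI_NUM, PME_NUM and OMP_NUM before calling mpirun.
# check_rank_layout is a hypothetical helper; the values used are illustrative.

check_rank_layout() {
    local mpi_num="$1" pme_num="$2" omp_num="$3"
    local v
    # All three values must be non-negative integers.
    for v in "$mpi_num" "$pme_num" "$omp_num"; do
        case "$v" in
            ''|*[!0-9]*) echo "not a non-negative integer: '$v'" >&2; return 1 ;;
        esac
    done
    # At least one rank and one thread per rank are required.
    if [ "$mpi_num" -lt 1 ] || [ "$omp_num" -lt 1 ]; then
        echo "MPI_NUM and OMP_NUM must be at least 1" >&2
        return 1
    fi
    # PME ranks come out of the MPI ranks, so at least one PP rank must remain.
    if [ "$pme_num" -ge "$mpi_num" ]; then
        echo "PME_NUM ($pme_num) must be smaller than MPI_NUM ($mpi_num)" >&2
        return 1
    fi
    echo "PP ranks: $((mpi_num - pme_num)), PME ranks: $pme_num, threads/rank: $omp_num"
}

check_rank_layout 80 20 1
```

With 80 ranks, 20 of them for PME, the helper reports 60 PP ranks, which is the same split that mdrun prints in the "R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G" section of its log.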

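The tuning centers on the characteristics of Kunpeng CPUs: the many-core architecture, the NUMA topology, the ARM-native communication library, and thread binding. The optimizations fall into four categories.

4.1 Kunpeng CPU core architecture optimization: hybrid parallelism (MPI + OpenMP)

Kunpeng CPUs are many-core, multi-threaded ARM processors with multiple NUMA nodes (a typical example is the Kunpeng 920 with 64 or 128 cores across several NUMA nodes). The command uses hybrid MPI + OpenMP parallelism to match this architecture, and this is the most important optimization.

4.1.1 Key parameters

  1. -np ${MPI_NUM}: sets the number of MPI processes (corresponding to the number of Kunpeng NUMA nodes or physical core groups)
  2. -ntomp ${OMP_NUM}: sets the number of OpenMP threads per MPI process (corresponding to Kunpeng physical cores)
  3. Combination rule: total cores = MPI processes × OpenMP threads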
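The combination rule above can be turned into a small calculation that derives the OpenMP thread count from the core count and the chosen number of MPI processes. This is a minimal sketch; omp_threads_per_rank is a hypothetical helper, and the core counts are illustrative values rather than recommendations.

```shell
#!/usr/bin/env bash
# Sketch: derive OMP_NUM from total cores = MPI processes x OpenMP threads.
# omp_threads_per_rank is a hypothetical helper name.

omp_threads_per_rank() {
    local total_cores="$1" mpi_num="$2"
    # Reject layouts that would leave cores unused or oversubscribed.
    if [ "$mpi_num" -lt 1 ] || [ $((total_cores % mpi_num)) -ne 0 ]; then
        echo "total cores ($total_cores) not divisible by MPI processes ($mpi_num)" >&2
        return 1
    fi
    echo $((total_cores / mpi_num))
}

omp_threads_per_rank 128 8     # prints 16
omp_threads_per_rank 320 320   # prints 1
```

The result can be assigned to OMP_NUM and passed to -ntomp in the launch command.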

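4.1.2 Kunpeng optimization rationale

  1. Hyper-threading is not recommended on Kunpeng CPUs (it lowers floating-point efficiency on the ARM architecture); hybrid parallelism binds strictly to physical cores so that no resources are wasted on hyper-threads;
  2. Under the Kunpeng NUMA architecture, each MPI process is bound to a single NUMA node and its OpenMP threads are bound to physical cores within that node, which avoids cross-NUMA memory access (cross-NUMA latency on Kunpeng is much higher than latency within a NUMA node);
  3. For GROMACS on Kunpeng, an MPI-to-OpenMP ratio of 1:16 or 1:32 is the best configuration and matches the core density of 64-core and 128-core Kunpeng processors.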

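4.2 Kunpeng CPU communication optimization: ARM-native UCX communication library (core optimization)

The Kunpeng ARM64 architecture does not rely on x86 communication libraries. The command forces the use of the UCX high-performance communication framework, a communication optimization specific to MPI on Kunpeng.

4.2.1 Key parameters

  1. -x UCX_TLS=sm: tells UCX to use shared-memory communication only
  2. -mca pml ucx: forces OpenMPI to use UCX as its core communication abstraction layer
  3. --mca opal_common_ucx_tls any: opens up the full range of UCX transports, to suit Kunpeng NICs and memory

4.2.2 Kunpeng optimization rationale

  1. ARM-native optimization: UCX is the MPI communication library officially recommended for Kunpeng. It has assembly-level optimizations for the ARM64 instruction set and the Kunpeng memory architecture, and delivers 30%+ higher performance than x86 communication libraries (such as BTL);
  2. Shared-memory optimization: UCX_TLS=sm turns off network communication and uses only Kunpeng's high-speed shared memory, which suits single-node multi-core parallel runs (in that scenario on Kunpeng, shared-memory communication latency is an order of magnitude lower than TCP);
  3. Disabling unneeded communication components: -mca btl ^vader,tcp,openib,uct turns off the legacy communication modules from the x86 architecture, avoiding compatibility overhead on the ARM architecture.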

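4.3 Kunpeng CPU thread binding optimization: precise CPU core binding (avoiding performance jitter)

Kunpeng CPUs have a multi-cluster, multi-NUMA layout with sequentially ordered physical cores. The command uses hard core binding to solve the thread migration problem on the ARM architecture, which is key to getting the full compute capability out of Kunpeng.

4.3.1 Key parameters

--bind-to cpulist:ordered: binds MPI processes, in order, to the list of physical CPU cores

4.3.2 Kunpeng optimization rationale

  1. Matching the Kunpeng core topology: the physical cores of the Kunpeng 920 are grouped by cluster and numbered sequentially (0-63 on NUMA0, 64-127 on NUMA1); ordered binds strictly in physical order, which matches the Kunpeng core layout;
  2. Preventing thread migration: the Linux kernel scheduler on ARM is more aggressive than on x86; without binding, threads switch between cores frequently, and the performance loss on Kunpeng can reach 20% to 50%;
  3. Cache affinity: once bound, a thread has exclusive use of the Kunpeng L1/L2/L3 caches, which avoids cache invalidation (the Kunpeng cache hierarchy is designed to be NUMA-local, so core binding maximizes the cache hit rate).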
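The sequential numbering described above makes it straightforward to compute the core range that belongs to a NUMA node. The sketch below assumes contiguous numbering and the same number of cores on every node, as in the 0-63 / 64-127 example; numa_cpu_range is a hypothetical helper, and the actual layout of a given server should be confirmed with a tool such as lscpu or numactl --hardware.

```shell
#!/usr/bin/env bash
# Sketch: compute the CPU index range of a NUMA node.
# Assumes contiguous core numbering and equal-sized nodes (an assumption,
# verify on the target machine). numa_cpu_range is a hypothetical helper.

numa_cpu_range() {
    local node="$1" cores_per_node="$2"
    local first=$((node * cores_per_node))
    local last=$((first + cores_per_node - 1))
    echo "${first}-${last}"
}

numa_cpu_range 0 64   # prints 0-63
numa_cpu_range 1 64   # prints 64-127
```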

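4.4 Kunpeng CPU compute and I/O optimization: GROMACS-specific tuning

Application-level optimizations targeting the floating-point capability and I/O architecture of Kunpeng ARM64, to maximize the efficiency of the molecular dynamics computation.

4.4.1 Key parameters

  1. -mca io romio321: MPI-IO optimization component specific to the Kunpeng architecture
  2. --mca coll_tuned_use_dynamic_rules true: enables dynamic rules for collective communication
  3. gmx_mpi: ARM-native GROMACS binary compiled for Kunpeng

4.4.2 Kunpeng optimization rationale

  1. ARM floating-point optimization: gmx_mpi is compiled for the Kunpeng ARM64 NEON and SVE vector instruction sets; Kunpeng's vector units perform 15% to 20% better than x86 at the same clock frequency;
  2. I/O optimization: romio321 is adapted to Kunpeng distributed and local storage, and resolves the MPI-IO parallel write bottleneck on the ARM architecture;
  3. Dynamic communication: in many-core Kunpeng scenarios, the dynamic rules adjust the collective communication algorithm automatically according to the core count, which avoids stalls across cores.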

Benchmarking

To run the benchmark, execute the script:

sh test_cases.sh

Appendix

A sample excerpt of the key log output (compute throughput and timing) from a run on one Kunpeng 920F:

...
DD  step 79  vol min/aver 0.828  load imb.: force 20.9%  pme mesh/force 1.159
           Step           Time
            100        0.20000

   Energies (kJ/mol)
           Bond          Angle    Proper Dih. Per. Imp. Dih.          LJ-14
    3.67509e+03    1.01173e+04    1.47569e+04    5.04886e+02    4.48732e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    5.05150e+04    9.33396e+04   -5.39806e+03   -9.04123e+05    3.70303e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.28422e+05    1.38581e+05   -5.89841e+05   -5.91019e+05    3.01208e+02
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.65681e+02   -4.57648e+01    3.45385e-06


Energy conservation over simulation part #1 of length 0.2 ps, time 0 to 0.2 ps
  Conserved energy drift: -1.97e-03 kJ/mol/ps per atom


	<======  ###############  ==>
	<====  A V E R A G E S  ====>
	<==  ###############  ======>

	Statistics over 101 steps using 2 frames

   Energies (kJ/mol)
           Bond          Angle    Proper Dih. Per. Imp. Dih.          LJ-14
    3.63777e+03    1.01396e+04    1.47194e+04    5.19234e+02    4.48868e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    5.04255e+04    9.33731e+04   -5.39682e+03   -9.04528e+05    3.70537e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
   -7.28916e+05    1.38481e+05   -5.90436e+05   -5.91008e+05    3.00990e+02
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -1.65605e+02   -6.59635e+01    0.00000e+00

          Box-X          Box-Y          Box-Z
    9.14969e+00    9.14969e+00    6.46981e+00

   Total Virial (kJ/mol)
    4.85213e+04    2.57473e+03    3.28731e+02
    2.57434e+03    4.76508e+04   -1.35160e+03
    3.30048e+02   -1.35016e+03    4.55362e+04

   Pressure (bar)
   -1.38811e+02   -1.57895e+02   -1.01795e+01
   -1.57871e+02   -1.00869e+02    8.36721e+01
   -1.02602e+01    8.35837e+01    4.17899e+01

  T-Protein_MOL    T-NA_CL_SOL
    3.02141e+02    3.00856e+02


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check              95.615368         860.538     0.3
 NxN Ewald Elec. + LJ [F]              2395.619024      158110.856    51.1
 NxN Ewald Elec. + LJ [V&F]              48.475232        5186.850     1.7
 NxN Ewald Elec. [F]                   1990.680944      121431.538    39.2
 NxN Ewald Elec. [V&F]                   40.309696        3386.014     1.1
 1,4 nonbonded interactions               1.228261         110.543     0.0
 Calc Weights                            16.402602         590.494     0.2
 Spread Q Bspline                       349.922176         699.844     0.2
 Gather F Bspline                       349.922176        2099.533     0.7
 3D-FFT                                1961.517162       15692.137     5.1
 Solve PME                                2.585600         165.478     0.1
 Reset In Box                             0.108268           0.325     0.0
 CG-CoM                                   0.162402           0.487     0.0
 Bonds                                    0.238259          14.057     0.0
 Angles                                   0.855167         143.668     0.0
 Propers                                  1.511465         346.125     0.1
 Impropers                                0.093627          19.474     0.0
 Virial                                   0.625174          11.253     0.0
 Stop-CM                                  0.108268           1.083     0.0
 Calc-Ekin                                1.190948          32.156     0.0
 Lincs                                    0.270345          16.221     0.0
 Lincs-Mat                                1.500504           6.002     0.0
 Constraint-V                             6.225897          56.033     0.0
 Constraint-Vir                           0.648756          15.570     0.0
 Settle                                   1.895069         701.176     0.2
-----------------------------------------------------------------------------
 Total                                                  309697.456   100.0
-----------------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 169058.0
 av. #atoms communicated per step for LINCS:  2 x 7701.7


Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 31.5%.
 The balanceable part of the MD step is 51%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 16.1%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
 Average PME mesh/force load: 1.055
 Part of the total run time spent waiting due to PP/PME imbalance: 2.6 %

NOTE: 16.1 % of the available CPU time was lost due to load imbalance
      in the domain decomposition.
      You can consider manually changing the decomposition (option -dd);
      e.g. by using fewer domains along the box dimension in which there is
      considerable inhomogeneity in the simulated system.

      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 60 MPI ranks doing PP, and
on 20 MPI ranks doing PME

 Activity:              Num   Num      Call    Wall time         Giga-Cycles
                        Ranks Threads  Count      (s)         total sum    %
--------------------------------------------------------------------------------
 Domain decomp.           60    1          2       0.008          0.032   1.3
 DD comm. load            60    1          2       0.000          0.000   0.0
 DD comm. bounds          60    1          2       0.000          0.002   0.1
 Send X to PME            60    1        101       0.011          0.046   1.9
 Neighbor search          60    1          3       0.015          0.062   2.5
 Comm. coord.             60    1         98       0.011          0.047   1.9
 Force                    60    1        101       0.199          0.841  34.3
 Wait + Comm. F           60    1        101       0.033          0.141   5.8
 PME mesh *               20    1        101       0.293          0.412  16.8
 PME wait for PP *                                 0.142          0.199   8.1
 Wait + Recv. PME F       60    1        101       0.075          0.317  12.9
 NB X/F buffer ops.       60    1        297       0.006          0.023   1.0
 Write traj.              60    1          1       0.001          0.004   0.2
 Update                   60    1        101       0.013          0.056   2.3
 Constraints              60    1        101       0.031          0.131   5.4
 Comm. energies           60    1         11       0.021          0.088   3.6
 Rest                                              0.011          0.048   1.9
--------------------------------------------------------------------------------
 Total                                             0.436          2.450 100.0
--------------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
--------------------------------------------------------------------------------
 Breakdown of PME mesh activities
--------------------------------------------------------------------------------
 PME redist. X/F          20    1        202       0.030          0.042   1.7
 PME spread               20    1        101       0.044          0.062   2.5
 PME gather               20    1        101       0.056          0.079   3.2
 PME 3D-FFT               20    1        202       0.137          0.193   7.9
 PME 3D-FFT Comm.         20    1        404       0.021          0.029   1.2
 PME solve Elec           20    1        101       0.004          0.005   0.2
--------------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:       24.316        0.436     5581.5
                 (ns/day)    (hour/ns)
Performance:       40.062        0.599
Finished mdrun on rank 0 Tue Apr 21 17:58:35 2026