Redesigning GROMACS Halo Exchange: Improving Strong Scaling with GPU-initiated NVSHMEM
Mahesh Doijade, Andrey Alekseenko, Ania Brown, Alan Gray, Szilárd Páll
Published: 2025/9/25
Abstract
Improving time-to-solution in molecular dynamics simulations often requires strong scaling due to fixed-sized problems. GROMACS is highly latency-sensitive, with peak iteration rates in the sub-millisecond, making scalability on heterogeneous supercomputers challenging. MPI's CPU-centric nature introduces additional latencies on GPU-resident applications' critical path, hindering GPU utilization and scalability. To address these limitations, we present an NVSHMEM-based GPU kernel-initiated redesign of the GROMACS domain decomposition halo-exchange algorithm. Highly tuned GPU kernels fuse data packing and communication, leveraging hardware latency-hiding for fine-grained overlap. We employ kernel fusion across overlapped data forwarding communication phases and utilize the asynchronous copy engine over NVLink to optimize latency and bandwidth. Our GPU-resident formulation greatly increases communication-computation overlap, improving GROMACS strong scaling performance across NVLink by up to 1.5x (intra-node) and 2x (multi-node), and up to 1.3x multi-node over NVLink+InfiniBand. This demonstrates the profound benefits of GPU-initiated communication for strong-scaling a broad range of latency-sensitive applications.