Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters

Palethorpe, Elise; Stocks, Ryan; Barca, Giuseppe M.J.

2025-05-23
ISSN: 1549-9618
PubMed:39586097
http://www.scopus.com/inward/record.url?scp=85210314880&partnerID=8YFLogxK
https://hdl.handle.net/1885/733752344

This Article presents two optimized multi-GPU algorithms for Fock matrix construction, building on the work of Ufimtsev and Martinez [J. Chem. Theory Comput. 2009, 5, 1004-1015] and Barca et al. [J. Chem. Theory Comput. 2021, 17, 7486-7503]. The novel algorithms, opt-UM and opt-Brc, introduce significant enhancements, including improved integral screening, exploitation of sparsity and symmetry, a linear scaling exchange matrix assembly algorithm, and extended capabilities for Hartree-Fock calculations up to f-type angular momentum functions. Opt-Brc excels for smaller systems and for highly contracted triple-ζ basis sets, while opt-UM is advantageous for large molecular systems. Performance benchmarks on NVIDIA A100 GPUs show that our algorithms in the EXtreme-scale Electronic Structure System (EXESS), when combined, outperform all current GPU and CPU Fock build implementations in TeraChem, QUICK, GPU4PySCF, LibIntX, ORCA, and Q-Chem. The implementations were benchmarked on linear and globular systems, and average speedups across three double-ζ basis sets of 1.4×, 8.4×, and 9.4× were observed compared to TeraChem, QUICK, and GPU4PySCF, respectively. An increased average speedup of 2.1× over TeraChem is observed when using four A100 GPUs. Strong scaling analysis reveals over 91% parallel efficiency on four GPUs for opt-Brc, making it typically faster for multi-GPU execution. Single-compute-node comparisons with CPU-based software like ORCA and Q-Chem show speedups of up to 42× and 31×, respectively, enhancing power efficiency by up to 18×.

The authors thank the National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science User Facility, for resource allocation on the Perlmutter supercomputer under award ERCAP0026496. EP and RS acknowledge the National Industry PhD program, the Department of Education and QDX technologies for providing additional funding. The authors also thank the NCMAS and ANUMAS computational allocation schemes for access to the Gadi supercomputer at NCI and the Setonix supercomputer at the Pawsey Supercomputing Centre.

19
en
© 2024 The Author(s)
2024
DOI: 10.1021/acs.jctc.4c00994
Scopus: 85210314880