Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters

dc.contributor.author: Palethorpe, Elise [en]
dc.contributor.author: Stocks, Ryan [en]
dc.contributor.author: Barca, Giuseppe M.J. [en]
dc.date.accessioned: 2025-05-23T13:25:06Z
dc.date.available: 2025-05-23T13:25:06Z
dc.date.issued: 2024 [en]
dc.description.abstract: This Article presents two optimized multi-GPU algorithms for Fock matrix construction, building on the work of Ufimtsev and Martinez [J. Chem. Theory Comput. 2009, 5, 1004-1015] and Barca et al. [J. Chem. Theory Comput. 2021, 17, 7486-7503]. The novel algorithms, opt-UM and opt-Brc, introduce significant enhancements, including improved integral screening, exploitation of sparsity and symmetry, a linear scaling exchange matrix assembly algorithm, and extended capabilities for Hartree-Fock calculations up to f-type angular momentum functions. Opt-Brc excels for smaller systems and for highly contracted triple-ζ basis sets, while opt-UM is advantageous for large molecular systems. Performance benchmarks on NVIDIA A100 GPUs show that our algorithms in the EXtreme-scale Electronic Structure System (EXESS), when combined, outperform all current GPU and CPU Fock build implementations in TeraChem, QUICK, GPU4PySCF, LibIntX, ORCA, and Q-Chem. The implementations were benchmarked on linear and globular systems, and average speedups across three double-ζ basis sets of 1.4×, 8.4×, and 9.4× were observed compared to TeraChem, QUICK, and GPU4PySCF, respectively. An increased average speedup of 2.1× over TeraChem is observed when using four A100 GPUs. Strong scaling analysis reveals over 91% parallel efficiency on four GPUs for opt-Brc, making it typically faster for multi-GPU execution. Single-compute-node comparisons with CPU-based software like ORCA and Q-Chem show speedups of up to 42× and 31×, respectively, enhancing power efficiency by up to 18×. [en]
dc.description.sponsorship: The authors thank the National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science User Facility, using award ERCAP0026496, for resource allocation on the Perlmutter supercomputer. EP and RS acknowledge the National Industry PhD program, the Department of Education and QDX technologies for providing additional funding. The authors also thank the NCMAS and ANUMAS computational allocation schemes for access to the Gadi supercomputer at NCI and the Setonix supercomputer at the Pawsey Supercomputing Centre. [en]
dc.description.status: Peer-reviewed [en]
dc.format.extent: 19 [en]
dc.identifier.issn: 1549-9618 [en]
dc.identifier.other: PubMed:39586097 [en]
dc.identifier.scopus: 85210314880 [en]
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85210314880&partnerID=8YFLogxK [en]
dc.identifier.uri: https://hdl.handle.net/1885/733752344
dc.language.iso: en [en]
dc.rights: © 2024 The Author(s) [en]
dc.source: Journal of Chemical Theory and Computation [en]
dc.title: Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters [en]
dc.type: Journal article [en]
dspace.entity.type: Publication [en]
local.bibliographicCitation.lastpage: 10442 [en]
local.bibliographicCitation.startpage: 10424 [en]
local.contributor.affiliation: Palethorpe, Elise; AGRTP Stipend Scholar - CECS, The Australian National University [en]
local.contributor.affiliation: Stocks, Ryan; ANU College of Systems and Society, The Australian National University [en]
local.contributor.affiliation: Barca, Giuseppe M.J.; School of Computing and Information Systems [en]
local.identifier.citationvolume: 20 [en]
local.identifier.doi: 10.1021/acs.jctc.4c00994 [en]
local.identifier.pure: 4369a908-4b23-4e10-9876-f3766c436e17 [en]
local.identifier.url: https://www.scopus.com/pages/publications/85210314880 [en]
local.type.status: Published [en]
