High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique
Abstract
The data volume of Partial Differential Equation (PDE) based
ultra-large-scale scientific simulations is increasing at a
higher rate than system processing power. To
process the increased amount of simulation data within a
reasonable amount of time, the evolution of computation is
expected to reach the exascale level. One of several key
challenges in these exascale systems is handling the high rate
of component failure that arises when millions of cores work
together at high power consumption and clock frequencies. Studies
show that even highly tuned, widely used checkpointing techniques
cannot handle failures efficiently in exascale systems. The
Sparse Grid Combination Technique (SGCT) has been shown to be a
cost-effective method for computing high-dimensional PDE based
simulations with only a small loss of accuracy, and it can be
easily modified to provide Algorithm-Based Fault Tolerance (ABFT)
for these applications.
Additionally, the recently introduced User Level Failure
Mitigation (ULFM) MPI library provides the ability to detect and
identify application process failures, and reconstruct the failed
processes. However, there is a gap in the research on how these
could be integrated to develop fault-tolerant
applications, and the range of issues that may arise in the
process is yet to be revealed.
My thesis is that, with suitable infrastructural support, an
integration of ULFM MPI and a modified form of the SGCT can be
used to create high performance, robust PDE based applications.
The key contributions of my thesis are: (1) An evaluation of the
effectiveness of applying the modified version of the SGCT on
three existing and complex applications (including a general
advection solver) to make them highly fault-tolerant. (2) An
evaluation of the capabilities of ULFM MPI to recover from a
single or multiple real process/node failures for a range of
complex applications computed with the modified form of the SGCT.
(3) A detailed experimental evaluation of the fault-tolerant work
including the time and space requirements, and parallelization on
the non-SGCT dimensions. (4) An analysis of the result errors
with respect to the number of failures. (5) An analysis of the
ABFT and recovery overheads. (6) An in-depth comparison of the
fault-tolerant SGCT based ABFT with traditional checkpointing on
a non-fault-tolerant SGCT based application. (7) A detailed
evaluation of the infrastructural support in terms of load
balancing, pure- and hybrid-MPI, process layouts, processor
affinity, and so on.
Keywords
Fault Tolerance, ULFM, Failure Detection, Failure Identification, Process Failure Recovery, Node Failure Recovery, PDE Solver, Sparse Grid Combination Technique, Algorithm-Based Fault Tolerance, Approximation Error, Load Balancing, Pure-MPI, Hybrid-MPI, Process Layouts, Processor Affinity, Gyrokinetic Plasma, Lattice Boltzmann Method, Solid Fuel Ignition