Open Research is currently re-indexing its items due to scheduled maintenance on Saturday 14th March 2026. As such not all items in the collection may be searchable at this time.

High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique

Loading...
Thumbnail Image

Date

Authors

Ali, Md Mohsin

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

The data volume of Partial Differential Equation (PDE) based ultra-large-scale scientific simulations is increasing at a higher rate than that of the system’s processing power. To process the increased amount of simulation data within a reasonable amount of time, the evolution of computation is expected to reach the exascale level. One of several key challenges to overcome in these exascale systems is to handle the high rate of component failure arising due to having millions of cores working together with high power consumption and clock frequencies. Studies show that even the highly tuned widely used checkpointing technique is unable to handle the failures efficiently in exascale systems. The Sparse Grid Combination Technique (SGCT) is proved to be a cost-effective method for computing high-dimensional PDE based simulations with only small loss of accuracy, which can be easily modified to provide an Algorithm-Based Fault Tolerance (ABFT) for these applications. Additionally, the recently introduced User Level Failure Mitigation (ULFM) MPI library provides the ability to detect and identify application process failures, and reconstruct the failed processes. However, there is a gap of the research how these could be integrated together to develop fault-tolerant applications, and the range of issues that may arise in the process are yet to be revealed. My thesis is that with suitable infrastructural support an integration of ULFM MPI and a modified form of the SGCT can be used to create high performance robust PDE based applications. The key contributions of my thesis are: (1) An evaluation of the effectiveness of applying the modified version of the SGCT on three existing and complex applications (including a general advection solver) to make them highly fault-tolerant. (2) An evaluation of the capabilities of ULFM MPI to recover from a single or multiple real process/node failures for a range of complex applications computed with the modified form of the SGCT. (3) A detailed experimental evaluation of the fault-tolerant work including the time and space requirements, and parallelization on the non-SGCT dimensions. (4) An analysis of the result errors with respect to the number of failures. (5) An analysis of the ABFT and recovery overheads. (6) An in-depth comparison of the fault-tolerant SGCT based ABFT with traditional checkpointing on a non-fault-tolerant SGCT based application. (7) A detailed evaluation of the infrastructural support in terms of load balancing, pure- and hybrid-MPI, process layouts, processor affinity, and so on.

Description

Citation

Source

Book Title

Entity type

Access Statement

License Rights

Restricted until

Downloads