High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique

dc.contributor.author: Ali, Md Mohsin [en_AU]
dc.date.accessioned: 2016-10-13T23:57:57Z
dc.date.available: 2016-10-13T23:57:57Z
dc.date.issued: 2016
dc.description.abstract: The data volume of ultra-large-scale scientific simulations based on Partial Differential Equations (PDEs) is growing faster than system processing power. To process this simulation data within a reasonable amount of time, computing is expected to evolve to the exascale level. One of several key challenges in exascale systems is handling the high rate of component failure that arises from millions of cores working together at high power consumption and clock frequencies. Studies show that even highly tuned versions of the widely used checkpointing technique cannot handle failures efficiently in exascale systems. The Sparse Grid Combination Technique (SGCT) has been shown to be a cost-effective method for computing high-dimensional PDE-based simulations with only a small loss of accuracy, and it can be easily modified to provide Algorithm-Based Fault Tolerance (ABFT) for these applications. Additionally, the recently introduced User Level Failure Mitigation (ULFM) MPI library provides the ability to detect and identify application process failures and to reconstruct the failed processes. However, there is a gap in the research on how these could be integrated to develop fault-tolerant applications, and the range of issues that may arise in the process is yet to be revealed. My thesis is that, with suitable infrastructural support, an integration of ULFM MPI and a modified form of the SGCT can be used to create high-performance, robust PDE-based applications. The key contributions of my thesis are: (1) An evaluation of the effectiveness of applying the modified version of the SGCT to three existing and complex applications (including a general advection solver) to make them highly fault-tolerant. (2) An evaluation of the capabilities of ULFM MPI to recover from single or multiple real process/node failures for a range of complex applications computed with the modified form of the SGCT. (3) A detailed experimental evaluation of the fault-tolerant work, including the time and space requirements, and parallelization on the non-SGCT dimensions. (4) An analysis of the result errors with respect to the number of failures. (5) An analysis of the ABFT and recovery overheads. (6) An in-depth comparison of the fault-tolerant SGCT-based ABFT with traditional checkpointing on a non-fault-tolerant SGCT-based application. (7) A detailed evaluation of the infrastructural support in terms of load balancing, pure- and hybrid-MPI, process layouts, processor affinity, and so on. [en_AU]
dc.identifier.other: b40393987
dc.identifier.uri: http://hdl.handle.net/1885/109292
dc.language.iso: en [en_AU]
dc.subject: Fault Tolerance [en_AU]
dc.subject: ULFM [en_AU]
dc.subject: Failure Detection [en_AU]
dc.subject: Failure Identification [en_AU]
dc.subject: Process Failure Recovery [en_AU]
dc.subject: Node Failure Recovery [en_AU]
dc.subject: PDE Solver [en_AU]
dc.subject: Sparse Grid Combination Technique [en_AU]
dc.subject: Algorithm-Based Fault Tolerance [en_AU]
dc.subject: Approximation Error [en_AU]
dc.subject: Load Balancing [en_AU]
dc.subject: Pure-MPI [en_AU]
dc.subject: Hybrid-MPI [en_AU]
dc.subject: Process Layouts [en_AU]
dc.subject: Processor Affinity [en_AU]
dc.subject: Gyrokinetic Plasma [en_AU]
dc.subject: Lattice Boltzmann Method [en_AU]
dc.subject: Solid Fuel Ignition [en_AU]
dc.title: High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique [en_AU]
dc.type: Thesis (PhD) [en_AU]
dcterms.valid: 2016 [en_AU]
local.contributor.affiliation: Research School of Computer Science, College of Engineering and Computer Science, The Australian National University [en_AU]
local.contributor.supervisor: Strazdins, Peter
local.identifier.doi: 10.25911/5d7786f9d5ed1
local.mintdoi: mint
local.type.degree: Doctor of Philosophy (PhD) [en_AU]
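
The abstract states that the SGCT can be modified to provide algorithm-based fault tolerance. As a rough illustration of why that is plausible, the sketch below computes combination coefficients for a two-dimensional set of sub-grid levels using the standard inclusion-exclusion formula over a downward-closed index set, and then recomputes them after one sub-grid is treated as lost. This is a minimal sketch under stated assumptions, not the scheme used in the thesis: the function names, the level n = 4, and the choice of lost grid (2, 2) are all hypothetical, and the record does not describe the thesis's actual recovery strategy.

```python
from itertools import product


def classical_index_set(n):
    """Levels (i, j) with i, j >= 0 and i + j <= n (an assumed 2D convention)."""
    return {(i, j) for i in range(n + 1) for j in range(n + 1) if i + j <= n}


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed 2D index set.

    c(i, j) = sum over z in {0, 1}^2 of (-1)^(z1 + z2) * [(i + z1, j + z2) in set].
    Grids whose coefficient is zero are omitted from the result.
    """
    index_set = set(index_set)
    coeffs = {}
    for (i, j) in index_set:
        c = 0
        for zi, zj in product((0, 1), repeat=2):
            if (i + zi, j + zj) in index_set:
                c += (-1) ** (zi + zj)
        if c != 0:
            coeffs[(i, j)] = c
    return coeffs


n = 4  # hypothetical combination level
full = classical_index_set(n)
coeffs = combination_coefficients(full)

# Pretend the sub-grid (2, 2) was lost to a process failure. It lies on the
# top diagonal, so removing it keeps the index set downward-closed.
lost = (2, 2)
recovered = combination_coefficients(full - {lost})

print("fault-free coefficients:", dict(sorted(coeffs.items())))
print("after losing", lost, ":", dict(sorted(recovered.items())))
```

In both cases the coefficients sum to 1, so a constant function is still reproduced exactly by the combination. The lost grid simply drops out and nearby coarser grids pick up new weights, at the price of a somewhat larger approximation error.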

Downloads

Original bundle

Name: Ali Thesis 2016.pdf
Size: 4.36 MB
Format: Adobe Portable Document Format