High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique
| dc.contributor.author | Ali, Md Mohsin | en_AU |
| dc.date.accessioned | 2016-10-13T23:57:57Z | |
| dc.date.available | 2016-10-13T23:57:57Z | |
| dc.date.issued | 2016 | |
| dc.description.abstract | The data volume of Partial Differential Equation (PDE) based ultra-large-scale scientific simulations is increasing at a higher rate than the processing power of the systems that run them. To process the increased amount of simulation data within a reasonable amount of time, computation is expected to reach the exascale level. One of several key challenges to overcome in these exascale systems is handling the high rate of component failure that arises from having millions of cores working together at high power consumption and clock frequencies. Studies show that even highly tuned, widely used checkpointing techniques are unable to handle failures efficiently in exascale systems. The Sparse Grid Combination Technique (SGCT) has proved to be a cost-effective method for computing high-dimensional PDE based simulations with only a small loss of accuracy, and it can be easily modified to provide Algorithm-Based Fault Tolerance (ABFT) for these applications. Additionally, the recently introduced User Level Failure Mitigation (ULFM) MPI library provides the ability to detect and identify application process failures, and to reconstruct the failed processes. However, there is a gap in the research on how these could be integrated to develop fault-tolerant applications, and the range of issues that may arise in the process is yet to be revealed. My thesis is that, with suitable infrastructural support, an integration of ULFM MPI and a modified form of the SGCT can be used to create high performance robust PDE based applications. The key contributions of my thesis are: (1) An evaluation of the effectiveness of applying the modified version of the SGCT to three existing and complex applications (including a general advection solver) to make them highly fault-tolerant. (2) An evaluation of the capabilities of ULFM MPI to recover from single or multiple real process/node failures for a range of complex applications computed with the modified form of the SGCT. (3) A detailed experimental evaluation of the fault-tolerant work, including the time and space requirements, and parallelization on the non-SGCT dimensions. (4) An analysis of the result errors with respect to the number of failures. (5) An analysis of the ABFT and recovery overheads. (6) An in-depth comparison of the fault-tolerant SGCT based ABFT with traditional checkpointing on a non-fault-tolerant SGCT based application. (7) A detailed evaluation of the infrastructural support in terms of load balancing, pure- and hybrid-MPI, process layouts, processor affinity, and so on. | en_AU |
| dc.identifier.other | b40393987 | |
| dc.identifier.uri | http://hdl.handle.net/1885/109292 | |
| dc.language.iso | en | en_AU |
| dc.subject | Fault Tolerance | en_AU |
| dc.subject | ULFM | en_AU |
| dc.subject | Failure Detection | en_AU |
| dc.subject | Failure Identification | en_AU |
| dc.subject | Process Failure Recovery | en_AU |
| dc.subject | Node Failure Recovery | en_AU |
| dc.subject | PDE Solver | en_AU |
| dc.subject | Sparse Grid Combination Technique | en_AU |
| dc.subject | Algorithm-Based Fault Tolerance | en_AU |
| dc.subject | Approximation Error | en_AU |
| dc.subject | Load Balancing | en_AU |
| dc.subject | Pure-MPI | en_AU |
| dc.subject | Hybrid-MPI | en_AU |
| dc.subject | Process Layouts | en_AU |
| dc.subject | Processor Affinity | en_AU |
| dc.subject | Gyrokinetic Plasma | en_AU |
| dc.subject | Lattice Boltzmann Method | en_AU |
| dc.subject | Solid Fuel Ignition | en_AU |
| dc.title | High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique | en_AU |
| dc.type | Thesis (PhD) | en_AU |
| dcterms.valid | 2016 | en_AU |
| local.contributor.affiliation | Research School of Computer Science, College of Engineering and Computer Science, The Australian National University | en_AU |
| local.contributor.supervisor | Strazdins, Peter | |
| local.identifier.doi | 10.25911/5d7786f9d5ed1 | |
| local.mintdoi | mint | |
| local.type.degree | Doctor of Philosophy (PhD) | en_AU |