Resilience in high-level parallel programming languages

dc.contributor.author: Hamouda, Sara S.
dc.date.accessioned: 2019-06-20T23:04:48Z
dc.date.available: 2019-06-20T23:04:48Z
dc.date.issued: 2019
dc.description.abstract: The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputers make supporting task parallelism and resilience a necessity in HPC programming models. Given the complexity of managing multi-threaded distributed execution in the presence of failures, there is a critical need for task-parallel abstractions that simplify writing efficient, modular, and understandable fault-tolerant applications. MPI User-Level Failure Mitigation (MPI-ULFM) is an emerging fault-tolerant specification of MPI. It supports failure detection by returning special error codes and provides new interfaces for failure mitigation. Unfortunately, the unstructured form of failure reporting provided by MPI-ULFM hinders the composability and clarity of fault-tolerant programs. The low-level programming model of MPI and the simplistic failure reporting mechanism adopted by MPI-ULFM make MPI-ULFM more suitable as a low-level communication layer for resilient high-level languages than as a direct programming model for application development. The asynchronous partitioned global address space (APGAS) model is a high-level programming model designed to improve the productivity of developing large-scale applications. It represents a computation as a global control flow of nested parallel tasks that use global data partitioned among processes. Recent advances in the APGAS model supported control flow recovery by adding failure awareness to the nested parallelism model (async-finish) and by providing structured failure reporting through exceptions. Unfortunately, the current implementation of the resilient async-finish model results in a high performance overhead that can restrict the scalability of applications. Moreover, the lack of data resilience support limits the productivity of the model, as it shifts the challenges of handling data availability and atomicity under failure to the programmer.
In this thesis, we demonstrate that resilient APGAS languages can achieve scalable performance under failure by exploiting fault tolerance features in emerging communication libraries such as MPI-ULFM. We propose multi-resolution resilience, in which high-level resilient constructs are composed from efficient lower-level resilient constructs, as an approach for bridging the gap between the efficiency of user-level fault tolerance and the productivity of system-level fault tolerance. To address the limited resilience efficiency of the async-finish model, we propose 'optimistic finish', a message-optimal resilient termination detection protocol for the finish construct. To improve programmer productivity, we augment the APGAS model with resilient data stores that can simplify preserving critical application data in the presence of failure. In addition, we propose the 'transactional finish' construct as a productive mechanism for handling atomic updates on resilient data. Finally, we demonstrate the multi-resolution resilience approach by designing high-level resilient application frameworks based on the async-finish model. We implemented the above enhancements in the X10 language, an embodiment of the APGAS model, and empirically evaluated the performance of resilient X10 using micro-benchmarks and a suite of transactional and non-transactional resilient applications. Concepts of the APGAS model are realized in multiple programming languages, which can benefit from the conceptual and technical contributions of this thesis. The presented empirical evaluation results will aid future comparisons with other resilient programming models. [en_AU]
dc.identifier.other: b59286416
dc.identifier.uri: http://hdl.handle.net/1885/164137
dc.language.iso: en_AU [en_AU]
dc.subject: APGAS [en_AU]
dc.subject: Resilience [en_AU]
dc.subject: Fault Tolerance [en_AU]
dc.subject: X10 [en_AU]
dc.subject: MPI-ULFM [en_AU]
dc.subject: Transactional Memory [en_AU]
dc.subject: Checkpoint-Restart [en_AU]
dc.subject: Async-Finish [en_AU]
dc.subject: Task-Based Runtime Systems [en_AU]
dc.subject: Termination Detection [en_AU]
dc.subject: Taxonomy of Resilient Programming Models [en_AU]
dc.title: Resilience in high-level parallel programming languages [en_AU]
dc.type: Thesis (PhD) [en_AU]
dcterms.valid: 2019 [en_AU]
local.contributor.affiliation: Research School of Computer Science, Australian National University [en_AU]
local.contributor.authoremail: sara.salem@anu.edu.au [en_AU]
local.contributor.supervisor: Milthorpe, Josh
local.contributor.supervisorcontact: josh.milthorpe@anu.edu.au [en_AU]
local.description.notes: Deposited by the author [en_AU]
local.identifier.doi: 10.25911/5d0cb264c1c22
local.mintdoi: mint [en_AU]
local.type.degree: Doctor of Philosophy (PhD) [en_AU]

Downloads

Original bundle

Name: Hamouda_S_thesis_final.pdf
Size: 2.42 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 884 B
Format: Item-specific license agreed upon to submission