Resilience in high-level parallel programming languages

Hamouda, Sara S.

dc.contributor.author	Hamouda, Sara S.
dc.date.accessioned	2019-06-20T23:04:48Z
dc.date.available	2019-06-20T23:04:48Z
dc.date.issued	2019
dc.description.abstract	The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputers make supporting task parallelism and resilience a necessity in HPC programming models. Given the complexity of managing multi-threaded distributed execution in the presence of failures, there is a critical need for task-parallel abstractions that simplify writing efficient, modular, and understandable fault-tolerant applications. MPI User-Level Failure Mitigation (MPI-ULFM) is an emerging fault-tolerant specification of MPI. It supports failure detection by returning special error codes and provides new interfaces for failure mitigation. Unfortunately, the unstructured form of failure reporting provided by MPI-ULFM hinders the composability and clarity of fault-tolerant programs. The low-level programming model of MPI and the simplistic failure reporting mechanism adopted by MPI-ULFM make MPI-ULFM more suitable as a low-level communication layer for resilient high-level languages than as a direct programming model for application development. The asynchronous partitioned global address space (APGAS) model is a high-level programming model designed to improve the productivity of developing large-scale applications. It represents a computation as a global control flow of nested parallel tasks that use global data partitioned among processes. Recent advances in the APGAS model supported control flow recovery by adding failure awareness to the nested parallelism model (async-finish) and by providing structured failure reporting through exceptions. Unfortunately, the current implementation of the resilient async-finish model results in a high performance overhead that can restrict the scalability of applications. Moreover, the lack of data resilience support limits the productivity of the model as it shifts the challenges of handling data availability and atomicity under failure to the programmer.
In this thesis, we demonstrate that resilient APGAS languages can achieve scalable performance under failure by exploiting fault tolerance features in emerging communication libraries such as MPI-ULFM. We propose multi-resolution resilience, in which high-level resilient constructs are composed from efficient lower-level resilient constructs, as an approach for bridging the gap between the efficiency of user-level fault tolerance and the productivity of system-level fault tolerance. To address the limited resilience efficiency of the async-finish model, we propose 'optimistic finish', a message-optimal resilient termination detection protocol for the finish construct. To improve programmer productivity, we augment the APGAS model with resilient data stores that can simplify preserving critical application data in the presence of failure. In addition, we propose the 'transactional finish' construct as a productive mechanism for handling atomic updates on resilient data. Finally, we demonstrate the multi-resolution resilience approach by designing high-level resilient application frameworks based on the async-finish model. We implemented the above enhancements in the X10 language, an embodiment of the APGAS model, and empirically evaluated the performance of resilient X10 using micro-benchmarks and a suite of transactional and non-transactional resilient applications. Concepts of the APGAS model are realized in multiple programming languages, which can benefit from the conceptual and technical contributions of this thesis. The presented empirical evaluation results will aid future comparisons with other resilient programming models.	en_AU
dc.identifier.other	b59286416
dc.identifier.uri	http://hdl.handle.net/1885/164137
dc.language.iso	en_AU	en_AU
dc.subject	APGAS	en_AU
dc.subject	Resilience	en_AU
dc.subject	Fault Tolerance	en_AU
dc.subject	X10	en_AU
dc.subject	MPI-ULFM	en_AU
dc.subject	Transactional Memory	en_AU
dc.subject	Checkpoint-Restart	en_AU
dc.subject	Async-Finish	en_AU
dc.subject	Task-Based Runtime Systems	en_AU
dc.subject	Termination Detection	en_AU
dc.subject	Taxonomy of Resilient Programming Models	en_AU
dc.title	Resilience in high-level parallel programming languages	en_AU
dc.type	Thesis (PhD)	en_AU
dcterms.valid	2019	en_AU
local.contributor.affiliation	Research School of Computer Science, Australian National University	en_AU
local.contributor.authoremail	sara.salem@anu.edu.au	en_AU
local.contributor.supervisor	Milthorpe, Josh
local.contributor.supervisorcontact	josh.milthorpe@anu.edu.au	en_AU
local.description.notes	Deposited by the author	en_AU
local.identifier.doi	10.25911/5d0cb264c1c22
local.mintdoi	mint	en_AU
local.type.degree	Doctor of Philosophy (PhD)	en_AU

Downloads

Original bundle

Name: Hamouda_S_thesis_final.pdf
Size: 2.42 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 884 B
Format: Item-specific license agreed upon to submission

Collections

Open Access Theses