Algorithm-based fault recovery of adaptively refined parallel multilevel grids

Stals, Linda

dc.contributor.author: Stals, Linda
dc.date.accessioned: 2019-04-08T23:14:06Z
dc.identifier.issn: 1094-3420
dc.identifier.uri: http://hdl.handle.net/1885/159361
dc.description.abstract: On future extreme scale computers, it is expected that faults will become an increasingly serious problem as the number of individual components grows and failures become more frequent. This is driving the interest in designing algorithms with built-in fault tolerance that can continue to operate and that can replace data even if part of the computation is lost in a failure. For fault-free computations, the use of adaptive refinement techniques in combination with finite element methods is well established. Furthermore, iterative solution techniques that incorporate information about the grid structure, such as the parallel geometric multigrid method, have been shown to be an efficient approach to solving various types of partial differential equations. In this article, we present an advanced parallel adaptive multigrid method that uses dynamic data structures to store a nested sequence of meshes and the iteratively evolving solution. After a fail-stop fault, the data residing on the faulty processor will be lost. However, with suitably designed data structures, the neighbouring processors contain enough information so that a consistent mesh can be reconstructed in the faulty domain with the goal of resuming the computation without having to restart from scratch. This recovery is based on a set of carefully designed distributed algorithms that build on the existing parallel adaptive refinement routines, but which must be augmented and extended.
dc.format.extent: 23 pages
dc.format.mimetype: application/pdf
dc.language.iso: en_AU
dc.publisher: SAGE Publications
dc.rights: © 2019 Sage Publications
dc.source: International Journal of High Performance Computing Applications
dc.subject: fault-tolerant algorithms
dc.subject: adaptive finite elements
dc.subject: multigrid
dc.subject: dynamic data structures
dc.title: Algorithm-based fault recovery of adaptively refined parallel multilevel grids
dc.type: Journal article
local.description.notes: Imported from ARIES
local.identifier.citationvolume: 33
dc.date.issued: 2019-01-01
local.identifier.absfor: 010302 - Numerical Solution of Differential and Integral Equations
local.identifier.ariespublication: a383154xPUB9925
local.publisher.url: https://uk.sagepub.com/en-gb/eur/home
local.type.status: Published Version
local.contributor.affiliation: Stals, Linda, College of Science, The Australian National University
local.description.embargo: 2037-12-31
local.identifier.essn: 1741-2846
local.bibliographicCitation.issue: 1
local.bibliographicCitation.startpage: 189
local.bibliographicCitation.lastpage: 211
local.identifier.doi: 10.1177/1094342017720801
local.identifier.absseo: 970101 - Expanding Knowledge in the Mathematical Sciences
dc.date.updated: 2019-03-12T07:22:27Z
local.identifier.scopusID: 2-s2.0-85041627348
Collections: ANU Research Publications
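The abstract describes resuming a parallel multigrid computation after a fail-stop fault by using data that survives on neighbouring processors. The sketch below is a hypothetical toy model, not the algorithm from the article: it does not reconstruct an adaptively refined mesh, and only illustrates the simpler step of resuming a geometric multigrid iteration for the 1D Poisson problem -u'' = f, u(0) = u(1) = 0, after the solution values owned by one simulated processor are lost. It assumes the neighbours keep ghost copies of the two values bordering the failed block (`fail_and_recover` is a hypothetical helper) and refills the block by linear interpolation between them. The solver is a textbook V-cycle with Gauss-Seidel smoothing, full-weighting restriction and linear interpolation.

```python
import math


def residual(u, f, h):
    """Return r = f - A u, where (A u)_i = (-u[i-1] + 2 u[i] - u[i+1]) / h^2."""
    n = len(u) - 1
    r = [0.0] * (n + 1)  # boundary entries stay zero (Dirichlet nodes)
    for i in range(1, n):
        r[i] = f[i] + (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
    return r


def smooth(u, f, h, sweeps=2):
    """Gauss-Seidel sweeps on A u = f, updating u in place."""
    n = len(u) - 1
    for _ in range(sweeps):
        for i in range(1, n):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def restrict(r):
    """Full-weighting restriction from n to n/2 intervals."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for j in range(1, nc):
        rc[j] = 0.25 * (r[2 * j - 1] + 2.0 * r[2 * j] + r[2 * j + 1])
    return rc


def prolong(ec):
    """Linear interpolation from n/2 to n intervals."""
    nc = len(ec) - 1
    e = [0.0] * (2 * nc + 1)
    for j in range(nc + 1):
        e[2 * j] = ec[j]
    for j in range(nc):
        e[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    return e


def v_cycle(u, f, h):
    """One recursive V(2,2)-cycle for A u = f on a grid with len(u) - 1 intervals."""
    n = len(u) - 1
    if n == 2:  # coarsest grid: a single interior unknown, solved exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return
    smooth(u, f, h)
    rc = restrict(residual(u, f, h))
    ec = [0.0] * len(rc)
    v_cycle(ec, rc, 2.0 * h)  # coarse-grid correction for the error equation
    e = prolong(ec)
    for i in range(n + 1):
        u[i] += e[i]
    smooth(u, f, h)


def max_norm(v):
    return max(abs(x) for x in v)


def fail_and_recover(u, lo, hi):
    """Hypothetical fail-stop of the processor owning u[lo:hi]."""
    # Ghost copies of the bordering values, assumed to be held by the neighbours.
    left, right = u[lo - 1], u[hi]
    for i in range(lo, hi):
        u[i] = float("nan")  # the failed processor's data is gone
    width = hi - lo + 1
    for i in range(lo, hi):  # refill by linear interpolation between the ghosts
        u[i] = left + (right - left) * (i - lo + 1) / width


def solve(u, f, h, tol=1e-8, max_cycles=50):
    """Run V-cycles until the max-norm residual is below tol; return the cycle count."""
    for cycle in range(1, max_cycles + 1):
        v_cycle(u, f, h)
        if max_norm(residual(u, f, h)) < tol:
            return cycle
    return max_cycles


n = 64                       # number of intervals (a power of two)
h = 1.0 / n
x = [i * h for i in range(n + 1)]
f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in x]  # exact u = sin(pi x)
u = [0.0] * (n + 1)
for _ in range(3):           # some progress before the fault
    v_cycle(u, f, h)
fail_and_recover(u, 16, 32)  # the processor owning nodes 16..31 fails
cycles = solve(u, f, h)      # resume from the recovered state
print(cycles, max_norm(residual(u, f, h)))
```

Because the values outside the failed block are untouched, the resumed iteration converges to the same discrete solution as a fault-free run. The sketch makes no claim about how much work this saves; in the article's setting that depends on the reconstructed mesh and solution.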

Download

File: 02 Stals Algorithm-based fault recovery 2017.pdf (2.5 MB, Adobe PDF)


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.

Updated: 17 November 2022 / Responsible Officer: University Librarian / Page Contact: Library Systems & Web Coordinator