Fault Tolerant Computation with the Sparse Grid Combination Technique

Loading...
Thumbnail Image

Date

Authors

Harding, Brendan
Hegland, Markus
Larson, Jay
Southern, J.

Journal Title

Journal ISSN

Volume Title

Publisher

SIAM Publications

Abstract

This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J. Electron. Suppl., 54 (2013), pp. C394-C411]. This approach to fault tolerance is novel for two reasons: First, the combination technique adds an additional level of parallelism, and second, it provides algorithmbased fault tolerance so that solutions can still be recovered if failures occur during computation. Previous work indicates how the combination technique may be adapted for a low number of faults. In this paper we develop a generalization of the combination technique for which arbitrary collections of coarse approximations may be combined to obtain an accurate approximation. A general fault tolerant combination technique for large numbers of faults is a natural consequence of this work. Using a renewal model for the time between faults on each node of a high performance computer, we also provide bounds on the expected error for interpolation with this algorithm in the presence of faults. Numerical experiments solving the scalar advection PDE demonstrate that the algorithm is resilient to faults on a real application. It is observed that the time to solution is not significantly affected by the presence of (simulated) faults. Additionally the expected error increases with the number of faults but is relatively small even for high fault rates. A comparison with traditional checkpoint-restart methods applied to the combination technique shows that our approach is highly scalable with respect to the number of faults.

Description

Keywords

Citation

Source

SIAM Journal on Scientific Computing

Book Title

Entity type

Access Statement

License Rights

Restricted until

2037-12-31