A parallel spectral transform Shallow Water Code (PSTSWM)
These applications show that, using the FT-MPI specification, one can significantly improve the performance of an application when an error occurs. As with most fault-tolerant applications known in the literature, however, there is a trade-off between the additional resources used to achieve fault-tolerance (memory, processes) and the level of fault-tolerance achieved (e.g. the number of process failures that the application can survive).
In-memory checkpointing is a very promising approach even for complex applications, as shown for the parallel spectral transform shallow water code (PSTSWM). The technique applies to complex applications because most real-world simulations already have a checkpoint-restart interface built in. Using an in-memory checkpoint algorithm therefore usually requires modifying only the checkpoint and restart routines, not the whole application.
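To make the idea concrete, the following is a minimal sketch of one common in-memory checkpointing variant, in which every process keeps a local copy of its state and one additional process holds an element-wise checksum. It is a plain-Python simulation, not FT-MPI code, and it is not necessarily the scheme used for PSTSWM; the names `checkpoint` and `restart` and the integer state vectors are assumptions made for illustration.

```python
def checkpoint(states):
    """Keep one in-memory copy per process plus an element-wise checksum.

    `states` is a list of equally long integer vectors, one per simulated
    process. The checksum stands in for the extra checkpoint process.
    """
    copies = [list(s) for s in states]
    checksum = [sum(column) for column in zip(*copies)]
    return copies, checksum


def restart(copies, checksum, failed):
    """Rebuild the copy of the single failed rank from the survivors.

    Survivors roll back to their own copies; the lost vector is the
    checksum minus the sum of the surviving copies.
    """
    survivors = [c for rank, c in enumerate(copies) if rank != failed]
    lost = [total - sum(column)
            for total, column in zip(checksum, zip(*survivors))]
    return [lost if rank == failed else list(c)
            for rank, c in enumerate(copies)]


# Hypothetical example data: three processes with three values each.
states = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
copies, checksum = checkpoint(states)
copies[1] = None                      # rank 1 fails and its memory is gone
recovered = restart(copies, checksum, failed=1)
```

In this sketch a single checksum survives exactly one process failure between two checkpoints, which illustrates the trade-off between additional resources and the number of failures that can be tolerated. Only the two routines above would need to change in an application that already has a checkpoint-restart interface.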
Furthermore, a fault-tolerant manager-worker framework has been developed that does not use in-memory checkpointing. The key point of this framework is to show that all applications that can be written using a manager-worker paradigm (e.g. parameter sweep studies) can easily be adapted to FT-MPI. The current implementation of the framework can make use of all three communicator modes (rebuild, blank and shrink).
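The manager-worker pattern and the three communicator modes can be sketched as follows. This is a simplified, sequential Python simulation under stated assumptions, not the framework's implementation and not the FT-MPI API: the worker table `slots`, the function names, and the `fail_at` set (the dispatch steps at which the selected worker is assumed to die) are all hypothetical.

```python
from collections import deque


def apply_mode(slots, failed, mode):
    """Return the worker table after the worker in slot `failed` has died."""
    if mode == "rebuild":   # a replacement process takes over the same slot
        return [s + "*" if i == failed else s for i, s in enumerate(slots)]
    if mode == "blank":     # the slot stays in place but is left empty
        return [None if i == failed else s for i, s in enumerate(slots)]
    if mode == "shrink":    # the table is compacted, later workers move up
        return [s for i, s in enumerate(slots) if i != failed]
    raise ValueError(f"unknown mode: {mode}")


def run_sweep(params, n_workers, fail_at, mode):
    """Manager loop: hand out tasks and requeue the task of a failed worker."""
    slots = [f"w{i}" for i in range(n_workers)]
    pending = deque(params)
    results = {}
    step = 0
    while pending:
        live = [i for i, s in enumerate(slots) if s is not None]
        if not live:
            raise RuntimeError("no workers left")
        slot = live[step % len(live)]        # round-robin over live workers
        task = pending.popleft()
        if step in fail_at:
            pending.append(task)             # the unfinished task is retried
            slots = apply_mode(slots, slot, mode)
        else:
            results[task] = task * task      # stand-in for the real work
        step += 1
    return results, slots


# Hypothetical parameter sweep: 6 tasks, 4 workers, failures at steps 1 and 3.
outcome = {mode: run_sweep(range(6), 4, {1, 3}, mode)
           for mode in ("rebuild", "blank", "shrink")}
```

In all three modes the sweep completes because the manager simply reassigns the unfinished task; the modes differ only in what the worker table looks like afterwards, which is why no checkpointing is needed for this class of applications.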
Recent work by Geist and Engelman presents new algorithms for solving partial differential equations, called ‘naturally fault tolerant algorithms’. Based on mesh-less methods and chaotic relaxation, Geist and Engelman show that the algorithm still converges correctly as long as only a marginal number of processes fail; the failed processes do not have to be replaced (e.g. when using the blank mode). In this context, a marginal number can still mean 100 process failures in a 100,000-processor job.
The specification has proven powerful enough to support not just one kind of application but various approaches to handling fault-tolerance, leaving room for users to handle fault-tolerance according to the requirements of their applications.
In this paper we have presented an extension to the MPI specification for handling process fault tolerance. Together with the specification, the FT-MPI team at the Innovative Computing Laboratory of the University of Tennessee has developed an implementation of the specification and various application scenarios.
The current specification is in the spirit of the MPI-1 and MPI-2 documents: just as MPI-1 and MPI-2 do not restrict application developers to particular data decomposition techniques, FT-MPI does not specify how to handle fault-tolerance at the application level. Instead, FT-MPI offers a rich set of techniques for handling failed MPI processes and defines the status of MPI objects in case a failure occurs, leaving applications room to implement their preferred way of handling fault-tolerance.
Acknowledgments
This material is based upon work supported by the Department of Energy under Contract No. DE-FG02-02ER25536. The NSF CISE Research Infrastructure program EIA-9972889 supported the infrastructure used in this work.