Extensions to the Message-Passing Interface for Process Fault –Tolerance
This section summarizes the FT-MPI specification. The full document can be found in Appendix A. Handling fault-tolerance typically consists of three steps: failure detection, notification and recovery. The FT-MPI specification makes no general assumptions about the first two issues, with the exception that it assumes, that the run-time environment discovers failing processes and all processes of the according parallel jobs are notified about the failure events.
The notification of failed processes is passed to the MPI application through the usage of a special error code. As soon as an application process has received the notification of a death event through this error code, its general state is changing from no failures to failure recognized. While in this state, the process is just allowed to execute certain actions. These actions are depending on various parameters and are detailed later in the document.
The recovery procedure is considered to consist of two steps again: recovering the MPI library and the run-time environment, and recovering the application. The latter one is considered to be the responsibility of the application.
The main problems the FT-MPI specification is dealing with are answers to the following questions:
What are the necessary steps and options to start the recovery procedure and therefore change the state of the processes back to no failure?
What is the status of the MPI objects after recovery?
What is the status of ongoing communication and messages during and after recovery?
The first question is handled by the so-called recovery mode, the second by the communicator mode, the third by the message mode respectively the collective communication mode.
The recovery mode defines how the recovery procedure can be started. Currently, there are three options defined:
an automatic recovery mode, where the recovery procedure is started automatically by the MPI library as soon as a failure event has been recognized
a manual recovery mode, where the application has to start the recovery procedure through the usage of a special MPI function
a recovery mode, where the recovery procedure does not have to be initiated at all. However, any communication to failed processes will raise an error.
The status of MPI objects after the recovery operation is depending on whether they contain some global information or not. As for MPI-1, the only objects containing global information are groups and communicators. These objects are ‘destroyed’ during the recovery procedure and only the objects available after MPI_Init are re-instantiated by the library (MPI_COMM_WORLD and MPI_COMM_SELF).
Communicators and group can have different formats after recovery operation. Failed processes can either be replaced (FTMPI_COMM_MODE_REBUILD), or not. In case the failed processes are not replaced, the user still has two choices: the position of the failed process can be left empty in groups and communicators (FTMPI_COMM_MODE_BLANK) or the groups and communicators can shrink such that no gap is left (FTMPI_COMM_MODE_SHRINK). For both modes a precise description of all MPI-1 functions are given in Appendix A.
Furthermore, the specification has to clarify what the status of currently ongoing messages is while an error occurs and is recognized. In one mode, all currently ongoing messages are cancelled by the system. This mode is mainly useful for applications, which on an error roll-back to the last consistent state in the application. As an example, if an error occurs in iteration 423 and the last consistent state of the application is from iteration 400, than all ongoing messages from iteration 423 would just confuse the application after having performed the roll-back. The