Basic fault tolerance issues
Fault tolerance usually covers three steps:
Fault detection is the process of discovering that one or more processes have failed. The FT-MPI specification makes no statement about how faulty processes are discovered; it assumes only that the run-time environment discovers them. FT-MPI makes no assumption about when faulty processes are discovered, and it does not specify when a process is considered to have failed.
Notification deals with how the other MPI processes of a parallel job are informed about the failure event. FT-MPI makes no assumptions about when the processes are notified, nor does it assume that all processes are notified simultaneously. FT-MPI specifies only that all processes of a parallel job receive a notification about death events.
The notification of failed processes is passed to the MPI application through a special error code. To achieve the largest possible conformance with the MPI-1 and MPI-2 specifications, FT-MPI does not introduce a new error code; instead, it defines that MPI_ERR_OTHER is used to signal to the MPI application that some processes have unexpectedly left the run-time environment.
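The way an application might react to this error code can be sketched in C. This is a minimal illustration under stated assumptions, not part of the FT-MPI specification: the helper names is_death_notification and check_mpi_rc, the macros RC_SUCCESS and RC_DEATH_EVENT, and the switch SKETCH_USE_MPI are all hypothetical. Without SKETCH_USE_MPI the sketch uses stand-in constants so that it compiles without an MPI installation.

```c
#include <stdbool.h>

#ifdef SKETCH_USE_MPI
#include <mpi.h>
#define RC_SUCCESS     MPI_SUCCESS
#define RC_DEATH_EVENT MPI_ERR_OTHER
#else
/* Stand-in values so the sketch compiles without MPI.
   They are NOT the values used by any real MPI implementation. */
#define RC_SUCCESS     0
#define RC_DEATH_EVENT 1
#endif

/* Hypothetical helper: true if a return code signals that some
   processes have unexpectedly left the run-time environment. */
static bool is_death_notification(int rc)
{
    return rc == RC_DEATH_EVENT;
}

/* Hypothetical wrapper for the return code of any MPI call.
   Sets *failure_seen once a death notification arrives and never
   clears it. Returns true only if the call itself succeeded. */
static bool check_mpi_rc(int rc, bool *failure_seen)
{
    if (is_death_notification(rc)) {
        *failure_seen = true;
    }
    return rc == RC_SUCCESS;
}
```

An application would pass the return value of each MPI call to check_mpi_rc and inspect the flag afterwards. In standard MPI, return codes reach the caller only when the communicator's error handler is set to MPI_ERRORS_RETURN instead of the default MPI_ERRORS_ARE_FATAL; whether FT-MPI imposes the same requirement is not stated in this section.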
As soon as an application process has received the notification of a death event through the MPI error code MPI_ERR_OTHER, its general state changes from ’NO FAILURES’ to ’FAILURE RECOGNIZED’. While in this