3. FTMPI RECOVERY MODE IGNORE: in this mode, the recovery procedure does not have to be initiated at all, as long as no communication with the dead processes is required. Communication involving dead processes (point-to-point operations, collective operations, and communicator creation) raises an error and is not executed.
Rationale. This mode has been designed with two things in mind. First, since the recovery procedure is a collective operation, it can be desirable to avoid this collective operation for large numbers of processors (e.g. 100,000). Second, there is a class of applications, often referred to as 'naturally fault-tolerant', which do not require any special handling at the application level to deal with failed processes.
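The semantics of this mode can be illustrated with a small stand-alone model. The sketch below is not the FT-MPI interface: every name in it (sim_comm, sim_send, sim_collective, SIM_ERR_DEAD_PROCESS, and so on) is hypothetical, and it only mimics the rule stated above, namely that operations touching a dead process return an error while all other operations proceed without any recovery step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the IGNORE mode semantics. None of these names
   belong to FT-MPI or MPI; they are assumptions made for illustration. */
enum { SIM_SUCCESS = 0, SIM_ERR_DEAD_PROCESS = 1 };

#define SIM_MAX_RANKS 8

typedef struct {
    int  size;                  /* number of ranks in the communicator  */
    bool alive[SIM_MAX_RANKS];  /* alive[r] is false once rank r failed */
} sim_comm;

/* Create a communicator in which every rank is alive. */
void sim_comm_init(sim_comm *c, int size) {
    assert(size > 0 && size <= SIM_MAX_RANKS);
    c->size = size;
    for (int r = 0; r < size; ++r) {
        c->alive[r] = true;
    }
}

/* Mark one rank as failed. No recovery procedure is started. */
void sim_kill(sim_comm *c, int rank) {
    assert(rank >= 0 && rank < c->size);
    c->alive[rank] = false;
}

/* Point-to-point operation: fails only if the peer is dead. */
int sim_send(const sim_comm *c, int dest) {
    assert(dest >= 0 && dest < c->size);
    return c->alive[dest] ? SIM_SUCCESS : SIM_ERR_DEAD_PROCESS;
}

/* Collective operation (or communicator creation) over the listed ranks:
   fails if any participant is dead, and is then not executed at all. */
int sim_collective(const sim_comm *c, const int *ranks, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        assert(ranks[i] >= 0 && ranks[i] < c->size);
        if (!c->alive[ranks[i]]) {
            return SIM_ERR_DEAD_PROCESS;
        }
    }
    return SIM_SUCCESS;
}
```

In this model, after sim_kill(&c, 2) on a four-rank communicator, sim_send to rank 1 still succeeds, while sim_send to rank 2, or a collective whose rank list includes 2, returns SIM_ERR_DEAD_PROCESS. An application that never needs rank 2 again can therefore continue without ever paying for the collective recovery procedure.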
An MPI library can still abort if a pathological failure has occurred from which it cannot recover. Typical reasons for pathological failures could be:
All processes of an MPI job have failed before a recovery operation could be started.
The MPI library has no 'room left' in which to respawn processes.