This section explains the expected behavior of messages before, during and after recovery. The major problem arises from the fact that typically some messages will be "within the system" when an error occurs. In this section, we define the behavior of messages which are in flight when an error occurs. Two general rules apply for all message modes:
1. All messages from and to dead processes are discarded, independent of recovery, communicator or message mode.
2. All collective operations will stop immediately and all messages initiated by collective operations will be discarded, independent of recovery, communicator or message mode. In the following subsection, we will further discuss the behavior of collective operations when an error occurs.
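The two general rules can be read as a simple predicate over a message that is in flight when the error occurs. The following is a minimal sketch only: the type msg_t, the function discarded_by_general_rules and the alive array are hypothetical names introduced for illustration and are not part of the FT-MPI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message descriptor; NOT part of the FT-MPI API. */
typedef struct {
    int  src;             /* rank of the sending process */
    int  dst;             /* rank of the receiving process */
    bool from_collective; /* true if initiated by a collective operation */
} msg_t;

/* Returns true if a message that is in flight when the error occurs is
 * discarded under the two general rules. alive[r] is true if rank r
 * survived the failure (assumed bookkeeping, for illustration only). */
bool discarded_by_general_rules(const msg_t *m, const bool *alive,
                                size_t nprocs)
{
    assert(m != NULL && alive != NULL);
    assert(m->src >= 0 && (size_t)m->src < nprocs);
    assert(m->dst >= 0 && (size_t)m->dst < nprocs);

    /* Rule 1: messages from or to a dead process are discarded. */
    if (!alive[m->src] || !alive[m->dst])
        return true;

    /* Rule 2: messages initiated by collective operations are discarded. */
    if (m->from_collective)
        return true;

    /* Otherwise the outcome depends on the selected message mode. */
    return false;
}
```

With three processes of which rank 1 has died, a point-to-point message from rank 0 to rank 1 is discarded by rule 1, a collective message between the surviving ranks 0 and 2 is discarded by rule 2, and a point-to-point message between ranks 0 and 2 is left to the message mode.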
To explain the difference between the two message modes provided by the FT-MPI specification, we would like to introduce the terminology of a generation count for communicators. If MPI_COMM_WORLD has a generation count of x before a process fails, MPI_COMM_WORLD will have a generation count of y after recovery, with y > x. A generation count is not a feature an end-user has to be aware of, but the term eases the definition of the following two message modes:
FTMPI_MSG_MODE_RESET: This mode specifies that a message sent from process a to process b using a communicator with a generation count x cannot be received with any communicator having the generation count y, even if the processes a and b are both surviving processes. This mode basically implies that all ongoing and