This section discusses the options available for collective operations. While it is possible to define when a point-to-point operation has failed or succeeded, it is considerably more difficult to give a similar definition for collective operations. The central question is which guarantees the MPI library makes to the application, namely whether
every process obtains the same return code for the collective operation (i.e. either every process succeeds or every process returns an error), and whether
the receive buffers are either correct on all processes or left untouched on all of them.
Therefore, FT-MPI specifies two different modes for handling collective operations:
1. FTMPI_COLL_MODE_ATOMIC: this mode gives the strong guarantee that either every process reports an error or none does.
2. FTMPI_COLL_MODE_NONATOMIC: no guarantee is given that all processes involved in the collective operation return the same code. Some processes might report that the operation has succeeded, while others report an error.
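The difference between the two modes can be made concrete with a small check over the return codes observed by the individual processes. The following C sketch is purely illustrative: `outcome_is_atomic` and `COLL_OK` are hypothetical names and not part of FT-MPI, and in a real run no single process sees all return codes at once.

```c
#include <assert.h>
#include <stddef.h>

#define COLL_OK 0 /* hypothetical stand-in for MPI_SUCCESS */

/* Returns 1 if the per-process return codes satisfy the atomic guarantee
 * (all processes succeeded, or all processes reported an error), else 0. */
int outcome_is_atomic(const int *codes, size_t nprocs)
{
    assert(codes != NULL && nprocs > 0);
    size_t failures = 0;
    for (size_t i = 0; i < nprocs; i++) {
        if (codes[i] != COLL_OK)
            failures++;
    }
    return failures == 0 || failures == nprocs;
}
```

A mixed outcome, in which some processes succeed and others report an error, is permitted in the non-atomic mode but can never occur in the atomic mode.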
Advice to users. The atomic mode seems very appealing to end users because of the strong guarantees it gives. Users should be aware, however, that these guarantees come at the price of higher memory consumption and longer execution times for collective operations.
Two features make the non-atomic mode nevertheless usable in fault-tolerant applications:
First, collective operations in MPI-1 and MPI-2 are defined such that the input buffers are not modified by the operation. A collective operation can therefore easily be repeated by the application. The exception is the use of the MPI_IN_PLACE argument of MPI-2, in which case the input buffer is overwritten with the result.
Second, behavior similar to the atomic mode can be achieved by adding a barrier operation after a collective operation. Using this technique, the user can choose which operations need atomic behavior and which do not. This can have a dramatic impact on application performance.
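Both techniques can be sketched from the point of view of a single process. The C code below is a minimal sketch under stated assumptions, not FT-MPI code: the collective and the barrier are abstracted as callbacks so that the example compiles without an MPI library, and all names (`collective_fn`, `barrier_fn`, `retry_collective`, `retry_collective_in_place`, `collective_then_barrier`, `COLL_OK`, `COLL_ERR`, and the `demo_*` helpers) are hypothetical. Recovery of the communicator between attempts is not modeled, and whether a process should still enter the barrier after a local failure is left open; this sketch skips it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COLL_OK  0 /* hypothetical stand-in for MPI_SUCCESS */
#define COLL_ERR 1 /* hypothetical stand-in for any MPI error code */

/* Hypothetical wrappers around one collective call and one barrier call. */
typedef int (*collective_fn)(const int *sendbuf, int *recvbuf, int count,
                             void *state);
typedef int (*barrier_fn)(void *state);

/* Repeat a failed collective. sendbuf is never modified, so every attempt
 * starts from the same input. Communicator recovery is not modeled. */
int retry_collective(collective_fn op, const int *sendbuf, int *recvbuf,
                     int count, void *state, int max_attempts)
{
    assert(op != NULL && max_attempts > 0);
    int rc = COLL_ERR;
    for (int attempt = 0; attempt < max_attempts && rc != COLL_OK; attempt++)
        rc = op(sendbuf, recvbuf, count, state);
    return rc;
}

/* In-place variant (the MPI_IN_PLACE case): buf is both input and output,
 * so the input is saved first and restored before every attempt. */
int retry_collective_in_place(collective_fn op, int *buf, int count,
                              void *state, int max_attempts)
{
    assert(op != NULL && count > 0 && max_attempts > 0);
    size_t bytes = (size_t)count * sizeof *buf;
    int *saved = malloc(bytes);
    if (saved == NULL)
        return COLL_ERR;
    memcpy(saved, buf, bytes);
    int rc = COLL_ERR;
    for (int attempt = 0; attempt < max_attempts && rc != COLL_OK; attempt++) {
        memcpy(buf, saved, bytes); /* undo damage from a failed attempt */
        rc = op(buf, buf, count, state);
    }
    free(saved);
    return rc;
}

/* Collective followed by a barrier: a process whose collective succeeded
 * locally reports success only if the barrier succeeds as well. */
int collective_then_barrier(collective_fn op, barrier_fn barrier,
                            const int *sendbuf, int *recvbuf, int count,
                            void *state)
{
    assert(op != NULL && barrier != NULL);
    int rc = op(sendbuf, recvbuf, count, state);
    if (rc != COLL_OK)
        return rc;         /* local failure is already known */
    return barrier(state); /* local success: check the other processes */
}

/* Demo stand-ins: the "collective" doubles its input and fails for the
 * first failures_left calls; the "barrier" returns a preset code. */
struct demo_state { int failures_left; int calls; int barrier_rc; };

int demo_collective(const int *sendbuf, int *recvbuf, int count, void *state)
{
    struct demo_state *s = state;
    s->calls++;
    if (s->failures_left > 0) {
        s->failures_left--;
        recvbuf[0] = -1; /* a failed call may leave partial results */
        return COLL_ERR;
    }
    for (int i = 0; i < count; i++)
        recvbuf[i] = 2 * sendbuf[i];
    return COLL_OK;
}

int demo_barrier(void *state)
{
    return ((struct demo_state *)state)->barrier_rc;
}
```

With a real MPI library, the callbacks would wrap calls such as MPI_Allreduce and MPI_Barrier, and error codes are only returned to the caller if the communicator's error handler is set to MPI_ERRORS_RETURN. The in-place variant shows why MPI_IN_PLACE is the exception: the extra copy is the price for being able to repeat the operation.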