which is to immediately abort the application. The second possibility is to hand control back to the user application (if possible) without guaranteeing that any further communication can occur. The main purpose of the latter mode is to give the application the opportunity to close all files properly, perhaps write a per-process checkpoint, and so on, before exiting.
The goal of this document is to bridge the gap between the increasingly robust, fault-tolerant hardware that has evolved over the last years and the main programming paradigm used by scientific applications, which does not offer process fault tolerance in its current specifications.
Organization of this Document
This document is organized as follows: in Chapter 2 we introduce the basic terminology and definitions used throughout the document. Chapters 3, 4, and 5 specify the different failure recovery models supported by FT-MPI. Chapter 6 defines various other minor improvements that help in writing fault-tolerant applications using the FT-MPI specification.