Application developers and end-users of high performance computing systems today have access to larger machines and more processors than ever before. High-end systems nowadays consist of thousands of processors. Additionally, not only are the individual machines getting bigger, but with the recently increased network capacities, users have access to a larger number of machines and computing resources. Concurrently using several computing resources, often referred to as Grid- or Metacomputing, further increases the number of processors used in each single job as well as the overall number of jobs that a user can launch.
As the number of processors increases, however, so does the probability that an application will face a node or link failure. While on earlier massively parallel processing systems (MPPs) a crashing node often meant a crash of the whole system, current systems are more robust. Usually, the application running on the failed node has to abort; the system as a whole, however, is not affected by a processor failure. In Grid environments, a system may additionally become unavailable for a certain time due to network problems, which, from the application's point of view, poses a problem similar to a crashing node on a single system.
The Message Passing Interface (MPI) [1, 2] is the de-facto standard for communication in scientific applications. However, MPI in its current specification gives the user no possibility to handle the situation mentioned above, where one or more processors become unavailable during runtime. Current MPI specifications give the user the choice between two possibilities for handling a failure. The first possibility is the default mode,