
1.2 Current Parallel Programming Paradigms - page 14 / 33


BASIC FAULT-TOLERANCE ISSUES


state, the process is allowed to execute only certain actions. These actions depend on various other parameters and are detailed later in the document.

Rationale. While introducing a new error code indicating failed processes would have been desirable, we currently consider it more important that an application written according to the FT-MPI specification can run on any regular, non-FT-MPI-conformant implementation of MPI than to have a 'cleaner' solution at this point. It is, however, still desirable to introduce a separate error code in future specifications. A future version of FT-MPI will furthermore deal with the problem of whom to notify in dynamic MPI-2 environments.

Advice to implementors. A high-quality implementation of the FT-MPI specification shall distinguish, to the largest possible extent, whether a process has died due to an error in the application (e.g. segmentation violation) or because of a failure in the hardware or run-time environment.

The recovery procedure comprises all steps necessary to move the status of the MPI application processes and the MPI run-time environment from 'FAILURE RECOGNIZED' back to 'NO FAILURE'. Most of the FT-MPI specification deals with the problem of how to move processes back into the 'NO FAILURE' mode, and which options are given to the user.
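The two states named above can be pictured as a small state machine. The following is only a minimal sketch: the class `Process`, the action names `"communicate"` and `"recover"`, and the table of allowed actions are hypothetical and are not part of the FT-MPI specification, which defines the actual permitted actions later in the document.

```python
from enum import Enum


class FTState(Enum):
    """The two states named in the text; the enum itself is illustrative."""
    NO_FAILURE = "NO FAILURE"
    FAILURE_RECOGNIZED = "FAILURE RECOGNIZED"


# Hypothetical action names: this sketch assumes that only recovery is
# permitted once a failure has been recognized.
ALLOWED_ACTIONS = {
    FTState.NO_FAILURE: {"communicate"},
    FTState.FAILURE_RECOGNIZED: {"recover"},
}


class Process:
    """Toy model of one application process and its failure state."""

    def __init__(self):
        self.state = FTState.NO_FAILURE

    def notice_failure(self):
        # The process learns that some other process has failed.
        self.state = FTState.FAILURE_RECOGNIZED

    def perform(self, action):
        # Reject any action that is not allowed in the current state.
        if action not in ALLOWED_ACTIONS[self.state]:
            raise RuntimeError(f"{action!r} not allowed in state {self.state.value}")
        if action == "recover":
            # A completed recovery moves the process back to NO FAILURE.
            self.state = FTState.NO_FAILURE
        return action
```

In this model, a process that has noticed a failure can do nothing but recover, after which normal communication is possible again.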

The recovery procedure is considered to have two steps:

1. Recovering the MPI run-time environment and the MPI library. This step will be handled in great detail in the following sections.

2. Recovering the application and application data. This step is considered to be the responsibility of the application and not of the MPI library. The FT-MPI specification makes no assumptions or statements about how an application recovers data from one or several lost processes.

Rationale. In contrast to many currently available projects, FT-MPI does not provide an interface for checkpointing and recovering user data. Such an interface might be added in later versions of the FT-MPI specification, but is not considered in the current version.
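The division of responsibility between the two recovery steps can be sketched as follows. This is an illustration under stated assumptions, not an FT-MPI interface: the function `recover`, both callback signatures, and the idea that step 1 reports the ranks of the lost processes are all hypothetical. The data-restoring strategy (copies saved by the application itself) is just one possible user-side choice, since the specification deliberately leaves this open.

```python
from typing import Callable, Dict, List


def recover(
    recover_runtime: Callable[[], List[int]],
    recover_application: Callable[[List[int]], None],
) -> None:
    """Run the two recovery steps in order.

    recover_runtime stands in for step 1 (library side) and is assumed to
    return the ranks of the lost processes. recover_application is step 2
    and is supplied entirely by the application.
    """
    lost_ranks = recover_runtime()
    recover_application(lost_ranks)


# Application data per rank, plus copies the application saved on its own.
data: Dict[int, List[float]] = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
saved: Dict[int, List[float]] = {rank: list(values) for rank, values in data.items()}

# Simulate the failure of rank 1: its data is gone.
del data[1]


def fake_runtime_recovery() -> List[int]:
    # Placeholder for step 1: pretend the run-time environment is repaired
    # and report which ranks were lost.
    return [1]


def restore_from_saved(lost_ranks: List[int]) -> None:
    # Step 2, user side: rebuild the data of each lost rank from the copies.
    for rank in lost_ranks:
        data[rank] = list(saved[rank])


recover(fake_runtime_recovery, restore_from_saved)
```

Passing the application step as a callback mirrors the point made in the text: the library side never needs to know how the user data is restored.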
