1.2 Current Parallel Programming Paradigms

Chapter 2

Basic fault tolerance issues

Fault tolerance usually covers three steps:

Fault detection

Notification

Recovery

Fault detection is the process of discovering that one or more processes have failed. The FT-MPI specification makes no statement about how faulty processes are discovered; it assumes only that the run-time environment discovers them. FT-MPI makes no assumption about when faulty processes are discovered, nor does it specify when a process is considered to have failed.

Notification deals with the problem of how the other MPI processes of a parallel job are informed about the failure event. FT-MPI makes no assumptions about when the processes are notified, nor does it assume that all processes are notified simultaneously. FT-MPI specifies only that all processes of a parallel job receive a notification about death events.

The notification of failed processes is passed to the MPI application through a special error code. To achieve the largest possible conformance with the MPI-1 and MPI-2 specifications, FT-MPI does not introduce a new error code, but defines that MPI_ERR_OTHER is used to signal to the MPI application that some processes have unexpectedly left the run-time environment.
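
The use of the error code described above can be illustrated with a small C sketch. It is not taken from the FT-MPI specification: the type call_outcome and the function classify_return_code are hypothetical names introduced here for illustration only. So that the sketch compiles without <mpi.h>, the success and failure codes are passed in as parameters; a real caller would presumably pass MPI_SUCCESS and MPI_ERR_OTHER.

```c
/* Hypothetical outcome categories; not part of MPI or FT-MPI. */
typedef enum {
    CALL_OK,              /* the call returned the success code */
    CALL_PROCESS_FAILURE, /* the call returned the code that signals a death event */
    CALL_OTHER_ERROR      /* the call returned any other error code */
} call_outcome;

/*
 * Classify the return code of an MPI call.
 *
 * rc           - value returned by the MPI call
 * success_code - assumed to be MPI_SUCCESS in a real program
 * failure_code - assumed to be MPI_ERR_OTHER in a real program
 *
 * The success code is checked first, so a call that succeeded is never
 * reported as a failure.
 */
static call_outcome classify_return_code(int rc, int success_code, int failure_code)
{
    if (rc == success_code) {
        return CALL_OK;
    }
    if (rc == failure_code) {
        return CALL_PROCESS_FAILURE;
    }
    return CALL_OTHER_ERROR;
}
```

An application would call, for example, classify_return_code(rc, MPI_SUCCESS, MPI_ERR_OTHER) on the return value rc of each MPI call. Note that in standard MPI a return code only reaches the application if the error handler of the communicator is set to MPI_ERRORS_RETURN, since the default handler aborts the job. Whether FT-MPI requires the same step is not stated in this section.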

As soon as an application process receives the notification of a death event through the MPI error code MPI_ERR_OTHER, its general state changes from 'NO FAILURES' to 'FAILURE RECOGNIZED'. While in this
