Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems

Graham E. Fagg, Edgar Gabriel, George Bosilca, Thara Angskun, Zhizhong Chen, Jelena Pjesivac-Grbovic, Kevin London and Jack J. Dongarra

Innovative Computing Laboratory, Department of Computer Science, 1122 Volunteer Blvd., Suite 413, University of Tennessee, Knoxville, TN-37996, USA.

{fagg, egabriel, bosilca, angskun, zchen, pjesa, london, dongarra}@cs.utk.edu

1.1 Trends in High Performance Computing

End-users and application developers of high performance computing systems today have access to larger machines and more processors than ever before. Systems such as the Earth Simulator, the ASCI-Q machines, or the IBM Blue Gene consist of thousands or even tens of thousands of processors. Machines comprising 100,000 processors are expected within the next few years.

A critical issue for systems with such large numbers of processors is the ability of the machine to deal with process failures. Extrapolating from current experience on top-end machines, a 100,000-processor machine will experience a process failure every few minutes [1]. While on earlier massively parallel processing systems (MPPs) a crashing node often led to a crash of the whole system, current architectures are more robust. Typically, the applications using the failed processor have to abort, but the machine as an entity is not affected by the failure. This robustness is the result of improvements in the hardware as well as in the system software.
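As a rough illustration of the scaling argument above, the sketch below estimates the mean time between failures of a whole machine from a per-processor value. It assumes independent failures at a constant rate, so the system-level interval is simply the per-processor interval divided by the number of processors. The function names and the input values are illustrative assumptions, not figures taken from [1].

```c
#include <assert.h>
#include <stdio.h>

/* Estimated mean time between failures (in minutes) of a machine with
 * num_procs processors. Assumption: failures are independent and occur
 * at a constant rate, so failure rates add up across processors.
 * per_proc_mtbf_min is a hypothetical input, not a measured value. */
double system_mtbf_minutes(double per_proc_mtbf_min, long num_procs)
{
    assert(per_proc_mtbf_min > 0.0 && num_procs > 0);
    return per_proc_mtbf_min / (double)num_procs;
}

/* Convenience helper: print the estimate for one configuration. */
void print_estimate(double per_proc_mtbf_min, long num_procs)
{
    printf("%ld processors: one failure every %.2f minutes on average\n",
           num_procs, system_mtbf_minutes(per_proc_mtbf_min, num_procs));
}
```

Under these assumptions, a hypothetical per-processor mean time between failures of one year (525,600 minutes) gives 525,600 / 100,000, or roughly 5.3 minutes, for a 100,000-processor machine. This is consistent in order of magnitude with the estimate of a failure every few minutes.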

1.2 Current Parallel Programming Paradigms

Current parallel programming paradigms for high-performance computing systems rely mainly on message passing, especially on the Message-Passing Interface (MPI) [12][13] specification. Shared memory concepts (e.g. OpenMP) or parallel programming languages (e.g. UPC, CoArrayFortran) offer a simpler programming paradigm for applications in parallel environments; however, they either lack the scalability to tens of thousands of processors or do not offer a feasible framework for complex, irregular applications. The message-passing paradigm, on the other hand, provides a means to write highly scalable algorithms, abstracting and hiding many architectural decisions from the application developers.

The current MPI specification, however, does not deal with the situation mentioned above, where one or more processes become unavailable during runtime. Currently, MPI gives the user a choice between two ways of handling failures. The first, which is also the default mode of MPI, is to abort the application immediately. The second is only slightly more flexible: it hands control back to the user application, without guaranteeing, however, that any further communication can occur. The main purpose of the latter mode is to give an application the opportunity to perform local operations before exiting, e.g. closing all files or writing a local checkpoint.
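The two modes described above correspond to the predefined MPI error handlers MPI_ERRORS_ARE_FATAL (the default) and MPI_ERRORS_RETURN. The following sketch shows how an application might select the second mode and use it only for local cleanup. The helper write_local_checkpoint, the checkpoint file name, and the broadcast used as the sample communication are hypothetical choices for illustration. Whether a process failure is actually reported as an error code, rather than as a hang or an abort, depends on the MPI implementation.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: save this process's state to a local file.
 * Uses no MPI calls, since communication may no longer be possible. */
static void write_local_checkpoint(int rank, int value)
{
    char name[64];
    FILE *f;

    snprintf(name, sizeof name, "checkpoint.%d", rank);
    f = fopen(name, "w");
    if (f != NULL) {
        fprintf(f, "%d\n", value);
        fclose(f);
    }
}

int main(int argc, char **argv)
{
    int rank, rc, len;
    int token = 42;                       /* illustrative payload */
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Replace the default handler (MPI_ERRORS_ARE_FATAL) so that
     * errors on this communicator are returned as error codes. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* No guarantee that further communication works:
         * restrict ourselves to local operations, then exit. */
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
        write_local_checkpoint(rank, token);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}
```

Building and running this requires an MPI implementation, typically through its compiler wrapper and launcher (for example mpicc and mpirun). Note that the error branch performs only local file I/O before aborting, which mirrors the limited role of the second mode.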
