Proceedings of the 16th International Conference on Supercomputing 2002
DOI: 10.1145/514191.514205

A network-failure-tolerant message-passing system for terascale clusters

Abstract: The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures, including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple netw…

Citations: cited by 31 publications (9 citation statements)
References: 2 publications
“…Graham et al provide a detailed description of the problem and have introduced LA-MPI in order to take advantage of network-based fault-tolerance [18]. Network fault-tolerance solutions, such as LA-MPI, complement our work.…”
Section: Related Work (mentioning)
confidence: 57%
“…The remaining work on distributed transparent checkpointing can be divided into two categories: 1) User-level MPI libraries for checkpointing [4], [5], [12], [14], [15], [32], [34], [36], [37]: works for distributed processes, but only if they communicate exclusively through MPI (Message Passing Interface). Typically restricted to a particular dialect of MPI.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our work is different from theirs in the respect that we are analyzing the benefits of different RDMA semantics to reduce the number of control messages. Majumder et al [18] have proposed an event based progress mechanism for LA-MPI [9]. They indicate the benefits of such an approach to overlap in applications.…”
Section: Related Work (mentioning)
confidence: 99%