A fault tolerant protocol for massively parallel systems

Chakravorty, Sayantan; Kalé, Laxmikant V.

doi:10.1109/ipdps.2004.1303244

Cited by 26 publications

(22 citation statements)

References 18 publications

“…In order to calculate the parameter $P_X(X; S)$, we should enumerate the number of all existing paths facing the X-shape fault pattern and divide them by the number of all existing paths in the connected $R \times C$ torus network. This probability is expressed formally as

$$P_{hit} = \frac{\text{The number of minimal paths crossing the fault region}}{\text{The number of all minimal paths existing in the network}} \qquad (1)$$

The following theorem provides the total number of paths with minimal length in the network. ( , )…”

Section: Remark: a Path Facing The Fault-pattern Means That There Exi (mentioning)

confidence: 99%
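
The ratio in the quoted statement (minimal paths crossing the fault region over all minimal paths) can be illustrated by brute-force counting on a small torus. The following is a minimal sketch, not the closed-form result from the cited paper. The function names `axis_moves`, `count_paths`, and `p_hit` are hypothetical, and the sketch assumes that a fault pattern is given as a set of `(row, col)` nodes, that a path "crosses" the pattern if it visits at least one faulty node, that sources and destinations are healthy nodes, and that each dimension has at least 3 nodes.

```python
from itertools import product
from math import comb


def axis_moves(a, b, n):
    """Minimal ways to go from a to b on a ring of n nodes, as (step, hops)."""
    fwd, bwd = (b - a) % n, (a - b) % n
    if fwd == 0:
        return [(0, 0)]
    if fwd < bwd:
        return [(1, fwd)]
    if bwd < fwd:
        return [(-1, bwd)]
    return [(1, fwd), (-1, bwd)]  # tie: both directions are minimal


def count_paths(src, dst, rows, cols, faulty):
    """Return (total, hit) for minimal paths from src to dst.

    total: number of minimal paths; hit: those visiting a faulty node.
    Assumes rows >= 3 and cols >= 3 (illustrative restriction).
    """
    total = hit = 0
    row_opts = axis_moves(src[0], dst[0], rows)
    col_opts = axis_moves(src[1], dst[1], cols)
    for (dr, nr), (dc, nc) in product(row_opts, col_opts):
        # clean[i][j]: fault-free paths after i row hops and j column hops
        clean = [[0] * (nc + 1) for _ in range(nr + 1)]
        for i in range(nr + 1):
            for j in range(nc + 1):
                node = ((src[0] + dr * i) % rows, (src[1] + dc * j) % cols)
                if node in faulty:
                    continue  # no fault-free path may pass through here
                if i == 0 and j == 0:
                    clean[i][j] = 1
                else:
                    up = clean[i - 1][j] if i else 0
                    left = clean[i][j - 1] if j else 0
                    clean[i][j] = up + left
        paths = comb(nr + nc, nr)  # interleavings of row and column hops
        total += paths
        hit += paths - clean[nr][nc]
    return total, hit


def p_hit(rows, cols, faulty):
    """Fraction of minimal paths, over all ordered healthy pairs, that hit a fault."""
    healthy = [n for n in product(range(rows), range(cols)) if n not in faulty]
    total = hit = 0
    for s in healthy:
        for d in healthy:
            if s != d:
                t, h = count_paths(s, d, rows, cols, faulty)
                total += t
                hit += h
    return hit / total
```

For example, `count_paths((0, 0), (2, 2), 5, 5, {(1, 1)})` counts 6 minimal paths, 4 of which pass through the faulty node. Averaging over all ordered pairs of healthy nodes, as `p_hit` does, is one possible reading of "all minimal paths existing in the network".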

“…To be able to adapt with faults without serious degradation of the service, networks and routing protocols have to be set up so that they are fault-tolerant. Several recent studies address fault-tolerance in a diverse range of systems and applications [1][2][3][4][5][6][7][8][9][10][11][12]. Almost all of the performance evaluation studies for functionality of these systems, however, have made use solely of simulation experiments.…”

Section: Introduction (mentioning)

confidence: 99%

On the Probability of Facing Fault Patterns: A Performance and Comparison Measure of Network Fault-Tolerance

Safaei

Khonsari

Moraveji

2008

Computational Science – ICCS 2008

Abstract. An important issue in the design and deployment of interconnection networks is network fault-tolerance for various types of failures. In designing parallel processors that use the torus as the underlying interconnection topology, as well as in designing real applications on such processors, estimates of the network reliability and fault-tolerance are important in choosing the routing algorithms and predicting their performance in the presence of faulty nodes. Under the node-failure model, the faulty nodes may coalesce into fault patterns, which are classified into two major categories, i.e., convex (|-shaped, -shaped) and concave (L-shaped, T-shaped, +-shaped, H-shaped, U-shaped) regions. In this correspondence, we propose the first solution for computing the probability of a message facing the fault patterns in tori, both for convex and concave regions, that is verified using simulation experiments. Our approach works for any number of faults as long as the network remains connected. We use these models to measure the network fault-tolerance that can be achieved by adaptive routings, and to assess the impact of various fault patterns on the performance of such networks.

“…In recent years, many researchers have addressed to several issues in the field of fault-tolerance and reliability analysis of large scale parallel and distributed systems [4][5][6][7][8][9][10][11][12][13][14][15][16]. These researches span a diverse range of systems and applications such as massively parallel processors [8], cluster-based systems [9], mobile systems [10], sensor networks [11], and more recently network on chip [1].…”

Section: Introduction (mentioning)

confidence: 99%

On Quantifying Fault Patterns of the Mesh Interconnect Networks

Safaei

Khonsari

Ould‐Khaoua

et al. 2007

21st International Conference on Advanced Information Networking and Applications (AINA '07)

“…Charm++ consists of a variety of broadly applicable high-performance tools integrated in a single run-time system. Virtualization techniques are employed for hiding latency via message-driven execution [2], automatic application-independent load balancing [3], automatic communication optimization [4], check-pointing [5], fault tolerance [6,7], and performance visualization and analysis [8]. All of these tools help make a parallel code run better, but even with Charm++, developing a new parallel program still requires many hours of effort.…”

Section: Introduction (mentioning)

confidence: 99%

ParFUM: a parallel framework for unstructured meshes for scalable dynamic physics applications

Lawlor

Chakravorty

Wilmarth

et al. 2006

Engineering with Computers

Unstructured meshes are used in many engineering applications with irregular domains, from elastic deformation problems to crack propagation to fluid flow. Because of their complexity and dynamic behavior, the development of scalable parallel software for these applications is challenging. The Charm++ Parallel Framework for Unstructured Meshes allows one to write parallel programs that operate on unstructured meshes with only minimal knowledge of parallel computing, while making it possible to achieve excellent scalability even for complex applications. Charm++'s message-driven model enables computation/communication overlap, while its run-time load balancing capabilities make it possible to react to the changes in computational load that occur in dynamic physics applications. The framework is highly flexible and has been enhanced with numerous capabilities for the manipulation of unstructured meshes, such as parallel mesh adaptivity and collision detection.
