Fault-tolerant solutions for a MPI compute intensive application

Mouriño, J. Carlos

doi:10.1109/pdp.2007.44

Cited by 4 publications

(4 citation statements)

References 14 publications


“…In [8], for computationally intensive applications using MPI, two approaches for checkpoint based fault tolerance is proposed. Firstly, segment-level solution, an extension of a checkpoint library for sequential codes.…”

Section: Background and Related Work

mentioning

confidence: 99%

Can Agent Intelligence Be Used to Achieve Fault Tolerant Parallel Computing Systems?

Varghese

McKee

Alexandrov

2011

Parallel Process. Lett.


The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, "What agent capabilities are required for fault tolerance?", "What parallel computational tasks can benefit from such agent capabilities?" and "How can agent capabilities be implemented for fault tolerance?" need to be addressed. Cognitive capabilities essential for achieving fault tolerance through agents are considered. Parallel reduction algorithms are identified as a class of algorithms that can benefit from cognitive agent capabilities. The Message Passing Interface is utilized for implementing an intelligent agent based approach. Preliminary results obtained from the experiments validate the feasibility of an agent based approach for achieving fault tolerance in parallel computing systems.



“…Mourino et al propose two approaches for checkpoint based fault tolerance in computationally intensive applications using MPI [7]. Firstly, segment-level solution, an extension of a checkpoint library for sequential codes.…”

Section: Introduction

mentioning

confidence: 99%

A Transition from Traditional Checkpointing towards Multi-Agent based Approaches

McKee

Varghese

Alexandrov

2010

IJCTE


Abstract: Middleware for parallel computing systems incorporate checkpointing to achieve fault tolerance. Most traditional checkpointing approaches tend to be less dynamic in large scale parallel computing environments. Hence, there arises a need for an adaptive and dynamic approach. The work reported in this paper, proposes a multi-agent based approach for fault tolerance. Five resources namely, the executed problem, parallel computing platform, middleware, hardware abstraction and agents that contribute towards the infrastructure of the proposed approach is considered. The approach is implemented on a computer cluster and experimental results are presented to validate the feasibility of the approach and its contribution towards enhancing fault tolerance.

Index Terms: middleware approach, multi-agent, fault tolerance, parallel computing systems.


“…It is not desirable to have to restart a job from the beginning if it has been executing for hours or days or months [6]. A key challenge in maintaining the seamless (or near seamless) execution of such jobs in the event of failures is addressed under research in fault tolerance [7,8,9,10]. Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job to which execution can be returned if a fault occurs.…”

mentioning

confidence: 99%

“…Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job to which execution can be returned if a fault occurs.…”

mentioning

confidence: 99%

Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches

Varghese

McKee

Alexandrov

2014

Computers in Biology and Medicine


Varghese, B., McKee, G., & Alexandrov, V. (2014). Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches.

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance.

Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application.

Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.

Keywords: high-performance computing | fault tolerance | biological jobs | multi-agents | seamless execution | checkpoint

Introduction: The scale of resources and computations required for executing large-scale biological jobs are significantly increasing [1,2]. With this increase the resultant number of failures while running these jobs will also increase and the time between failures will decrease [3,4,5]. It is not desirable to have to restart a job from the beginning if it has been executin...
