Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications

Rodriguez, Giovanna; González, Patricia

doi:10.1007/978-3-540-73940-1_15

Cited by 4 publications

(1 citation statement)

References 12 publications

“…However, in general, MPI is rarely selected for developing real-time data processing systems because it does not provide standardized fault tolerance interfaces and semantics. Although extensive research [27,28,29] has been conducted in this area, few available tools exist to help parallel programmers enhance their applications with fault tolerance support. Moreover, the exploitation of MPI is impeded by difficulties in software development.…”

Section: Introduction (mentioning)

confidence: 99%

OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing

Wei, Wang, Deng, et al. 2016. PASP

The volume of data generated by modern astronomical telescopes is extremely large and rapidly growing. However, current high-performance data processing architectures/frameworks are not well suited for astronomers because of their limitations and programming difficulties. In this paper, we therefore present OpenCluster, an open-source distributed computing framework to support rapidly developing high-performance processing pipelines of astronomical big data. We first detail the OpenCluster design principles and implementations and present the APIs facilitated by the framework. We then demonstrate a case in which OpenCluster is used to resolve complex data processing problems for developing a pipeline for the Mingantu Ultrawide Spectral Radioheliograph. Finally, we present our OpenCluster performance evaluation. Overall, OpenCluster provides not only high fault tolerance and simple programming interfaces, but also a flexible means of scaling up the number of interacting entities. OpenCluster thereby provides an easily integrated distributed computing framework for quickly developing a high-performance data processing system of astronomical telescopes and for significantly reducing software development expenses.

Performance evaluation of an application-level checkpointing solution on grids

Rodriguez, Pardo, Martín, et al. 2010. Future Generation Computer Systems

Application-Level Fault-Tolerance Solutions for Grid Computing

Díaz, Pardo, Martín, et al. 2008. 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)

One of the key functionalities provided by Grid systems is the remote execution of applications. This paper introduces a research proposal on fault-tolerance mechanisms for the execution of sequential and message-passing parallel applications on the Grid. A service-based architecture called CPPC-G is proposed. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpointing instrumentation into the application code. CPPC-G services will be in charge of the submission and monitoring of the application execution, management of checkpoint files generated by CPPC-enabled applications, and detection and automatic restart of failed executions. The development of the CPPC-G architecture will involve research in different areas such as storage and management of data files (checkpoint files); automatic selection of suitable computing resources; reliable detection of execution failures and robustness issues to make the architecture fault-tolerant itself.
