Performance under failures of high-end computing

Wu, Ming; Sun, Xian-He; Jin, Hui

doi:10.1145/1362622.1362687

Cited by 29 publications

(23 citation statements)

References 13 publications

“…Requests in service are subject to failures, with failure rate α. Both service and failure times are exponentially distributed, a common assumption in reliability engineering [24,23]. In case of a failure, the request currently in service is lost, but the server itself is not affected, and continues to serve the next request that enters.…”

Section: Reference Model (mentioning)

confidence: 99%

Tackling Latency via Replication in Distributed Systems

Qiu

Pérez

Harrison

2016

Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering

Consistently high reliability and low latency are twin requirements common to many forms of distributed processing; for example, server farms and mirrored storage access. To address them, we consider replication of requests with canceling, i.e., initiate multiple concurrent replicas of a request and use the first successful result returned, canceling all outstanding replicas. This scheme has been studied recently, but mostly for systems with a single central queue, while server farms exploit distributed resources for scalability and robustness. We develop an approximate stochastic model to determine the response-time distribution in a system with distributed queues, and compare its performance against its centralized counterpart. Validation against simulation indicates that our model is accurate for not only the mean response time but also its percentiles, which are particularly relevant for deadline-driven applications. Further, we show that in the distributed set-up, replication with canceling has the potential to reduce response times, even at relatively high utilization. We also find that it offers response times close to those of the centralized system, especially at medium-to-high request reliability. These findings support the use of replication with canceling as an effective mechanism for both fault- and delay-tolerance.
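The reference model quoted in the citation statement above (exponential service, exponential failures at rate α, a failed request is lost) can be illustrated with a small sketch. It assumes a service rate named `mu`, which the excerpt does not name, and it ignores queueing entirely; it only checks the standard result that a request completes before failing with probability mu / (mu + alpha) when the two clocks are independent. It is not the citing paper's model of replication or response times.

```python
import random


def success_probability(mu: float, alpha: float) -> float:
    """Probability that service finishes before a failure strikes.

    Assumes independent exponential service (rate mu) and failure
    (rate alpha) times; mu is a hypothetical name for the service rate.
    """
    return mu / (mu + alpha)


def simulate_success_fraction(mu: float, alpha: float,
                              n_requests: int = 100_000,
                              seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability (no queueing)."""
    rng = random.Random(seed)
    completed = 0
    for _ in range(n_requests):
        service_time = rng.expovariate(mu)     # time needed to serve the request
        failure_time = rng.expovariate(alpha)  # time until the request would fail
        if service_time < failure_time:
            completed += 1                     # served before the failure: success
    return completed / n_requests


if __name__ == "__main__":
    mu, alpha = 1.0, 0.25  # illustrative values only
    print("analytic :", success_probability(mu, alpha))
    print("simulated:", simulate_success_fraction(mu, alpha))
```

With the illustrative values above, the analytic probability is 0.8, and the simulated fraction should land close to it.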

“…This is a common assumption [10,48], and an appropriate starting case. It is not a universal assumption however [38,51], and we address alternate distributions in Sections 5 and 6.…”

Section: Exponentially-distributed Node Failures (mentioning)

confidence: 99%

“…However, checkpoint sizes are increasing faster than checkpoint bandwidths [38]. It has been shown that the collision of these trends will render Exascale systems as "useless" due to checkpoint/restart overheads [12], and thus it is time for new reliability strategies to be explored [38,13,48].…”

Section: Introduction (mentioning)

confidence: 99%

A Model-Based Case for Redundant Computation

Stearley

Robinson

Ferreira

et al. 2011

Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, it has been shown that the utility of Exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternate resilience strategies must be considered, and redundancy is a proven unrivaled approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly's model for total application runtime, and find that his estimate for optimal checkpoint interval remains valid for redundant systems. We then identify conditions where redundancy is more cost effective than non-redundancy. These are done in the context of the number one supercomputers of the last decade, showing that thorough consideration of redundant computation is timely, if not overdue.
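The abstract above refers to an estimate for the optimal checkpoint interval without reproducing it. As a stand-in, the sketch below uses the classical first-order approximation usually attributed to Young, interval ≈ sqrt(2 · checkpoint cost · MTBF). This is an assumption made for illustration: it is not Daly's refined estimate and not the redundancy model developed in the citing paper, and the function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order (Young-style) checkpoint interval: sqrt(2 * cost * MTBF).

    Both arguments must use the same time unit; the result is in that unit.
    Used here as an illustrative stand-in, not as the citing paper's formula.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Illustrative numbers only: a 2-minute checkpoint and a 100-minute MTBF.
    print(young_checkpoint_interval(2.0, 100.0))
```

The approximation is generally considered reasonable only when the checkpoint cost is small relative to the mean time between failures.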

“…In [27], we have presented a performance model to estimate the mean, variance and distribution of a single sequential task computation time. We adopt this model to estimate the computation time of each subtask in the DAG.…”

Section: B Modeling Of Subtask Computation Time (mentioning)

confidence: 99%

“…We first predict the performance of subtasks based on our previous work [27], in which all subtasks are independent. This prediction provides the prediction of subtasks under one layer of a general DAG.…”

Section: Introduction (mentioning)

confidence: 99%

Performance under Failures of DAG-based Parallel Computing

Jin

Sun

Zheng

et al. 2009

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

Self Cite

As the scale and complexity of parallel systems continue to grow, failures become more and more an inevitable fact for solving large-scale applications. In this research, we present an analytical study to estimate execution time in the presence of failures of directed acyclic graph (DAG) based scientific applications and provide a guideline for performance optimization. The study is fourfold. We first introduce a performance model to predict individual subtask computation time under failures. Next, a layered, iterative approach is adopted to transform a DAG into a layered DAG, which reflects full dependencies among all the subtasks. Then, the expected execution time under failures of the DAG is derived based on stochastic analysis. Unlike existing models, this newly proposed performance model provides both the variance and distribution. It is practical and can be put to real use. Finally, based on the model, performance optimization, weak point identification and enhancement are proposed. Intensive simulations are conducted to verify the analytical findings. They show that the newly proposed model and weak point enhancement mechanism work well.
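The abstract above mentions transforming a DAG into a layered DAG but does not describe the procedure. One common way to layer a DAG, shown here purely as an assumption and not as the authors' iterative method, is to place each subtask one layer below its deepest predecessor. The function name and the input format (a mapping from each subtask to the set of subtasks it depends on) are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def layer_dag(predecessors: dict[str, set[str]]) -> list[list[str]]:
    """Group DAG nodes into layers by longest distance from a source node.

    predecessors maps each subtask to the subtasks it depends on.
    A node's layer is 0 if it has no predecessors, otherwise one more
    than the deepest layer among its predecessors.
    """
    level: dict[str, int] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        preds = predecessors.get(node, ())
        level[node] = 1 + max((level[p] for p in preds), default=-1)

    layers: list[list[str]] = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for node, lv in level.items():
        layers[lv].append(node)
    return [sorted(layer) for layer in layers]


if __name__ == "__main__":
    # Diamond-shaped example: a feeds b and c, which both feed d.
    diamond = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    print(layer_dag(diamond))  # [['a'], ['b', 'c'], ['d']]
```

Under this layering, subtasks within one layer have no dependencies on each other, which is the property that lets a per-layer analysis treat them as independent.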
