Improving Efficiency of Markov Chain Analysis of Complex Distributed Systems

Dabrowski, Christopher; Hunt, Fern Y.; Morrison, Katherine M.

doi:10.6028/nist.ir.7744

Cited by 3 publications

(14 citation statements)

References 42 publications

(135 reference statements)


“…The transitions in the cut set can thus be identified as critical transitions, which serve as a basis for describing potential failure scenarios. In previous work [12], we reported the results of experiments which showed that minimal s-t cut set analysis could be used to find all critical state transitions in an absorbing DTMC for a much smaller grid computing system at 1/100th the computation cost of large-scale simulations. This exhaustive analysis need not be repeated for the problem described in this paper, as there is not the space for it.…”

Section: Discussion (mentioning)

confidence: 99%

“…These savings increase even more dramatically if combinations of three critical transitions are considered. Though further research is necessary, it is our belief that both in [12] and in this study, we have described an analytical approach that can aid in understanding where and how catastrophic failures may occur in complex systems. The results to date have shown that the approach is tractable for the types of problems we have examined.…”

Section: Discussion (mentioning)

confidence: 99%

“…Cut set enumeration algorithms are known to be computationally expensive for large problems. For instance in [12], a Markov chain with 50 states, though sparse, was found to contain over 10^8 minimal s-t cut sets on paths between the initial and absorbing states. Further, computational characteristics of directed graphs are not well understood and remain a topic for future work.…”

Section: Discussion (mentioning)

confidence: 99%
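
The citation statements above refer to enumerating minimal s-t cut sets between the initial and absorbing states of a DTMC. As a rough illustration only, the sketch below enumerates them by brute force on a small graph. The seven-edge graph, the state numbering, and the function names are all hypothetical and are not taken from the cited papers; the brute-force search stands in for whatever enumeration algorithm the authors actually used. Its cost grows exponentially with the number of edges, which is in line with the remark that enumeration becomes expensive for large problems.

```python
from itertools import combinations

# Hypothetical transition graph: an edge (u, v) means P[u][v] > 0.
# State 0 is the initial state and state 4 the absorbing state (assumed).
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]


def reachable(edges, source, target):
    """Return True if target can be reached from source along directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def minimal_st_cut_sets(edges, source, target):
    """Enumerate minimal s-t cut sets by brute force over edge subsets.

    Subsets are visited in order of increasing size, so any candidate that
    contains an already-found cut cannot be minimal and is skipped.
    """
    cuts = []
    for size in range(1, len(edges) + 1):
        for candidate in combinations(edges, size):
            removed = frozenset(candidate)
            if any(cut <= removed for cut in cuts):
                continue
            remaining = [e for e in edges if e not in removed]
            if not reachable(remaining, source, target):
                cuts.append(removed)
    return cuts


cut_sets = minimal_st_cut_sets(EDGES, 0, 4)
for cut in cut_sets:
    print(sorted(cut))
```

Each printed set is a group of transitions whose joint removal disconnects the initial state from the absorbing state; in the approach described above, such transitions are the candidates for critical transitions.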

“…This DTMC is based on a discrete event large-scale simulation model described in [10], which we also use as a proxy for a real-world system in this study. This 39-state model is far larger than previous models we have analyzed [11][12][13], which consisted of only seven states. To this DTMC, we apply our combined method to find likely failure scenarios, which we verify in the large-scale model.…”

Section: Introduction (mentioning)

confidence: 97%

“…VII), and conclusions (Sec VIII). While the approach presented here employs minimal s-t cut set analysis as its basis, elsewhere we describe the use of spectral methods for eigendecomposition to identify critical state transitions [13] and algorithms for exhaustive search of a Markov chain transition probability matrix [12].…”

Section: Introduction (mentioning)

confidence: 99%


Identifying Failure Scenarios in Complex Systems by Perturbing Markov Chain Models

Dabrowski

Hunt

2011

Volume 6: Materials and Fabrication, Parts a and B

Self Cite


In recent years, substantial research has been devoted to monitoring and predicting performance degradations in real-world complex systems within large entities such as nuclear power plants, electrical grids, and distributed computing systems. Special challenges are posed by the fact that such systems operate in uncertain environments, are highly dynamic, and exhibit emergent behaviors that can lead to catastrophic failure. Discrete Time Markov chains (DTMCs) provide important tools for analysis of such systems, because they represent dynamic behavior succinctly, provide a means to measure uncertainty, and can be used to make quantitative measurements of the potential for change to system performance. Moreover, DTMCs can be extended to be time-inhomogeneous, i.e., to represent behavior that varies over long durations. To date, DTMCs have been proposed for tasks such as fault detection and long-term condition equipment monitoring in real-world complex systems. However, the scope of these models has generally been restricted to describing states and state transitions that directly concern fault conditions or states of degradation. Less work has been done on using DTMCs to represent a more complete range of states a system may enter into during normal operation. Of special interest are sequences of states that involve failure scenarios, in which a system evolves from a normal operating state into an undesirable state that leads to widespread performance degradation. Unfortunately, use of large DTMCs often involves large search spaces, a problem which in part motivates our work. This paper describes progress made on developing an approach for using larger, more detailed DTMC models of operational complex systems to uncover potential failure scenarios. The approach uses a combination of methods to perturb a DTMC, simulate alternative system evolutions, and identify scenarios in which a system proceeds from normal operation to failure.
Key to the approach is the use of graph theory techniques to reduce the size of the search space involved in exploring alternative behaviors. We show how graph theory techniques can be used to identify critical state transitions which can be perturbed to simulate performance degradation. Using critical transitions, it is also possible to estimate the rate of performance degradation and to understand how this rate is likely to change in response to increased failure incidence. Examples are provided of the use of this approach on a DTMC of significant size to identify failure scenarios in a distributed resource allocation system.
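
The idea of perturbing a critical transition and simulating the resulting system evolution can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four-state chain, its probabilities, the choice of which transition to perturb, and the helper names `step`, `perturb`, and `failure_probability` are all hypothetical. The sketch moves probability mass from a "recover" transition to a "fail" transition and propagates the state distribution forward to see how the probability of ending in the failed state responds.

```python
def step(distribution, matrix):
    """Advance a state distribution by one time step: pi' = pi * P."""
    n = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def perturb(matrix, src, dst, delta, donor):
    """Return a copy of matrix with up to delta moved from (src, donor) to (src, dst)."""
    perturbed = [row[:] for row in matrix]
    delta = min(delta, perturbed[src][donor])  # keep every entry non-negative
    perturbed[src][dst] += delta
    perturbed[src][donor] -= delta
    return perturbed


def failure_probability(matrix, start, failed, steps):
    """Probability of being in the failed state after the given number of steps."""
    distribution = [0.0] * len(matrix)
    distribution[start] = 1.0
    for _ in range(steps):
        distribution = step(distribution, matrix)
    return distribution[failed]


# Hypothetical chain: 0 = normal, 1 = degraded,
# 2 = completed (absorbing), 3 = failed (absorbing).
P = [
    [0.70, 0.10, 0.20, 0.00],
    [0.30, 0.50, 0.10, 0.10],
    [0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
]

baseline = failure_probability(P, 0, 3, 200)
results = []
for delta in (0.0, 0.1, 0.2, 0.3):
    # Shift mass from "degraded -> normal" to "degraded -> failed".
    perturbed = perturb(P, 1, 3, delta, 0)
    results.append((delta, failure_probability(perturbed, 0, 3, 200)))

for delta, prob in results:
    print(f"delta={delta:.1f}  P(failed)={prob:.4f}")
```

Taking the mass from a single donor transition keeps each row summing to one; other redistribution rules are possible, and the abstract does not say which one the authors use.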


Introduction

Sheremet

2022

Multigrammatical Framework for Knowledge-Based Digital Economy


Spectral based Methods that Streamline the Search for Failure Scenarios in Large-Scale Distributed Systems

Hunt

Morrison

Dabrowski

2011

Applied Simulation and Modelling

Self Cite


We report our work on the development of analytical and numerical methods that enable the detection of failure scenarios in distributed grid computing, cloud computing and other large-scale systems. The spectral (i.e. eigenvalue and eigenvector) properties of the matrices associated with a non-homogeneous absorbing Markov chain are used to quickly compute the long-time proportion of tasks completed at a given setting of parameters. This enables the discovery of critical ranges of parameter values where system performance deteriorates and fails.
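
To make the notion of a long-time proportion of tasks completed more concrete, the sketch below sweeps one parameter of a small absorbing chain and computes the probability of eventual absorption in a "completed" state. It deliberately swaps in a simpler technique than the one the abstract describes: a direct linear solve of (I - Q) x = r on a time-homogeneous chain, rather than a spectral method on a non-homogeneous chain. The two-transient-state model, the parameter `p_fail`, and all function names are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def completion_probability(q, r_complete):
    """Probability of eventual absorption in 'completed' from each transient state.

    Q holds transient-to-transient probabilities and r_complete holds the
    one-step probabilities of entering the completed state.
    """
    n = len(q)
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)] for i in range(n)]
    return solve(i_minus_q, r_complete)


def chain_for(p_fail):
    """Hypothetical model: transient states 0 = normal, 1 = degraded.

    The degraded state fails with probability p_fail (the swept parameter);
    that mass is taken from its probability of returning to normal.
    """
    q = [[0.70, 0.10],
         [0.40 - p_fail, 0.50]]
    r_complete = [0.20, 0.10]
    return q, r_complete


sweep = []
for k in range(5):
    p_fail = 0.1 * k
    q, r = chain_for(p_fail)
    sweep.append((p_fail, completion_probability(q, r)[0]))

for p_fail, completed in sweep:
    print(f"p_fail={p_fail:.1f}  long-run proportion completed={completed:.4f}")
```

In a sweep like this, a critical range would show up as the interval of parameter values over which the completed proportion drops sharply; the toy model here degrades smoothly and is only meant to show the mechanics.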
