netCSI: A Generic Fault Diagnosis Algorithm for Large-Scale Failures in Computer Networks

Tati, Srikar; Rager, Scott; Ko, Bong Jun; Swami, Ananthram; Porta, Thomas La

doi:10.1109/srds.2011.28

“…In this context, there can be three possible kinds of information: cluster information, object distance information (OD), and no information (NI). We do not describe the conditional failure probability model with no information (CFPM-NI) here because it is straightforward, and details can be found in [21].…”

Section: Conditional Failure Probability Models

mentioning

confidence: 99%

An Analytical Survey on Diagnosis Algorithm in Generic Defect Large-Scale Failures in Computer Networks

Vishwakarma, Ansari

2018

IJARCSSE


Abstract: We present a framework and a set of algorithms for determining faults in networks when large scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes. netCSI consists of two parts: a hypotheses generation algorithm, and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of positive and negative symptoms to improve the precision of the results. In addition, we propose pruning and thresholding along with a dynamic threshold value selector, to reduce the complexity of our algorithm. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and demonstrate an average gain of 128 percent in accuracy for realistic topologies.

Keywords: Fault diagnosis, large-scale network failures, incomplete information, clustered failures.

I. INTRODUCTION

… [7]. Massive outages tend to create faults at multiple components that are geographically close to each other. We call these failures clustered failures. Until now, the prior work in the area of fault diagnosis has focused on independent failures [1], [3], [7]. The performance of these algorithms degrades when applied to clustered failures. In this paper, we propose netCSI, a new algorithm that is designed to effectively identify faulty network components under clustered failures. To show the benefits of our algorithm, we compare it with an existing algorithm that is proposed for independent failures. netCSI determines possible causes of large-scale failures using a knowledge base and end-to-end symptom information. The knowledge base contains information about possible paths between different source-destination pairs and the inferred topology of the network. The end-to-end symptoms reflect end-to-end connectivity or disconnectivity in the network and are observed when a failure occurs. These symptoms include both negative information, such as which source-destination pairs are disconnected, as well as positive information, such as which source-destination pairs can still communicate.

Once a clustered failure occurs, netCSI uses the knowledge base and symptoms to generate a list of possible causes of the outage, called the hypothesis list. Then, a ranking algorithm is applied to the hypothesis list to rate the possible causes.

The main assumption in the existing fault diagnosis algorithms [1], [2], [7] is that complete and accurate information is available at the network manager. However, during large-scale failures, it is very unlikely that complete end-to-end symptom information will be available, because reporting nodes ...
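The role of positive and negative symptoms in building a hypothesis list can be illustrated with a small sketch. This is not the netCSI implementation: it assumes a single known path per source-destination pair, omits pruning, thresholding, and ranking, and the names `candidate_objects`, `generate_hypotheses`, and `max_size` are hypothetical. It only shows the general idea that objects on a working path can be cleared, and that a hypothesis must explain every disconnected pair.

```python
from itertools import combinations


def candidate_objects(paths, negative, positive):
    """Objects on at least one failed path and on no working path."""
    suspects = set().union(*(paths[p] for p in negative))
    cleared = set().union(*(paths[p] for p in positive))
    return suspects - cleared


def generate_hypotheses(paths, negative, positive, max_size=3):
    """Enumerate minimal object sets that explain every negative symptom.

    Assumption: each source-destination pair maps to one set of objects
    (its path), and a pair is disconnected if any object on it has failed.
    """
    candidates = sorted(candidate_objects(paths, negative, positive))
    hypotheses = []
    for k in range(1, max_size + 1):
        for combo in combinations(candidates, k):
            chosen = set(combo)
            # Skip supersets of a smaller hypothesis that already works.
            if any(h <= chosen for h in hypotheses):
                continue
            # Every disconnected pair must contain a failed object.
            if all(paths[p] & chosen for p in negative):
                hypotheses.append(chosen)
    return hypotheses


# Hypothetical knowledge base: pair -> objects on its path.
paths = {
    ("a", "d"): {"r1", "r2"},
    ("b", "d"): {"r2", "r3"},
    ("c", "d"): {"r3", "r4"},
}
negative = [("a", "d"), ("b", "d")]
positive = [("c", "d")]

with_positive = generate_hypotheses(paths, negative, positive)
without_positive = generate_hypotheses(paths, negative, [])
```

In this toy input, the working pair clears `r3`, so the list computed with positive symptoms is shorter than the one computed from negative symptoms alone. The enumeration over combinations also makes the exponential growth of a combinatorial approach easy to see.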


“…However, there is considerable interest on other various aspects of large-scale failures in current literature [11], [12]. Recently, netCSI [13], a combinatorial based algorithm is proposed to diagnose large-scale failures. However, there is a limitation of run-time in large networks.…”

Section: Related Work

mentioning

confidence: 99%

“…During a series of failures that include both independent and clustered, AMC results in a reduced number of false negatives and false positives. [13] is proposed to localize large-scale failures in networks. It is shown that by considering the failure patterns of large-scale outages, this algorithm can achieve higher accuracy than existing algorithms developed for independent failures [10].…”

mentioning

confidence: 99%

“…Due to discrepancy in failure patterns, the performance of fault diagnosis techniques that are focused on independent failures [3], [10] degrades when applied to clustered failures. A fault diagnosis algorithm called netCSI [13] is proposed to localize large-scale failures in networks. It is shown that by considering the failure patterns of large-scale outages, this algorithm can achieve higher accuracy than existing algorithms developed for independent failures [10].…”

mentioning

confidence: 99%

“…As explained in Section I, a high number of false negatives is unacceptable during large-scale failures, since the network manager cannot localize them all or must incur a high cost to diagnose them. netCSI [13], a combinatorial approach based algorithm, was proposed to diagnose large-scale failures. This is a two step algorithm that generates a hypotheses list which has different possible combinations of objects that could have failed during large-scale failures.…”

mentioning

confidence: 99%


Adaptive Algorithms for Diagnosing Large-Scale Failures in Computer Networks

Tati, Ko, Swami et al. 2015

IEEE Trans. Parallel Distrib. Syst.

Self Cite


Abstract: In this paper, we propose an algorithm to efficiently diagnose large-scale clustered failures. The algorithm, Cluster-MAX-COVERAGE (CMC), is based on a greedy approach. We address the challenge of determining faults with incomplete symptoms. CMC makes novel use of both positive and negative symptoms to output a hypothesis list with a low number of false negatives and false positives quickly. CMC requires reports from about half as many nodes as other existing algorithms to determine failures with 100% accuracy. Moreover, CMC accomplishes this gain significantly faster (sometimes by two orders of magnitude) than an algorithm that matches its accuracy. Furthermore, we propose an adaptive algorithm called Adaptive-MAX-COVERAGE (AMC) that performs efficiently during both kinds of failures, i.e., independent and clustered. During a series of failures that include both independent and clustered, AMC results in a reduced number of false negatives and false positives.

Keywords: Fault…

… A fault diagnosis algorithm called netCSI [13] is proposed to localize large-scale failures in networks. It is shown that by considering the failure patterns of large-scale outages, this algorithm can achieve higher accuracy than existing algorithms developed for independent failures [10]. However, the drawback of netCSI is that the run-time complexity of the algorithm increases exponentially with the increase in size of networks since it is a combinatorial approach.

In this paper, we propose a new algorithm called Cluster-MAX-COVERAGE (CMC) that diagnoses large-scale clustered failures. To identify the faulty network elements (i.e., network nodes, routers, and links) CMC utilizes a knowledge base of possible network paths and end-to-end symptom information. The observed end-to-end symptoms during failures include both negative symptoms, such as which source-destination pairs are disconnected, as well as positive symptoms, such as which source-destination pairs can still communicate. This information is reported to the network manager by a few selected nodes in the network called reporting nodes; a complete list of symptoms is not required. Using this information, CMC outputs a hypothesis list which consists of a set of network elements whose failures are consistent with the symptoms.

To solve the issue of run-time complexity, CMC adopts a greedy approach when generating the hypothesis list of faulty network elements, as opposed to the combinatorial approach in netCSI. Our greedy approach is similar to a fault diagnosis algorithm called MAX-COVERAGE (MC) [10], which is developed to diagnose black holes or silent failures (independent failures) in IP networks. During clustered failures, the performance of MC degrades significantly; in particular, it produces a prohibitively high number of false negatives (see Section V-C1). To overcome this limitation, CMC uses clusters of objects instead of single objects when forming the hypothesis list.

The major contributions of CMC include:

• Clustering models: To diagnose large-scale failures, CMC selects clusters of objects greedily based on c...
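The greedy, cluster-based selection described above can be sketched as a maximum-coverage loop. This is an illustration under stated assumptions, not the CMC algorithm itself: the abstract is truncated before it states the selection criterion, so the score used here (number of still-unexplained disconnected pairs a cluster touches) is an assumption, as are the single path per pair, the rule that objects on working paths are cleared, and the name `greedy_cluster_cover`.

```python
def greedy_cluster_cover(paths, negative, positive, clusters):
    """Pick clusters greedily until every disconnected pair is explained.

    paths:    pair -> set of objects on its (single, assumed) path
    clusters: cluster name -> iterable of member objects (hypothetical)
    Returns the chosen cluster names and any pairs left unexplained.
    """
    # Objects on a working path are assumed healthy and are ignored.
    cleared = set().union(*(paths[p] for p in positive))
    usable = {name: set(objs) - cleared for name, objs in clusters.items()}

    unexplained = set(negative)
    chosen = []
    while unexplained:
        # Score each cluster by how many unexplained pairs it touches.
        scores = {
            name: sum(1 for p in unexplained if paths[p] & objs)
            for name, objs in usable.items()
        }
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(scores), key=scores.get, default=None)
        if best is None or scores[best] == 0:
            break  # no remaining cluster explains anything further
        chosen.append(best)
        unexplained = {p for p in unexplained if not (paths[p] & usable[best])}
    return chosen, unexplained


# Hypothetical inputs for illustration only.
paths = {
    ("a", "d"): {"r1", "r2"},
    ("b", "d"): {"r2", "r3"},
    ("c", "e"): {"r5", "r6"},
    ("f", "g"): {"r4"},
}
negative = [("a", "d"), ("b", "d"), ("c", "e")]
positive = [("f", "g")]
clusters = {"east": ["r1", "r2", "r3", "r4"], "west": ["r5", "r6"]}

result = greedy_cluster_cover(paths, negative, positive, clusters)
```

Each iteration scans every cluster once, so the loop runs in time polynomial in the number of clusters and symptoms, in contrast to enumerating all combinations of objects.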


Adaptive algorithms for diagnosing large-scale failures in computer networks

Tati, Ko, Swami et al. 2012

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)

Self Cite




netCSI: A Generic Fault Diagnosis Algorithm for Large-Scale Failures in Computer Networks

Cited by 6 publications

References 11 publications
