Abstract-We present a framework and a set of algorithms for diagnosing faults in networks during large-scale outages. The design of our algorithm, netCSI, is motivated by the fact that failures in such cases are geographically clustered. We address the challenge of diagnosing faults from incomplete symptom information, which arises because only a limited number of nodes report. netCSI consists of two parts: a hypothesis generation algorithm and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of positive and negative symptoms to improve the precision of the results. In addition, we propose pruning and thresholding, along with a dynamic threshold value selector, to reduce the complexity of our algorithm. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of network objects in clustered failures. We evaluate the performance of netCSI on networks with both random and realistic topologies. We compare netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and demonstrate an average gain of 128 percent in accuracy for realistic topologies.
[7]. Massive outages tend to create faults at multiple components that are geographically close to each other. We call these failures clustered failures. Prior work on fault diagnosis has focused on independent failures [1], [3], [7]. The performance of these algorithms degrades when applied to clustered failures. In this paper, we propose netCSI, a new algorithm designed to effectively identify faulty network components under clustered failures. To show the benefits of our algorithm, we compare it with an existing algorithm that was proposed for independent failures. netCSI determines possible causes of large-scale failures using a knowledge base and end-to-end symptom information.
The knowledge base contains information about possible paths between different source-destination pairs and the inferred topology of the network. The end-to-end symptoms reflect end-to-end connectivity or disconnectivity in the network and are observed when a failure occurs. These symptoms include both negative information, such as which source-destination pairs are disconnected, as well as positive information, such as which source-destination pairs can still communicate.
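To illustrate how positive and negative symptoms can jointly narrow the set of possible causes, the following is a minimal sketch and not the netCSI algorithm itself. It assumes, for simplicity, that the knowledge base holds exactly one known path per source-destination pair, that a connected pair clears every component on its path, and that a hypothesis is a minimal set of remaining components that intersects every failed path. The function names, the `max_size` bound, and the toy topology (`r1` to `r6`) are hypothetical and introduced only for this example.

```python
from itertools import combinations


def candidate_components(paths, connected, disconnected):
    """Return components on some failed path and on no working path.

    Assumption: one known path per source-destination pair, so a
    positive symptom (connected pair) clears its whole path.
    """
    cleared = set()
    for pair in connected:
        cleared.update(paths[pair])
    suspects = set()
    for pair in disconnected:
        suspects.update(paths[pair])
    return suspects - cleared


def generate_hypotheses(paths, connected, disconnected, max_size=2):
    """Enumerate minimal candidate sets that explain every failed path.

    max_size is a hypothetical bound on hypothesis size that keeps the
    enumeration small; it is not a parameter taken from the paper.
    """
    candidates = sorted(candidate_components(paths, connected, disconnected))
    failed_paths = [set(paths[pair]) for pair in disconnected]
    hypotheses = []
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            chosen = set(combo)
            # Keep only sets that hit every failed path.
            if not all(chosen & path for path in failed_paths):
                continue
            # Skip supersets of a smaller hypothesis already found.
            if any(h <= chosen for h in hypotheses):
                continue
            hypotheses.append(chosen)
    return hypotheses


# Hypothetical knowledge base: one router-level path per pair.
paths = {
    ("A", "D"): ["r1", "r2", "r3"],
    ("B", "D"): ["r4", "r2", "r3"],
    ("A", "C"): ["r1", "r5"],
    ("B", "E"): ["r4", "r6"],
}
connected = [("A", "C")]                             # positive symptom
disconnected = [("A", "D"), ("B", "D"), ("B", "E")]  # negative symptoms

with_positive = generate_hypotheses(paths, connected, disconnected)
without_positive = generate_hypotheses(paths, [], disconnected)
```

In this toy case the positive symptom for the pair (A, C) clears `r1`, so the hypothesis {`r1`, `r4`} that appears when only negative symptoms are used is excluded, leaving four hypotheses instead of five. A ranking step would then order the remaining hypotheses.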
Keywords-Fault diagnosis, large-scale network failures, incomplete information, clustered failures.
I. INTRODUCTION
Once a clustered failure occurs, netCSI uses the knowledge base and symptoms to generate a list of possible causes of the outage, called the hypothesis list. Then, a ranking algorithm is applied to the hypothesis list to rate the possible causes.
The main assumption in the existing fault diagnosis algorithms [1], [2], [7] is that complete and accurate information is available at the network manager. However, during large-scale failures, it is very unlikely that complete end-to-end symptom information will be available, because reporting nodes ...