Understanding the Limits of Passive Realtime Datacenter Fault Detection and Localization

Roy, Arjun; Das, Rajdeep; Zeng, Hongyi; Bagga, Jasmeet; Snoeren, Alex C.

doi:10.1109/tnet.2019.2938228

Cited by 18 publications

(16 citation statements)

References 26 publications

“…Machine learning techniques cover a range of unsupervised and supervised machine learning to identify and locate faults. Unsupervised learning includes detecting abrupt changes in network switch event counters by learning the normal values of those counters or by using outlier detection on TCP statistics of application flows [22]. Supervised learning techniques include using logistic regression to learn the mapping between network event data and fault classes [23] or using support vector machines, multilayer perceptrons, and random forests to learn the mapping between rate/delay/loss measures and fault classes [24].…”

Section: A Related Work (mentioning)

confidence: 99%

The Network Link Outlier Factor (NLOF) for Fault Localization

Mendoza; McGarry (2020), IEEE Open J. Commun. Soc.

We describe and experimentally evaluate the performance of our Network Link Outlier Factor (NLOF) for locating faults in communication networks. The NLOF is a unique outlier score assigned to each link in a network. It is computed using four distinct stages in a data analytics pipeline. The input to the pipeline are flow records (e.g., NetFlow) and network topology data (e.g., Link Layer Discovery Protocol (LLDP)). In the first stage, flow record throughput values are clustered in two sub-stages: using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and then our novel domain-specific ThroughPut Cluster (TPCluster) technique. In the second stage, flow outlier scores are determined within each cluster using a measure of proximity to a selected performance exemplar. In the third stage, flows are associated with network links using topology data. Finally, in the fourth stage the flow outliers are used to compute the outlier factor or score for each network link. The network link outlier scores are used with a detection rule to locate faults. We present the results of a wide set of Mininet experiments that appraise the fault detection/localization performance of NLOF. We find that NLOF allows for the detection of errors on edge links with a simple detection rule and the detection of errors on core links with a rule that includes topology relationships. NLOF is also compared to an abrupt change detection technique; while both have roughly the same detection power, the precision of NLOF is 42% higher and NLOF required 40% less time to detect failures on average.
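
The last three stages of the pipeline described in this abstract (per-cluster flow scoring, flow-to-link association, per-link scoring) can be illustrated with a rough sketch. This is not the authors' implementation: the clustering stage (DBSCAN plus TPCluster) is assumed to have already assigned a `cluster` label to each flow, the exemplar is assumed to be the highest throughput in the cluster, the link score is assumed to be a plain mean, and the threshold rule is hypothetical. All function and field names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def flow_outlier_scores(flows):
    """Score each flow by its relative shortfall from its cluster's exemplar.

    Assumption: the exemplar is the best throughput seen in the cluster.
    """
    exemplar = defaultdict(float)
    for f in flows:
        exemplar[f["cluster"]] = max(exemplar[f["cluster"]], f["throughput"])
    scores = {}
    for f in flows:
        best = exemplar[f["cluster"]]
        scores[f["id"]] = 0.0 if best == 0 else 1.0 - f["throughput"] / best
    return scores


def link_outlier_scores(flow_scores, paths):
    """Associate flows with the links they traverse and average their scores.

    Assumption: `paths` maps a flow id to its list of links, as could be
    derived from topology data.
    """
    per_link = defaultdict(list)
    for flow_id, links in paths.items():
        for link in links:
            per_link[link].append(flow_scores[flow_id])
    return {link: mean(vals) for link, vals in per_link.items()}


def flag_links(link_scores, threshold=0.5):
    """Hypothetical detection rule: flag links whose score exceeds a threshold."""
    return sorted(link for link, s in link_scores.items() if s > threshold)


# Toy input: two flows in the same throughput cluster sharing one edge link.
flows = [
    {"id": "f1", "cluster": 0, "throughput": 100.0},
    {"id": "f2", "cluster": 0, "throughput": 20.0},
]
paths = {
    "f1": [("h1", "s1"), ("s1", "h2")],
    "f2": [("h1", "s1"), ("s1", "h3")],
}
link_scores = link_outlier_scores(flow_outlier_scores(flows), paths)
suspects = flag_links(link_scores)
```

In this toy input the slow flow `f2` is the only one crossing the link `("s1", "h3")`, so that link receives the highest score, while the shared link `("h1", "s1")` is diluted by the healthy flow `f1`.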

“…We simulated faulty scenarios using fault injection techniques. We injected fault types commonly used in the evaluation of state-of-the-art fault localization approaches [38,36,16,22]: packet loss, memory leak and CPU hog. For each fault we considered different severity growth patterns: (i) linear pattern, the fault is triggered with a same frequency over time, (ii) exponential pattern, the fault is activated with a frequency that increases exponentially, resulting in a shorter time to failure, (iii) random pattern, the fault is activated randomly over time.…”

Section: Investigated Faults (mentioning)

confidence: 99%
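
The three severity growth patterns named in the excerpt above can be read as schedules of fault activation times. The sketch below is one possible interpretation, not the cited paper's injector: the function name, the parameters `base_interval`, `growth`, and `min_interval`, and the choice to model "exponentially increasing frequency" as a geometrically shrinking interval are all assumptions.

```python
import random


def activation_times(pattern, duration, base_interval=10.0,
                     growth=2.0, min_interval=1.0, seed=0):
    """Return the times (in seconds) at which a fault is triggered.

    linear:      constant interval between activations
    exponential: interval shrinks by `growth` after each activation,
                 floored at `min_interval` so the loop terminates
    random:      activations drawn uniformly over the run
    """
    if pattern not in ("linear", "exponential", "random"):
        raise ValueError(f"unknown pattern: {pattern}")

    if pattern == "random":
        rng = random.Random(seed)  # seeded for repeatable experiments
        count = int(duration // base_interval)
        return sorted(rng.uniform(0, duration) for _ in range(count))

    times = []
    t = 0.0
    interval = base_interval
    while True:
        t += interval
        if t > duration:
            break
        times.append(t)
        if pattern == "exponential":
            interval = max(interval / growth, min_interval)
    return times


linear = activation_times("linear", duration=50)
exponential = activation_times("exponential", duration=50)
randomized = activation_times("random", duration=50)
```

With the same run length, the exponential schedule triggers the fault more often than the linear one, which matches the excerpt's remark that it leads to a shorter time to failure.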

Localizing Faults in Cloud Systems

Mariani; Monni; Pezzè; et al. (2018), 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST)

By leveraging large clusters of commodity hardware, the Cloud offers great opportunities to optimize the operative costs of software systems, but impacts significantly on the reliability of software applications. The lack of control of applications over Cloud execution environments largely limits the applicability of state-of-the-art approaches that address reliability issues by relying on heavyweight training with injected faults. In this paper, we propose LOUD, a lightweight fault localization approach that relies on positive training only, and can thus operate within the constraints of Cloud systems. LOUD relies on machine learning and graph theory. It trains machine learning models with correct executions only, and compensates the inaccuracy that derives from training with positive samples, by elaborating the outcome of machine learning techniques with graph theory algorithms. The experimental results reported in this paper confirm that LOUD can localize faults with high precision, by relying only on a lightweight positive training.

“…deTector [8] presents an algorithm to minimize the number of probes sent for detecting and localizing packet losses and latency spikes. [13], [14] and [15] employ passive measurement for network faults localization. [13] presents a classification algorithm that identifies the root cause of failure using TCP statistics collected at one of the endpoints.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…[13] presents a classification algorithm that identifies the root cause of failure using TCP statistics collected at one of the endpoints. The work in [14] looks from the end-host to identify the faulty links and switches, by correlating anomalies in end-host statistics with the network path of the traffic. Vigil [15] tracks the path of TCP connections that display retransmissions through traceroute, and identifies the links with the most retransmissions as the faulty ones.…”

Section: Background and Related Work (mentioning)

confidence: 99%
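
The excerpt's summary of Vigil (trace the path of each connection that shows retransmissions, then blame the links that accumulate the most) can be sketched as a simple tally. This is a deliberate simplification based only on the sentence quoted above, not Vigil's actual scoring scheme; the function name and the input shape (a path plus a retransmission count per flow) are assumptions.

```python
from collections import Counter


def rank_links_by_retransmissions(flows):
    """Rank links by the retransmissions of the flows that traverse them.

    `flows` is an iterable of (path, retransmissions) pairs, where `path`
    is the list of links a traceroute-like probe reported for the flow.
    """
    tally = Counter()
    for path, retransmissions in flows:
        if retransmissions == 0:
            continue  # only connections with retransmissions are traced
        for link in path:
            tally[link] += retransmissions
    return tally.most_common()


# Toy input: two lossy flows share the link "B-C"; one clean flow avoids it.
observed = [
    (["A-B", "B-C"], 3),
    (["A-B", "B-D"], 0),
    (["E-B", "B-C"], 2),
]
ranking = rank_links_by_retransmissions(observed)
```

In the toy input the link `B-C` is common to both lossy flows, so it ends up at the top of the ranking.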

A First Look at Data Center Network Condition Through The Eyes of PTPmesh

Popescu; Moore (2018), 2018 Network Traffic Measurement and Analysis Conference (TMA)

Increased network latency and packets losses can affect substantially application performance. Due to the scale of data centers, custom network monitoring tools have been developed to measure network latency and packet loss. In our previous work, we used the Precision Time Protocol (PTP) to measure one-way delay and to quantify packet loss ratios, and we proposed PTPmesh as a cloud network monitoring tool. In this work, we provide a better understanding on how to exploit the measurement data offered by PTPmesh and present a detailed analysis of PTPmesh measurements collected in ten data centers from three cloud providers. Our findings reveal different latency, latency variance and packet loss characteristics across data centers. Through our analysis, we showcase the strengths and limitations of PTPmesh as a cloud network monitoring tool. To foster further research in this area, we make our dataset available.
