Proceedings of the 2016 ACM SIGCOMM Conference 2016
DOI: 10.1145/2934872.2934884

Taking the Blame Game out of Data Centers Operations with NetPoirot

Abstract: Today, root cause analysis of failures in data centers is mostly done through manual inspection. More often than not, customers blame the network as the culprit. However, other components of the system might have caused these failures. To troubleshoot, huge volumes of data are collected over the entire data center. Correlating such large volumes of diverse data collected from different vantage points is a daunting task even for the most skilled technicians. In this paper, we revisit the question: how much can …

Cited by 67 publications (21 citation statements)
References 29 publications
“…and socket-level logs (the time and number of bytes whenever the socket makes a read/write call) in network stack. Similarly, NetPoirot [4] collects TCP statistics at each machine's hypervisor or within individual VMs to identify root causes of failures. Comparing to EVA, these tools are too intrusive since they must run inside the server.…”
Section: Related Work
mentioning
confidence: 99%
“…Developers sometimes blame "the network" for problems they cannot diagnose; in turn, the network operators blame the developers if the network shows no signs of equipment failure or persistent congestion. As a result, identifying the entity responsible for poor performance is often the most time-consuming and expensive part of failure detection and can take from an hour to days in data centers [4]. Fortunately, once the location of the problem is correctly identified, specialized tools within that component can pinpoint and fix the problem. Existing solutions such as fine-grain packet monitoring or profiling of the end-host network stack almost all work under the assumption that the TCP congestion control is not the one to blame.…”
mentioning
confidence: 99%
“…Related work. Network monitoring for data centers is an active area of research, and several works have appeared since our previous publication [8], [13]-[16]. deTector [8] presents an algorithm to minimize the number of probes sent for detecting and localizing packet losses and latency spikes.…”
Section: Background and Related Work
mentioning
confidence: 99%
“…deTector [8] presents an algorithm to minimize the number of probes sent for detecting and localizing packet losses and latency spikes. [13], [14] and [15] employ passive measurement for network faults localization. [13] presents a classification algorithm that identifies the root cause of failure using TCP statistics collected at one of the endpoints.…”
Section: Background and Related Work
mentioning
confidence: 99%