Taking the Blame Game out of Data Centers Operations with NetPoirot

Arzani, Behnaz; Ciraci, Selim; Loo, Boon Thau; Schuster, Assaf; Outhred, Geoff

doi:10.1145/2934872.2934884

Cited by 67 publications

(21 citation statements)

References 29 publications

“…and socket-level logs (the time and number of bytes whenever the socket makes a read/write call) in network stack. Similarly, NetPoirot [4] collects TCP statistics at each machine's hypervisor or within individual VMs to identify root causes of failures. Comparing to EVA, these tools are too intrusive since they must run inside the server.…”

Section: Related Work

mentioning

confidence: 99%

“…Developers sometimes blame "the network" for problems they cannot diagnose; in turn, the network operators blame the developers if the network shows no signs of equipment failure or persistent congestion. As a result, identifying the entity responsible for poor performance is often the most time-consuming and expensive part of failure detection and can take from an hour to days in data centers [4]. Fortunately, once the location of the problem is correctly identified, specialized tools within that component can pinpoint and fix the problem. Existing solutions such as fine-grain packet monitoring or profiling of the end-host network stack almost all work under the assumption that the TCP congestion control is not the one to blame.…”

mentioning

confidence: 99%

Network Measurement and Performance Analysis at Server Side

2018

Network performance diagnostics is an important topic that has been studied since the Internet was invented. However, it remains a challenging task as the network evolves and becomes more and more complicated over time. One of the main challenges is that all network components (e.g., senders, receivers, and relay nodes) make decisions based only on local information, and they are all likely to be performance bottlenecks. Although Software Defined Networking (SDN) proposes to embrace centralized network intelligence for better control, the cost of collecting complete network state at the packet level is not affordable in terms of collection latency, bandwidth, and processing power. With the emergence of new types of networks (e.g., Internet of Everything, Mission-Critical Control, data-intensive mobile apps, etc.), network demands are getting more diverse. It is critical to provide finer-granularity, real-time diagnostics to serve these demands. In this paper, we present EVA, a network performance analysis tool that guides developers and network operators to fix problems in a timely manner. EVA passively collects packet traces near the server (hypervisor, NIC, or top-of-rack switch), and pinpoints the location of the performance bottleneck (sender, network, or receiver). EVA works without detailed knowledge of the application or network stack and is therefore easy to deploy. We use three types of real-world network datasets and perform trace-driven experiments to demonstrate EVA's accuracy and generality. We also present the problems observed in these datasets by applying EVA.

Future Internet 2018, 10, 67

…today's network applications adopt multi-tier architectures, which consist of a user-facing front-end (e.g., reverse proxy and load balancer) and an IO/CPU-intensive back-end (e.g., database query). Problems with any of these components can affect user-perceived performance.
Developers sometimes blame "the network" for problems they cannot diagnose; in turn, the network operators blame the developers if the network shows no signs of equipment failure or persistent congestion. As a result, identifying the entity responsible for poor performance is often the most time-consuming and expensive part of failure detection and can take from an hour to days in data centers [4]. Fortunately, once the location of the problem is correctly identified, specialized tools within that component can pinpoint and fix the problem. Existing solutions such as fine-grain packet monitoring or profiling of the end-host network stack almost all work under the assumption that the TCP congestion control is not the one to blame. Furthermore, nearly all packet monitoring tools measure the network conditions (congestion, available bandwidth, etc.) by inferring end-hosts' congestion control status. Such an approach may fail for two reasons: First, today's loss-based TCP congestion control, even the current best of breed, Cubic [5], experiences poor performance in some scenarios [3]. Second, one congestion control algorithm may work qu...

“…Related work. Network monitoring for data centers is an active area of research, and several works have appeared since our previous publication [8], [13]-[16]. deTector [8] presents an algorithm to minimize the number of probes sent for detecting and localizing packet losses and latency spikes.…”

Section: Background and Related Work

mentioning

confidence: 99%

“…deTector [8] presents an algorithm to minimize the number of probes sent for detecting and localizing packet losses and latency spikes. [13], [14] and [15] employ passive measurement for network faults localization. [13] presents a classification algorithm that identifies the root cause of failure using TCP statistics collected at one of the endpoints.…”

Section: Background and Related Work

mentioning

confidence: 99%

“…[13], [14] and [15] employ passive measurement for network faults localization. [13] presents a classification algorithm that identifies the root cause of failure using TCP statistics collected at one of the endpoints. The work in [14] looks from the end-host to identify the faulty links and switches, by correlating anomalies in end-host statistics with the network path of the traffic.…”

Section: Background and Related Work

mentioning

confidence: 99%

A First Look at Data Center Network Condition Through The Eyes of PTPmesh

Popescu

Moore

2018

2018 Network Traffic Measurement and Analysis Conference (TMA)

Increased network latency and packet losses can substantially affect application performance. Due to the scale of data centers, custom network monitoring tools have been developed to measure network latency and packet loss. In our previous work, we used the Precision Time Protocol (PTP) to measure one-way delay and to quantify packet loss ratios, and we proposed PTPmesh as a cloud network monitoring tool. In this work, we provide a better understanding of how to exploit the measurement data offered by PTPmesh and present a detailed analysis of PTPmesh measurements collected in ten data centers from three cloud providers. Our findings reveal different latency, latency variance, and packet loss characteristics across data centers. Through our analysis, we showcase the strengths and limitations of PTPmesh as a cloud network monitoring tool. To foster further research in this area, we make our dataset available.