2015
DOI: 10.1145/2829988.2787483

Packet-Level Telemetry in Large Datacenter Networks

Abstract: Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. In this task, datacenter networks (DCNs) present unique challenges with their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets inside a large volume of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, n…

Cited by 116 publications (123 citation statements)
References 25 publications
“…Prior switch-based network measurement systems [18,19,32,23,40,22] are constrained by their raw data, sourced from limited measurement support in existing switches, e.g., NetFlow, match-action rules, and packet mirroring. More recently, Gupta et al [22] propose to partition monitoring queries between switches and a stream processor (e.g., Spark streaming [12]), iteratively refining the set of packets captured through match-action rules in the switch.…”
Section: Related Work (mentioning)
confidence: 99%
“…Delays in identifying and diagnosing network performance problems can severely affect service availability. As a result, both industry and academia have expended considerable effort in network measurement [3,6,11,18,26,32,40].…”
Section: Introduction (mentioning)
confidence: 99%
“…Consequently, the sampled end to end latencies provide coarse-grained metrics, but have low fidelity unless the sampling rates are very high. For example, software bugs or faulty interfaces in switches may randomly produce failures on some packets according to the route in the network or packet header information [31]. Unfortunately, high sampling frequencies have high bandwidth cost, computing cost, and storage overhead.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the previous example, if 95 TCAM rules are impacted, the hit ratio is 0.95. The wide variation of hit ratio values can occur due to (1) switch TCAM overflow; (2) TCAM corruption [11] that causes bit errors on a specific field in a TCAM rule or across TCAM rules; and (3) software bugs [12] that modify object's value wrong at controller or switch agent. While the SCORE algorithm allows change of a threshold value to handle noisy input data, such a static mechanism helps little in solving the problem at hand, confirmed by our evaluation results in §VI.…”
mentioning
confidence: 99%