Proceedings of the 2016 ACM SIGCOMM Conference 2016
DOI: 10.1145/2934872.2934891

Evolve or Die

Abstract: Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution, and complexity. Little, however, is known about the network failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events within Google's network, encompassing many data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are…

citations
Cited by 168 publications
(25 citation statements)
references
References 23 publications
“…Canary Testing (A-B Testing [3,52,59]) Canary testing, well documented by Google's [20,46,59] and Facebook's [52] networking and infrastructure teams, requires running multiple versions of a program alongside each other. Canarying (or A-B Testing) tests new code by sending a subset of traffic through the code (e.g., 1% of traffic) and, if nothing "bad" happens, slowly increases the subset of traffic using the test code until all traffic is using the test code.…”
Section: Rapid Development In Large Network
mentioning
confidence: 99%
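The staged ramp this statement describes can be sketched as follows. This is a minimal illustration, not the procedure used by Google or Facebook: the function name `canary_rollout`, the `is_healthy` check, and every stage fraction other than the quoted 1% are assumptions made for the example.

```python
import random
from typing import Callable, Sequence


def canary_rollout(
    stable: Callable[[int], int],
    candidate: Callable[[int], int],
    is_healthy: Callable[[int, int], bool],
    requests: Sequence[int],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.00),  # assumed ramp; only 1% is from the text
    seed: int = 0,
) -> float:
    """Return the fraction of traffic the candidate ends up serving.

    Each stage sends roughly `fraction` of the requests to the candidate.
    Any unhealthy result aborts the rollout and returns 0.0, meaning all
    traffic goes back to the stable version.
    """
    rng = random.Random(seed)
    reached = 0.0
    for fraction in stages:
        for req in requests:
            if rng.random() < fraction:
                result = candidate(req)
                if not is_healthy(req, result):
                    return 0.0  # something "bad" happened: roll back
            else:
                stable(req)
        reached = fraction  # stage passed, widen the canary
    return reached
```

A real deployment would judge health from aggregate signals such as error rates or latency over a soak period rather than from a single response; the per-request check here only keeps the sketch short.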
“…Despite these rapid development and prototyping cycles, the existing PDP ecosystem lacks appropriate primitives and algorithms to support rapid testing and deployment. At a high level, many testing paradigms [31,52,59], e.g., canary testing used in Google's [20,46] networks, require running new versions of a program alongside stable versions. Traffic is split across all versions and the output is compared.…”
mentioning
confidence: 99%
“…Centrally, a highly available peering edge and high feature velocity are often at odds with each other. Having a highly available system often implies a more slowly-changing system because management operations are often the cause of unavailability [16]. Trying to improve availability beyond the baseline for traditional deployments, while increasing feature velocity, entails substantial architectural care.…”
Section: Background and Requirements
mentioning
confidence: 99%
“…This approach achieves three main objectives: (i) global traffic optimization to improve efficiency, (ii) improved reliability as the local control plane can operate independently of the global controller, and (iii) fast reaction to local network events, for example on peering port or device failure the local controller performs local repair while awaiting globally optimized allocation from the global controller. (2) We support fail static for high availability [16]. The data plane maintains the last known good state so that the control plane may be unavailable for short periods without impacting packet forwarding.…”
Section: Design Principles
mentioning
confidence: 99%
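The fail-static behaviour mentioned in this statement can be illustrated with a small sketch. It assumes a hypothetical `FailStaticDataPlane` class that pulls a routing table from a controller callback and treats `ConnectionError` as "control plane unavailable"; none of these names come from the cited work.

```python
from typing import Callable, Dict, Optional


class FailStaticDataPlane:
    """Keeps forwarding on the last known good state (hypothetical sketch)."""

    def __init__(self) -> None:
        self._last_known_good: Dict[str, str] = {}

    def sync(self, fetch_routes: Callable[[], Dict[str, str]]) -> bool:
        """Try to refresh routes; on controller failure keep the old table."""
        try:
            routes = fetch_routes()
        except ConnectionError:
            return False  # control plane unreachable: fail static
        self._last_known_good = dict(routes)
        return True

    def next_hop(self, prefix: str) -> Optional[str]:
        """Forwarding lookup that never depends on the controller being up."""
        return self._last_known_good.get(prefix)
```

The design point is that a lookup only reads local state, so a short controller outage leaves packet forwarding untouched. How long stale state may safely be used is a policy question the sketch does not address.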