California fault lines

Turner, Daniel; Levchenko, Kirill; Snoeren, Alex C.; Savage, Stefan

doi:10.1145/1851182.1851220

Cited by 107 publications

(5 citation statements)

References 30 publications

“…Fig. 1 shows the distribution of failure impact (i.e., maximum link utilization (MLU) increase) under OSPF 2 and optimal (MCF) routing scheme for three real-world large scale network topologies with more than 100 nodes. It turns out that only 0.19%, 0.03%, and 3.43% failure scenarios on Ion, Interoute, and DialtelecomCz, respectively, under the optimal routing scheme, cause significant impact (i.e., more than 80% of worst-case failure impact) to the network availability.…”

Section: Motivation Of Fern (mentioning)

confidence: 99%

Siamese Graph Attention Networks for robust visual object tracking

Guo

et al. 2023

Computer Vision and Image Understanding

“…As the scale and complexity of modern networks continue to increase rapidly, the occurrence of failures has become a common and frequent event in both wide area networks (WANs) [1], [2], [3], [4], [5], [6] and data center networks (DCNs) [7]. Recently, increasing research efforts have considered tackling the problem from different aspects.…”

Section: Introduction (mentioning)

confidence: 99%

Siamese Graph Attention Networks for robust visual object tracking

Guo

et al. 2023

Computer Vision and Image Understanding

“…In view of that, in Algorithm 7, whenever a valid notification is received, the procedure will also verify whether the notification was received within a reasonably short amount of time of the previously detected local failure (Lines 7-8). According to measurement studies (GOVINDAN et al, 2016; TURNER et al, 2012; GILL; JAIN; NAGAPPAN, 2011; TURNER et al, 2010; MARKOPOULOU et al, 2008), the majority of times, when two links fail simultaneously, they belong to the same shared-risk group. With that observation in mind, whenever a switch is aware of two single-link failures happening in a short period in time, it will transition to a tactic that handles the shared-risk group to which the two links belong by looking up table SF-TACTICS (L.9-11).…”

Section: Switch and Other Shared-risk Multi-link Failures (mentioning)

confidence: 99%

“…Consequently, computing alternative forwarding entries for all possible failure scenarios would both take an impractically long time and require prohibitive amounts of memory. Despite that, network measurement studies (GOVINDAN et al, 2016; TURNER et al, 2012; GILL; JAIN; NAGAPPAN, 2011; TURNER et al, 2010; MARKOPOULOU et al, 2008) show that although failures happen frequently (a few minutes apart), it is very unusual for two elements (e.g., links, switches, optical-fiber cable) to fail at the "same time", unless they belong to the same shared-risk group. For example, two links connected to the same switch are perceived as failed whenever the switch itself fails.…”

Section: Many Failure Scenarios (mentioning)

confidence: 99%

Advancing Network Monitoring and Operation with In-band Network Telemetry and Data Plane Programmability

Marques

Gaspary

2023

NOMS 2023-2023 IEEE/IFIP Network Operations and Management Symposium

Modern communication networks operate under high expectations on performance and resilience (e.g., latency, bandwidth, availability) mainly due to the continuous proliferation of non-elastic highly-distributed applications. In this context, closely monitoring the state, behavior, and performance of networking devices and their traffic as well as quickly troubleshooting problems as they arise is essential for the operation of network infrastructures. Unfortunately, existing tools and techniques fall short at providing the required level of detail, enabling quick reactions, and keeping monitoring overhead from affecting the network operation. Data Plane Programmability (DPP) along with In-band Network Telemetry (INT), backed by the recent advances in Software-Defined Networking, emerge in this context as promising platforms to meet these monitoring demands. INT enables unprecedented monitoring accuracy and precision, but may lead to performance degradation if applied indiscriminately to all packet flows in a network. One alternative to avoid this issue is to orchestrate telemetry tasks and use only a portion of traffic to monitor the network via INT. The general problem consists, then, in assigning subsets of traffic to carry out INT and provide full monitoring coverage while minimizing the overhead. To achieve this goal, as a first step in this thesis, we introduce and formalize the In-band Network Telemetry Orchestration (INTO) problem, prove that it is NP-Complete, and propose polynomial computing time heuristics to solve it. In our evaluation using real wide-area network topologies, we observe that the heuristics produce solutions close to optimal for any network in under one second. We also observe that networks can be covered assigning a linear number of flows in relation to the number of device interfaces and, finally, that it is possible to minimize telemetry load to one interface per flow for most networks.
Continuing our work, we investigate DPP capabilities further and design INTSIGHT, a system for highly accurate and fine-grained detection and diagnosis of SLO violations. The main contribution of INTSIGHT is, building upon in-band telemetry, introducing path-wise computation of network metrics and selective generation of reports. We show the effectiveness of INTSIGHT by way of two use cases. Our evaluation using real networks also shows that INTSIGHT generates up to two orders of magnitude less monitoring traffic than state-of-the-art approaches. Furthermore, its processing and memory requirements are low and therefore compatible with currently existing programmable platforms. As a final step in this thesis, we shift our focus to quick reaction and propose FELIX, a system for failure recovery that reroutes around failures at data-plane timescales while still using the shortest available paths. Our evaluation shows that our approach can recover from failures up to four orders of magnitude faster than existing SDN approaches while making sensible use of data-plane resources. Finally, with the design of FELIX...

“…There already exists interesting literature on the empirical characteristics of failures, e.g., in datacenters [12], [31], statewide networks [29], or IP backbones [16]. This literature is highly valuable for the comparison of existing networks, but does not directly solve the problem of comparing network designs that are not yet implemented.…”

Section: Related Work (mentioning)

confidence: 99%

The Hazard Value: A Quantitative Network Connectivity Measure Accounting for Failures

Cuijpers

Schmid

Schnepf

et al. 2022

2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

To meet their stringent requirements in terms of performance and dependability, communication networks should be "well connected". While classic connectivity measures typically revolve around topological properties, e.g., related to cuts, these measures may not reflect well the degree to which a network is actually dependable. We introduce a more refined measure for network connectivity, the hazard value, which is developed to meet the needs of a real network operator. It accounts for crucial aspects affecting the dependability experienced in practice, including actual traffic patterns, distribution of failure probabilities, routing constraints, and alternatives for services with preferences therein. We analytically show that the hazard value fulfills several fundamental desirable properties that make it suitable for comparing different network topologies with one another, and for reasoning about how to efficiently enhance the robustness of a given network. We also present an optimised algorithm to compute the hazard value and an experimental evaluation against networks from the Internet Topology Zoo and classical datacenter topologies, such as fat trees and BCubes. This evaluation shows that the algorithm computes the hazard value within minutes for realistic networks, making it practically usable for network designers.
