Self-stabilizing Reconfiguration

Georgiou, Chryssis; Marcoullis, Ioannis; Schiller, Elad Michael

doi:10.1007/978-3-319-59647-1_5

Cited by 7 publications

(12 citation statements)

References 24 publications


“…The early solutions [3,15] model node failures as crashes and restrict the number f of failing servers (nodes) to be less than half of the nodes in the system. We follow a similar approach but require that in the presence of transient faults, and only then, a crashed node either restarts (we call this a detectable restart) or is removed from the system via a reconfiguration service [8]. Moreover, as specified in [10], our restriction on the number of crashes f is similar to the one of CAS [4].…”

Section: Benign Failures (mentioning)

confidence: 99%

“…Self-Stabilization in the Presence of Seldom Fairness. Dolev et al. [8] proposed the following refinement of Dijkstra's design criteria of self-stabilization, which we believe to be convenient for dealing with the asynchronous nature of distributed systems. In the absence of transient faults, the environment is assumed to be asynchronous.…”

Section: Self-stabilization (mentioning)

confidence: 99%


Self-stabilization Overhead: A Case Study on Coded Atomic Storage

et al. 2019

Self Cite


Shared memory emulation on distributed message-passing systems has attracted much attention over the past three decades. It can be used as a fault-tolerant and highly available distributed storage solution or as a low-level synchronization primitive. Examples of its uses can be found in cloud computing and cloud storage. Attiya, Bar-Noy, and Dolev were the first to propose a single-writer, multi-reader linearizable register emulation where the register is replicated to all servers. Many works followed, considering solutions for the multi-writer, multi-reader setting as well as support for dynamic server participation. Recently, Cadambe et al. proposed the Coded Atomic Storage (CAS) algorithm, which uses erasure coding to achieve data redundancy at much lower communication cost than previous algorithmic solutions.

Although CAS can tolerate server crashes, it was not designed to recover from unexpected, transient faults without external (human) intervention. In this respect, Dolev, Petig, and Schiller have recently developed a self-stabilizing version of CAS, which we call CASSS. As one would expect, self-stabilization does not come as a free lunch; it introduces, mainly, communication overhead for detecting inconsistencies and stale information. So, one would wonder whether the overhead introduced by self-stabilization would nullify the gain of erasure coding.

To answer this question, we have implemented and experimentally evaluated the CASSS algorithm on PlanetLab, a planetary-scale distributed infrastructure. The evaluation shows that our implementation of CASSS scales very well in terms of the number of servers, the number of concurrent clients, and the size of the replicated object. More importantly, it shows that (a) CASSS has only a constant overhead compared to the traditional CAS algorithm (which we also implement) and (b) the recovery period (after the last occurrence of a transient fault) is as fast as a



“…The advantage here is two folded: (i) systems that can reconfigure the set P are more durable since they can replace failing nodes with new ones, and (ii) they allow us to relax the assumption that failing node eventually restart (Section 2). As an alternative approach for implementing the self-stabilizing procedure for global reset, we propose to base the reset procedure on a self-stabilizing consensus algorithm, e.g., [9], and quorum reconfiguration [16]. Note that the system settings of [9,16] assume the availability of failure detector mechanisms, and the relevant liveness conditions for implementing these mechanisms.…”

Section: Bounded Variations On Algorithms 3 And (mentioning)

confidence: 99%

“…As an alternative approach for implementing the self-stabilizing procedure for global reset, we propose to base the reset procedure on a self-stabilizing consensus algorithm, e.g., [9], and quorum reconfiguration [16]. Note that the system settings of [9,16] assume the availability of failure detector mechanisms, and the relevant liveness conditions for implementing these mechanisms. Moreover, quorum reconfiguration requires the use of state transfer procedure after every reconfiguration.…”

Section: Bounded Variations On Algorithms 3 And (mentioning)

confidence: 99%

Self-Stabilizing Snapshot Objects for Asynchronous Failure-Prone Networked Systems

Georgiou, Lundström, Schiller, 2019

Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing

Self Cite


A snapshot object simulates the behavior of an array of single-writer/multi-reader shared registers that can be read atomically. Delporte-Gallet et al. proposed two fault-tolerant algorithms for snapshot objects in asynchronous crash-prone message-passing systems. Their first algorithm is non-blocking; it allows snapshot operations to terminate once all write operations have ceased. It uses O(n) messages of O(n · ν) bits, where n is the number of nodes and ν is the number of bits it takes to represent the object. Their second algorithm allows snapshot operations to always terminate independently of write operations. It incurs O(n²) messages.

The fault model of Delporte-Gallet et al. considers node failures (crashes). We aim at the design of even more robust snapshot objects. We do so through the lenses of self-stabilization, a very strong notion of fault-tolerance. In addition to Delporte-Gallet et al.'s fault model, a self-stabilizing algorithm can recover after the occurrence of transient faults; these faults represent arbitrary violations of the assumptions according to which the system was designed to operate (as long as the code stays intact).

In particular, in this work, we propose self-stabilizing variations of Delporte-Gallet et al.'s non-blocking algorithm and always-terminating algorithm. Our algorithms have similar communication costs to the ones by Delporte-Gallet et al. and O(1) recovery time (in terms of asynchronous cycles) from transient faults. The main differences are that our proposal considers repeated gossiping of O(ν)-bit messages and deals with bounded space (which is a prerequisite for self-stabilization). Lastly, we explain how to extend the proposed solutions to reconfigurable ones.


“…Part (1). We note that p_i modifies replyDB_i only in line 12 and line 16 in the do-forever loop (lines 11-24), and in lines 26 and 27 in the query reply procedure (lines 25-27). In line 12 and line 16, the size of replyDB_i either decreases (possible only at the first step that p_i executes line 12 or line 16) or stays the same.…”

Section: Lemma 2 (Bounded Controller Memory) (mentioning)

confidence: 99%

Renaissance: A Self-Stabilizing Distributed SDN Control Plane

Canini, Salem, Schiff, et al. 2018

2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS)


By introducing programmability, automated verification, and innovative debugging tools, Software-Defined Networks (SDNs) are poised to meet the increasingly stringent dependability requirements of today's communication networks. However, the design of fault-tolerant SDNs remains an open challenge.

This paper considers the design of dependable SDNs through the lenses of self-stabilization, a very strong notion of fault-tolerance. In particular, we develop algorithms for an in-band and distributed control plane for SDNs, called Renaissance, which tolerate a wide range of (concurrent) controller, link, and communication failures. Our self-stabilizing algorithms ensure that after the occurrence of an arbitrary combination of failures, (i) every non-faulty SDN controller can reach any switch (or another controller) in the network within a bounded communication delay (in the presence of a bounded number of concurrent failures) and (ii) every switch is managed by at least one controller (as long as at least one controller is not faulty).

We evaluate Renaissance through a rigorous worst-case analysis as well as a prototype implementation (based on OVS and Floodlight), and we report on our experiments using Mininet.

arXiv:1712.07697v2 [cs.NI], 26 Feb 2019

Introduction

Context and Motivation. Software-Defined Network (SDN) technologies have emerged as a promising alternative to the vendor-specific, complex, and hence error-prone, operation of traditional communication networks. In particular, by outsourcing and consolidating the control over the data plane elements to a logically centralized software, SDNs support a programmatic verification and enable new debugging tools. Furthermore, the decoupling of the control plane from the data plane allows the former to evolve independently of the constraints of the latter, enabling faster innovations.

However, while the literature articulates well the benefits of the separation between control and data plane and the need for distributing the control plane (e.g., for performance and fault-tolerance), the question of how connectivity between these two planes is maintained (i.e., the communication channels from controllers to switches and between controllers) has not received much attention. Providing such connectivity is critical for ensuring the availability and robustness of SDNs.

Guaranteeing that each switch is managed, at any time, by at least one controller is challenging, especially if control is in-band, i.e., if control and data traffic are forwarded along the same links and devices and hence arrive at the same ports. In-band control is desirable as it avoids the need to build, operate, and ensure the reliability of a separate out-of-band management network. Moreover, in-band management can in principle improve the resiliency of a network, by leveraging a higher path diversity (beyond connectivity to the management port).

The goal of this paper is the design of a highly fault-tolerant distributed and in-band control plane for SDNs. In particular, we aim to develop a self-stabilizi...

