Vector clock algorithms are basic wait-free building blocks that facilitate causal ordering of events. As wait-free algorithms, they are guaranteed to complete their operations within a finite number of steps. Stabilizing algorithms allow the system to recover after the occurrence of transient faults, such as soft errors and arbitrary violations of the assumptions according to which the system was designed to behave. We present the first, to the best of our knowledge, stabilizing vector clock algorithm for asynchronous crash-prone message-passing systems that can recover in a wait-free manner after the occurrence of transient faults. In these settings, it is challenging to demonstrate a finite and wait-free recovery from (communication and crash failures as well as) transient faults, bound the message and storage sizes, deal with the removal of all stale information without blocking, and deal with counter overflow events (which occur at different network nodes concurrently).

We present an algorithm that never violates safety in the absence of transient faults and provides bounded time recovery during fair executions that follow the last transient fault. The novelty is that in the absence of execution fairness, the algorithm guarantees a bound on the number of times in which the system might violate safety (while existing algorithms might block forever due to the presence of both transient faults and crash failures).

Since vector clocks facilitate a number of elementary synchronization building blocks (without requiring remote replica synchronization) in asynchronous systems, we believe that our analytical insights are useful for the design of other systems that cannot guarantee execution fairness.
Context and Motivation. Vector clocks allow reasoning about causality among events in distributed systems, for example, when constructing distributed snapshots [17]. Shapiro et al. [24] showed that vector clocks are building blocks of several conflict-free replicated data types (CRDTs). CRDTs are distributed data structures that can be shared among many replicas in asynchronous networks. All replica updates occur independently and achieve strong eventual consistency without using mechanisms for synchronization [25] or roll-back.

The industrial use of CRDTs includes globally distributed databases, such as those of Redis, Riak, Bet365, SoundCloud, TomTom, Phoenix, and Facebook. Some of these databases serve around ten million concurrent users, handle ten thousand messages per second, store large volumes of data, and offer very low latency. However, while both the literature and the users demonstrate that large-scale decentralized systems can benefit from the use of CRDTs in general and vector clocks in particular, the relationship between fault-tolerance and strong eventual consistency has not received sufficient attention. Providing higher degrees of robustness to CRDTs is nevertheless imperative for ensuring the availability and safety of these systems.

Providing robustness in the presence of unexpected failures, i.e., the ones that...