Wait-free implementations of shared objects tolerate the failure of processes, but not the failure of base objects from which they are implemented. We consider the problem of implementing shared objects that tolerate the failure of both processes and base objects.
We identify two classes of object failures: responsive and nonresponsive. With responsive failures, a faulty object responds to every operation, but its responses may be incorrect. With nonresponsive failures, a faulty object may also “hang” without responding. In each class, we define crash, omission, and arbitrary modes of failure.
We show that all responsive failure modes can be tolerated. More precisely, for all responsive failure modes ℱ, object types T, and t ≥ 0, we show how to implement a shared object of type T which is t-tolerant for ℱ. Such an object remains correct and wait-free even if up to t base objects fail according to ℱ. In contrast to responsive failures, we show that even the most benign nonresponsive failure mode cannot be tolerated. We also show that randomization can be used to circumvent this impossibility result.
Graceful degradation is a desirable property of fault-tolerant implementations: the implemented object never fails more severely than the base objects it is derived from, even if all the base objects fail. For several failure modes, we show whether this property can be achieved, and, if so, how.
Over the past decade, a pair of synchronization instructions known as LL/SC has emerged as the most suitable set of instructions to be used in the design of lock-free algorithms. However, no existing multiprocessor system supports these instructions in hardware. Instead, most modern multiprocessors support instructions such as CAS or RLL/RSC (e.g. POWER4, MIPS, SPARC, IA-64). This paper presents two efficient algorithms that implement 64-bit LL/SC from 64-bit CAS or RLL/RSC. Our results are summarized as follows. We present a practical algorithm for implementing a 64-bit LL/SC object from 64-bit CAS or RLL/RSC objects. Our result shows, for the first time, a practical way of simulating a 64-bit LL/SC memory word using 64-bit CAS memory words (or 64-bit RLL/RSC memory words), incurring only a small constant space overhead per process and a small constant factor slowdown. Although our first solution performs correctly in any practical system, its theoretical correctness depends on unbounded sequence numbers. We present a bounded algorithm that implements a 64-bit LL/SC object from 64-bit CAS or RLL/RSC objects, and has the same time and space complexities as the first algorithm. This and the previous algorithm improve on existing implementations of LL/SC objects by
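To illustrate why sequence numbers help when simulating LL/SC with CAS, the following is a minimal sketch and not the algorithm of the paper. It assumes the value and an unbounded sequence number can be stored together in a single CAS word, which is exactly what a full 64-bit value does not allow; the paper's algorithms address that harder case. The names `CasCell`, `LLSC`, and the explicit `pid` argument are hypothetical, and the atomic CAS is simulated with a lock.

```python
import threading


class CasCell:
    """One memory word with an atomic compare-and-swap (simulated with a lock)."""

    def __init__(self, initial):
        self._word = initial
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._word

    def cas(self, expected, new):
        # Atomically replace the word with `new` only if it equals `expected`.
        with self._lock:
            if self._word == expected:
                self._word = new
                return True
            return False


class LLSC:
    """LL/SC object built on one CAS word holding (value, sequence number).

    Assumption: value and sequence number fit in one word together.
    """

    def __init__(self, initial):
        self._cell = CasCell((initial, 0))
        self._links = {}  # per-process snapshot taken by the latest LL

    def ll(self, pid):
        # Load-linked: remember the whole word, return only the value.
        snapshot = self._cell.read()
        self._links[pid] = snapshot
        return snapshot[0]

    def sc(self, pid, new_value):
        # Store-conditional: succeeds only if no successful SC has occurred
        # since this process's latest LL. Every successful SC increments the
        # sequence number, so a value that changes and changes back (ABA)
        # still makes the CAS fail.
        snapshot = self._links.get(pid)
        if snapshot is None:
            return False
        _, seq = snapshot
        return self._cell.cas(snapshot, (new_value, seq + 1))
```

In this sketch, if process 0 performs LL, and process 1 then changes the value from "A" to "B" and back to "A", the SC of process 0 fails, whereas a plain CAS on the value alone would succeed.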
Several basic problems that arise in fault-tolerant distributed computing were shown to have a weakest failure detector. We show here that every problem that is solvable with a failure detector has a weakest failure detector.