The ability to adapt dynamically to changing runtime conditions is an important concern when designing distributed systems in which the negotiated quality of service (QoS) cannot always be delivered between processes. Providing fault tolerance in such dynamic environments is a challenging task. In this context, this paper proposes an adaptive programming model for fault-tolerant distributed computing, which provides upper-layer applications with process state information according to the current system synchrony (or QoS). The underlying system model is hybrid, composed of a synchronous part (where there are time bounds on processing speed and message delay) and an asynchronous part (where there is no time bound). This composition can vary over time and, in particular, the system may become totally asynchronous (e.g., when the underlying system QoS degrades) or totally synchronous. Moreover, processes are not required to share the same view of the system synchrony at a given time. To illustrate what can be done in this programming model and how to use it, the consensus problem is taken as a benchmark problem. This paper also presents an implementation of the model that relies on a negotiated QoS for communication channels.
In this paper we show that it is possible to implement a perfect failure detector P (one that suspects a process if and only if that process has failed) in a non-synchronous distributed system. To that end, we introduce the partitioned synchronous system (Spa), which is weaker than the conventional synchronous system. Given properties that must hold in Spa, such as strong partitioned synchrony, and a trivially implementable timeliness oracle, we show how to implement P in Spa. Moreover, we show that even if strong partitioned synchrony does not hold, we can still take advantage of the existing synchronous partitions to improve the robustness of applications, by introducing a partially perfect failure detector named xP. We also discuss how applications can benefit from these failure detectors and present some related experimental data. The properties and algorithms needed to implement P and xP are presented in the paper, along with the related correctness proofs.
In this paper we show that a perfect failure detector P (one that suspects all processes that have failed, if and only if they have failed) can be implemented in a system weaker than the synchronous distributed system, contrary to an established belief. To that end, we introduce the partitioned synchronous system (Spa), which is strictly weaker than the synchronous system: in Spa it is not always possible to implement global synchronous actions such as internal clock synchronization. Using a property we define as strong partitioned synchrony, we show how to implement P in Spa. Better still, we show that even if strong partitioned synchrony cannot be guaranteed, we can still take advantage of the existing synchronous partitions to improve the robustness of fault-tolerant applications, through a partially perfect failure detector that we name xP. The properties and algorithms needed to implement P and xP are introduced in the paper, along with the related correctness proofs.
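The abstracts do not give the algorithms themselves, so the following is only a minimal Python sketch of the general idea they rely on: where time bounds on message delay hold, a missed heartbeat deadline is reliable evidence of a crash. It is not the papers' P or xP construction. The class `TimeoutDetector`, its parameters `timely`, `period` and `delay_bound`, and the heartbeat scheme are all assumptions made for illustration. Processes outside the assumed timely set are never suspected, so no false suspicion is raised, at the cost of not detecting their crashes.

```python
class TimeoutDetector:
    """Hypothetical heartbeat detector for processes behind timely channels."""

    def __init__(self, timely, period, delay_bound, start=0.0):
        # Assumption: each process in `timely` sends a heartbeat every
        # `period` time units (first one at or before `start`), and each
        # heartbeat arrives within `delay_bound` of being sent.
        # Two consecutive arrivals are then at most period + delay_bound apart.
        self.deadline = period + delay_bound
        self.last_seen = {pid: start for pid in timely}

    def heartbeat(self, pid, now):
        # Record the arrival time; senders outside the timely set are ignored
        # because no bound is assumed for them.
        if pid in self.last_seen:
            self.last_seen[pid] = max(self.last_seen[pid], now)

    def suspects(self, now):
        # A monitored process is suspected only once its deadline has passed,
        # which under the stated assumptions can only happen if it crashed.
        return {pid for pid, seen in self.last_seen.items()
                if now - seen > self.deadline}


fd = TimeoutDetector(timely={"p1", "p2"}, period=1.0, delay_bound=0.5)
fd.heartbeat("p1", now=1.2)
fd.heartbeat("p2", now=1.1)
fd.heartbeat("p3", now=2.0)   # p3 is not in the timely set, so it is ignored
fd.heartbeat("p1", now=2.3)
early = fd.suspects(now=2.5)  # p2 is 1.4 late, still within the 1.5 deadline
late = fd.suspects(now=3.0)   # p2 is 1.9 late, past the deadline
print(early, late)
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to check by hand.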