Abstract-The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness is made up of three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay, bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within bounded time after their recovery, and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, in order to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm HeartbeatComplete is presented and it is proven that this algorithm achieves bounded correctness in fully-connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, known as Algorithm ForwardHeartbeat, is also presented. ForwardHeartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency.
We describe a novel approach for building a secure and fault tolerant data storage service in collaborative work environments, which uses perfect secret sharing schemes to store data. Perfect secret sharing schemes have found little use in managing generic data because of the high computation overheads incurred by such schemes. Our proposed approach uses a novel combination of XOR secret sharing and replication mechanisms, which drastically reduce the computation overheads and achieve speeds comparable to standard encryption schemes. The combination of secret sharing and replication manifests itself as an architectural framework, which has the attractive property that its dimension can be varied to exploit tradeoffs amongst different performance metrics. We evaluate the properties and performance of the proposed framework and show that the combination of perfect secret sharing and replication can be used to build efficient fault-tolerant and secure distributed data storage systems.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.