Abstract. Improving the dependability of computer systems is a critical task. In this context, the paper surveys techniques for achieving fault tolerance in distributed systems through replication. The main replication techniques are explained first. Group communication is then introduced as the communication infrastructure that allows the different replication techniques to be implemented. Finally, the difficulty of implementing group communication is discussed, and the most important algorithms are presented.
Introduction
Computer systems grow more complex every day. As a consequence, the probability of problems in these systems increases over the years. To keep this from becoming a major issue, researchers have worked for many years on improving the dependability of these systems. The methods involved are traditionally classified as fault prevention, fault tolerance, fault removal and fault forecasting [23]. Fault prevention refers to methods for preventing the occurrence or the introduction of faults in the system. Fault tolerance refers to methods that allow the system to provide a service complying with its specification in spite of faults. Fault removal refers to methods for reducing the number and the severity of faults. Fault forecasting refers to methods for estimating the presence of faults (with the goal of locating and removing them). We concentrate here on fault tolerance.
Several techniques for achieving fault tolerance have been developed over the years. Which technique is suitable depends on the characteristics of the application. For example, a centralized application differs from a distributed application involving several computing systems. We consider here distributed applications. Fault tolerance for distributed applications can be achieved with different techniques: transactions, checkpointing and replication.
Transactions were introduced many years ago in the context of database systems [3]. A transaction groups a sequence of operations while ensuring certain properties of these operations, called the ACID properties [3]: Atomicity, Consistency, Isolation and Durability. Atomicity requires that either all
Almost the same paper appears under the title Group Communication: from practice to theory in