Byzantine state-machine replication (SMR) ensures the consistency of replicated state in the presence of malicious replicas and lies at the heart of modern blockchain technology. Byzantine SMR protocols often guarantee safety under all circumstances and liveness only under synchrony. However, guaranteeing liveness even under this assumption is nontrivial. So far we have lacked systematic ways of incorporating liveness mechanisms into Byzantine SMR protocols, which has often led to subtle bugs. To close this gap, we introduce a modular framework to facilitate the design of provably live and efficient Byzantine SMR protocols. Our framework relies on a view abstraction generated by a special SMR synchronizer primitive to drive the agreement on command ordering. We present a simple formal specification of an SMR synchronizer and its bounded-space implementation under partial synchrony. We also apply our specification to prove liveness and analyze the latency of three Byzantine SMR protocols via a uniform methodology. In particular, one of these results yields what we believe is the first rigorous liveness proof for the algorithmic core of the seminal PBFT protocol.
INTRODUCTION
Byzantine state-machine replication (SMR) [56] ensures the consistency of replicated state even when some of the replicas are malicious. It lies at the heart of modern blockchain technology and is closely related to the classical Byzantine consensus problem. Unfortunately, no deterministic protocol can guarantee both safety and liveness of Byzantine SMR when the network is asynchronous [37]. A common way to circumvent this while maintaining determinism is to guarantee safety under all circumstances and liveness only under synchrony. This is formalized by the partial synchrony model [29,36], which stipulates that after some unknown Global Stabilization Time (GST) the system becomes synchronous, with message delays bounded by an unknown constant 𝛿 and process clocks tracking real time. Before GST, however, messages can be lost or delayed, and clocks at different processes can drift apart.
Historically, researchers have paid more attention to the safety of Byzantine SMR protocols than to their liveness. For example, while the seminal PBFT protocol came with a detailed safety proof [26, §A], the nontrivial mechanisms ensuring its liveness were only given a brief informal justification [28, §4.5.1], which did not cover their most critical properties. However, ensuring liveness under partial synchrony is far from trivial, as illustrated by the many liveness bugs found in existing protocols [2,4,11,15,25]. In particular, classical failure and leader detectors [29,38] are of little help: while they have been widely used under benign failures [42,43,52], their implementations under Byzantine failures are either impractical [47] or detect only restricted failure types [33,34,44,51]. As an alternative, a textbook by Cachin et al. [22] proposed a leader detector-like abstraction that accepts hints from the application to identify potentially faulty processes....