Reconfigurable Atomic Transaction Commit

Bravo, Manuel; Gotsman, Alexey

doi:10.1145/3293611.3331590

Cited by 5 publications

(3 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…However, Centiman ensures faulttolerance differently from our protocol in Sect. 4, by largely outsourcing it to an external storage service. We believe that our method of proving correctness of transaction certification can be applied to systems like Centiman, by establishing a simulation relation that maps steps of the protocol to the corresponding steps of multi-shot 2PC.…”

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

Multi-shot distributed transaction commit

Chockler

Gotsman

2021

Distrib. Comput.

Self Cite

View full text Add to dashboard Cite

Atomic Commit Problem (ACP) is a single-shot agreement problem similar to consensus, meant to model the properties of transaction commit protocols in fault-prone distributed systems. We argue that ACP is too restrictive to capture the complexities of modern transactional data stores, where commit protocols are integrated with concurrency control, and their executions for different transactions are interdependent. As an alternative, we introduce Transaction Certification Service (TCS), a new formal problem that captures safety guarantees of multi-shot transaction commit protocols with integrated concurrency control. TCS is parameterized by a certification function that can be instantiated to support common isolation levels, such as serializability and snapshot isolation. We then derive a provably correct crash-resilient protocol for implementing TCS through successive refinement. Our protocol achieves a better time complexity than mainstream approaches that layer two-phase commit on top of Paxos-style replication.

show abstract

Section: Related Workmentioning

confidence: 99%

“…However, recovery in FaRM is designed differently than in our protocol, relying on an external reconfiguration engine. Since the conference publication of this work, the approach we propose has also been applied to the class of systems similar to FaRM [4].…”

Section: Related Workmentioning

confidence: 99%

Multi-shot distributed transaction commit

Chockler

Gotsman

2021

Distrib. Comput.

Self Cite

View full text Add to dashboard Cite

show abstract

“…Some academic efforts study approaches to make transaction commits more performant and less heavy, e.g. [56], [57]. The Structured Streaming community has acknowledged the need for exactly-once processing semantics in its Kafka sink.…”

Section: ) Fault Tolerancementioning

confidence: 99%

A Performance Analysis of Fault Recovery in Stream Processing Frameworks

Dongen

Poel

2021

IEEE Access

View full text Add to dashboard Cite

Distributed stream processing frameworks have gained widespread adoption in the last decade because they abstract away the complexity of parallel processing. One of their key features is built-in fault tolerance. In this work, we dive deeper into the implementation, performance, and efficiency of this critical feature for four state-of-the-art frameworks. We include the established Spark Streaming and Flink frameworks and the more novel Spark Structured Streaming and Kafka Streams frameworks. We test the behavior under different types of faults and settings: master failure with and without high-availability setups, driver failures for Spark frameworks, worker failure with or without exactly-once semantics, application and task failures. We highlight differences in behavior during these failures on several aspects, e.g., whether there is an outage, downtime, recovery time, data loss, duplicate processing, accuracy, and the cost and behavior of different message delivery guarantees. Our results highlight the impact of framework design on the speed of fault recovery and explain how different use cases may benefit from different approaches. Due to their task-based scheduling approach, the Spark frameworks can recover quickly and in most cases without necessitating an application restart. Kafka Streams has the shortest downtime, while Flink can offer end-to-end exactly-once semantics at a low cost.

show abstract

Reconfigurable Atomic Transaction Commit

Bravo

Gotsman

2019

Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing

View full text Add to dashboard Cite

Modern data stores achieve scalability by partitioning data into shards and fault-tolerance by replicating each shard across several servers. A key component of such systems is a Transaction Certification Service (TCS), which atomically commits a transaction spanning multiple shards. Existing TCS protocols require 2f + 1 crash-stop replicas per shard to tolerate f failures. In this paper we present atomic commit protocols that require only f + 1 replicas and reconfigure the system upon failures using an external reconfiguration service. We furthermore rigorously prove that these protocols correctly implement a recently proposed TCS specification. We present protocols in two different models-the standard asynchronous message-passing model and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latter's CPU. Our protocols are inspired by a recent FARM system for RDMA-based transaction processing. Our work codifies the core ideas of FARM as distributed TCS protocols, rigorously proves them correct and highlights the trade-offs required by the use of RDMA.

show abstract

Reconfigurable Atomic Transaction Commit

Cited by 5 publications

References 26 publications

Multi-shot distributed transaction commit

Multi-shot distributed transaction commit

A Performance Analysis of Fault Recovery in Stream Processing Frameworks

Reconfigurable Atomic Transaction Commit

Contact Info

Product

Resources

About