Checkpoint-based fault-tolerant (FT) methods have been widely used to enhance the reliability of stream processing systems, but checkpointing usually introduces considerable overhead. Choosing the optimal checkpoint interval (OCI) that maximizes processing efficiency is therefore a critical issue. Traditional OCI models assume that the recovery time equals the execution time from the last checkpoint to the moment of failure. For stream processing jobs, however, the recovery time is determined by the reprocessing workload, which depends on the real-time input data that arrived before the failure. A new model is therefore needed to choose the OCI for stream processing applications. Moreover, the input data rate of a stream processing job fluctuates over time. To solve these problems, we present a novel DSPS OCI (DOCI) model in this paper. We prove that it maximizes the processing efficiency for a given time. We also propose an approach that dynamically adjusts the OCI of an application to accommodate workload fluctuations. We conduct simulation experiments to verify the effectiveness of our DOCI model and the efficiency of the online OCI adjustment algorithm. Experimental results with a real-world dataset show that DOCI improves system efficiency by up to 32% compared with existing FT approaches.
KEYWORDS
distributed stream processing, fault tolerance, optimal checkpoint interval
INTRODUCTION
As users of "big data" applications expect fresh results, the ability to process large volumes of fast data streams in a timely fashion has become increasingly important for distributed stream processing engines (DSPEs). 1 As DSPE frameworks, eg, Apache S4, 2 Storm, 3 Spark Streaming, 4 and Flink, 5 scale to large clusters, stream processing applications become more vulnerable to system failures, and the fault-tolerant (FT) issue is attracting more attention.
Existing stream processing systems manage their reliability through either active replication 6 or passive replication. 5,7 Active replication offers seamless failure recovery by switching among active backup tasks, but it at least doubles resource consumption, ie, it reduces efficiency by at least 50%. In contrast, passive replication can improve system efficiency significantly 8 and has been widely used in DSPEs in recent years. It reduces resource overhead by periodically checkpointing task processing states and saves communication overhead by combining checkpointing with upstream backup. 9
For DSPEs with passive replication, the optimal checkpoint interval (OCI) is the key to ensuring high efficiency of stream processing applications. There is a trade-off between FT overhead and the runtime cost of failure recovery: infrequent checkpointing lowers the overhead at the expense of longer failure recovery, whereas frequent checkpointing has a prohibitive impact on normal stream processing.
Concurrency Computat Pract Exper. 2019;31:e5347. wileyonlinelibrary.com/journal/cpe
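The trade-off between checkpointing overhead and failure recovery cost described in the introduction can be illustrated with a minimal sketch of a traditional, first-order OCI calculation. This is not the DOCI model of this paper. It assumes a fixed checkpoint cost, failures arriving at a constant mean time between failures (MTBF), and, as traditional models do, a recovery time equal to the execution time since the last checkpoint (half an interval on average). Under these assumptions the overhead is minimized at the classic square-root interval sqrt(2 × checkpoint cost × MTBF). The function names and the numeric values below are hypothetical and chosen only for illustration.

```python
import math


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order fraction of time lost to checkpointing plus rework.

    Assumption (traditional model, not DOCI): after a failure, the work
    to redo equals the time elapsed since the last checkpoint, which is
    half an interval on average.
    """
    checkpoint_overhead = checkpoint_cost_s / interval_s
    expected_rework = interval_s / (2.0 * mtbf_s)
    return checkpoint_overhead + expected_rework


def first_order_oci(checkpoint_cost_s, mtbf_s):
    """Interval that minimizes overhead_fraction (square-root rule)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical values: 5 s per checkpoint, one failure per hour.
    cost_s, mtbf_s = 5.0, 3600.0
    best = first_order_oci(cost_s, mtbf_s)
    for label, interval in (("too frequent", best / 4),
                            ("first-order OCI", best),
                            ("too infrequent", best * 4)):
        loss = overhead_fraction(interval, cost_s, mtbf_s)
        print(f"{label:>16}: interval={interval:8.1f} s, overhead={loss:.3%}")
```

Checkpointing four times more often or four times less often than the first-order interval yields a higher overhead in both directions, which is the trade-off in question. For stream processing, the rework term would instead depend on the volume of input that must be reprocessed, which is why a fixed interval derived this way may not be adequate when the input rate fluctuates.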