Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy

Tang, Kun; Tiwari, Devesh; Gupta, Saurabh; Huang, Ping; Lü, Qi; Engelmann, Christian; He, Xubin

doi:10.1109/dsn.2016.36

Cited by 16 publications

(2 citation statements)

References 26 publications


“…We compare DOCI with a static OCI model, where the optimum checkpoint interval is modeled as

m^{opt} = \sqrt{2 \, t_{c} \, \mathrm{MTTF}}.

It has been a useful approximation model of the optimum checkpoint interval. In addition, to validate our algorithm of determining dynamic OCI adjustment periods, we compare the proposed DOCI‐OSA algorithm (based on the bottom‐up (BU) strategy) with our existing gradient‐based DOCI adjustment algorithm (a top‐down (TD) strategy) given by our previous work, which are referred as BU DOCI and TD DOCI in the following section, respectively.…”

Section: Simulation Results (mentioning)

confidence: 99%

An optimal checkpointing model with online OCI adjustment for stream processing applications

Zhuang, Wei, Li, et al. 2019

Concurrency and Computation

Self Cite


Checkpoint-based fault-tolerant (FT) methods have been widely used to enhance the reliability of stream processing systems, but checkpointing usually introduces considerable overhead. Choosing the optimal checkpoint interval (OCI) that maximizes processing efficiency is therefore a critical issue. Traditional OCI models assume that the recovery time equals the execution time from the last checkpoint to the moment of failure. For stream processing jobs, however, the recovery time is related to the reprocessing workload, which depends on the real-time input data before a failure. A new model is needed to choose the OCI for stream processing applications. Moreover, the input data rate of a stream processing job fluctuates over time. To solve these problems, we present a novel DSPS OCI (DOCI) model in this paper. We prove that it maximizes the processing efficiency for a given time. We propose an approach to dynamically adjust the OCI for an application to accommodate workload fluctuations. We conduct simulation experiments to verify the effectiveness of our DOCI model and the efficiency of the online OCI adjustment algorithm. Experimental results with a real-world dataset show that DOCI improves system efficiency by up to 32% compared with existing FT approaches.

KEYWORDS: distributed stream processing, fault tolerance, optimal checkpoint interval

INTRODUCTION

As users of "big data" applications expect fresh results, the ability to process large volumes of fast data streams in a timely fashion has become increasingly important for distributed stream processing engines (DSPEs) [1]. As DSPE frameworks such as Apache S4 [2], Storm [3], Spark Streaming [4], and Flink [5] scale to large clusters, stream processing applications become more vulnerable to system failures, and fault tolerance (FT) is attracting more attention.

Existing stream processing systems manage their reliability through either active replication [6] or passive replication [5,7]. Active replication offers seamless failure recovery by switching among active backup tasks, but it at least doubles resource consumption, i.e., a 50% reduction in efficiency. In contrast, passive replication can improve system efficiency significantly [8], and it has been widely used in DSPEs in recent years. It reduces resource overhead by periodically checkpointing task processing states and saves communication overhead by combining with upstream backup [9].

For DSPEs with passive replication, the optimal checkpoint interval (OCI) is the key to ensuring high efficiency of stream processing applications. There is a trade-off between FT overhead and the runtime cost of failure recovery: infrequent checkpointing risks longer failure recovery, while frequent checkpointing imposes a prohibitive overhead on normal stream processing.

Concurrency Computat Pract Exper. 2019;31:e5347. wileyonlinelibrary.com/journal/cpe
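The static OCI model quoted in the citation statement above, m^{opt} = \sqrt{2 t_c MTTF}, and the trade-off between checkpoint overhead and recovery cost can be illustrated with a minimal Python sketch. It assumes that t_c denotes the time to write one checkpoint and that the overhead follows a simple first-order model (checkpoint cost t_c/m plus an expected rework of m/(2·MTTF)); this interpretation, the function names, and the numeric values are illustrative assumptions, not taken from the cited paper.

```python
import math


def static_oci(t_c: float, mttf: float) -> float:
    """Static optimum checkpoint interval: m_opt = sqrt(2 * t_c * MTTF).

    t_c  -- assumed to be the time to write one checkpoint (seconds)
    mttf -- mean time to failure (seconds)
    """
    if t_c <= 0 or mttf <= 0:
        raise ValueError("t_c and mttf must be positive")
    return math.sqrt(2.0 * t_c * mttf)


def expected_overhead(m: float, t_c: float, mttf: float) -> float:
    """First-order overhead fraction for checkpoint interval m (assumed model).

    t_c / m        -- share of time spent writing checkpoints
    m / (2 * mttf) -- expected rework, assuming a failure lands on average
                      halfway through an interval
    """
    return t_c / m + m / (2.0 * mttf)


# Hypothetical values chosen only for illustration.
t_c = 50.0       # seconds per checkpoint
mttf = 10_000.0  # seconds between failures

m_opt = static_oci(t_c, mttf)
print(f"static OCI: {m_opt:.1f} s")
for m in (0.5 * m_opt, m_opt, 2.0 * m_opt):
    print(f"interval {m:7.1f} s -> overhead {expected_overhead(m, t_c, mttf):.4f}")
```

Under this assumed model, intervals shorter than the static OCI pay more for checkpointing and longer ones pay more for rework, which is the trade-off the introduction describes. The DOCI model itself accounts for fluctuating input rates and is not reproduced here.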



“…If knowledge about failures can be predicted via heuristics, one can deploy an autonomous system that can schedule resources and actions in order to maximize the overall system performance, as argued in [2]. For example, failure recovery measures such as checkpoint saving can be used to reduce the cost from system failures [3] [4].…”

Section: Introduction (mentioning)

confidence: 99%

Spatiotemporal Real-Time Anomaly Detection for Supercomputing Systems

Kang, Agrawal, Choudhary, et al. 2019

2019 IEEE International Conference on Big Data (Big Data)


The demands of increasingly large scientific application workflows lead to the need for more powerful supercomputers. As the scale of supercomputing systems has grown, failure prediction has become an increasingly critical area of fault-tolerance research, since predicting system failures can improve performance by saving checkpoints in advance. We propose a real-time failure detection algorithm that adopts an event-based prediction model. The prediction model is a convolutional neural network that utilizes both traditional event attributes and additional spatio-temporal features. We present a case study using our proposed method with six years of reliability, availability, and serviceability event logs recorded by Mira, a Blue Gene/Q supercomputer at Argonne National Laboratory. The case study shows that our failure prediction model is not limited to predicting the occurrence of failures in general: it can accurately detect specific types of critical failures, such as coolant and power problems, within reasonable lead time ranges. The proposed method achieves an F1 score of 0.56 for general failures, 0.97 for coolant failures, and 0.86 for power failures.
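For readers unfamiliar with the metric reported in this abstract, the F1 score is the harmonic mean of precision and recall. The short Python sketch below shows the standard definition; the function names and the confusion counts are hypothetical and are not the paper's data.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 computed from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return f1_score(precision, recall)


# Hypothetical counts for one failure class, for illustration only.
print(f1_from_counts(tp=8, fp=2, fn=2))
```

Because it is a harmonic mean, F1 is pulled toward the lower of precision and recall, so a predictor scores highly only when it both rarely raises false alarms and rarely misses real failures.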
