Stream Data Load Prediction for Resource Scaling Using Online Support Vector Regression

Hu, Zhigang; Kang, Hui; Zheng, Mingming

doi:10.3390/a12020037

Cited by 12 publications

(8 citation statements)

References 22 publications


“…3) TWRES: We employ a second dynamic scaling baseline inspired from recent related work. Precisely, we use the resource scaling algorithm (TWRES) proposed in [10] for Spark streaming jobs. Similar to Phoebe, this algorithm requires profiling data, and scales a data processing application under consideration of workload forecasts, a performance model for the maximum processing capacity of individual scaleouts, and formulated latency constraints.…”

Section: B Phoebe Setup (mentioning)

confidence: 99%


Phoebe: QoS-Aware Distributed Stream Processing through Anticipating Dynamic Workloads

Geldenhuys, Scheinert, Kao, et al. 2022

2022 IEEE International Conference on Web Services (ICWS)


Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to realtime event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exactly-once processing guarantees are required for correctness. Stream processing workloads are oftentimes dynamic in nature which makes static configurations highly inefficient as time goes by. Static resource allocations will almost certainly either negatively impact upon the Quality of Service and/or result in higher operational costs. In this paper we present Phoebe, a proactive approach to system auto-tuning for Distributed Stream Processing jobs executing on dynamic workloads. Our approach makes use of parallel profiling runs, QoS modeling, and runtime optimization to provide a general solution whereby configuration parameters are automatically tuned to ensure a stable service as well as alignment with recovery time Quality of Service targets. Phoebe makes use of Time Series Forecasting to gain an insight into future workload requirements thereby delivering scaling decisions which are accurate, long-lived, and reliable. Our experiments demonstrate that Phoebe is able to deliver a stable service while at the same time reducing resource over-provisioning.



“…1) Top Speed Windowing (TSW) Experiment: For the first experiment, a DSP job was selected from the official Flink repository 10 . It was modified so that sources consumed events from and sinks published results to separate Apache Kafka topics.…”

Section: Streaming Jobs (mentioning)

confidence: 99%

Phoebe: QoS-Aware Distributed Stream Processing through Anticipating Dynamic Workloads

Geldenhuys, Scheinert, Kao, et al. 2022

2022 IEEE International Conference on Web Services (ICWS)


“…6) Avazu: This dataset is created by using a click-through rate prediction dataset from Kaggle 9 , aggregating the clicks per hour over time, and linearly interpolating between the aggregated values to obtain different sampling rates.…”

Section: Time Series Datasets (mentioning)

confidence: 99%
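The preprocessing described in the Avazu statement above (hourly aggregation followed by linear interpolation to a finer sampling rate) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the citing authors' code. The function names `clicks_per_hour` and `upsample_linear` are hypothetical, and treating hours without any clicks as zero is an assumption the quoted text does not state.

```python
from collections import Counter
from datetime import datetime, timedelta


def clicks_per_hour(timestamps):
    """Aggregate click timestamps into a gap-free hourly count series."""
    # Truncate every timestamp to the start of its hour and count occurrences.
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    hour, last = min(counts), max(counts)
    series = []
    while hour <= last:
        # Assumption: an hour with no recorded clicks counts as 0.
        series.append((hour, counts.get(hour, 0)))
        hour += timedelta(hours=1)
    return series


def upsample_linear(series, steps_per_hour):
    """Linearly interpolate between consecutive hourly values."""
    out = []
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        for k in range(steps_per_hour):
            frac = k / steps_per_hour
            out.append((t0 + (t1 - t0) * frac, v0 + (v1 - v0) * frac))
    out.append(series[-1])  # keep the final hourly sample
    return out


if __name__ == "__main__":
    clicks = [
        datetime(2014, 10, 21, 10, 5),
        datetime(2014, 10, 21, 10, 30),
        datetime(2014, 10, 21, 11, 15),
        datetime(2014, 10, 21, 13, 0),
    ]
    for when, value in upsample_linear(clicks_per_hour(clicks), 2):
        print(when.isoformat(), value)
```

Changing `steps_per_hour` yields the same hourly series at different sampling rates, which is the property the statement attributes to the dataset.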

“…However, none has been directly compared under our defined requirements for performing TSF in DSP systems, such as minimal configuration and limited model inputs. In the context of DSP, TSF methods have been used in diverse forms and for varying reasons [9], [35]-[38]. While previous works successfully apply a selected method to a concrete problem, to the best of our knowledge, there is no related work that compares multiple TSF methods for DSP.…”

Section: Related Work (mentioning)

confidence: 99%

“…Prominent applications within this domain include: the automated dynamic scaling of resources which aims to reduce over- and under-provisioning, minimizing operating costs and preventing possible reductions in the service quality [6]-[9]; the live migration of functionality/state across the network where cluster metrics are collected, processed, and scanned for anomalies which might reveal performance degradations and/or signs of component failure [10]-[12]; and the automatic system tuning where dynamic runtime adjustment of system configurations are performed in order to improve overall system availability and reliability [13]-[15]. The majority of these approaches rely on coarse-grained metrics to reactively make remediation decisions.…”

Section: Introduction (mentioning)

confidence: 99%


Evaluation of Load Prediction Techniques for Distributed Stream Processing

Gontarska, Geldenhuys, Scheinert, et al. 2021

Preprint


Distributed Stream Processing (DSP) systems enable processing large streams of continuous data to produce results in near to real time. They are an essential part of many data-intensive applications and analytics platforms. The rate at which events arrive at DSP systems can vary considerably over time, which may be due to trends, cyclic, and seasonal patterns within the data streams. A priori knowledge of incoming workloads enables proactive approaches to resource management and optimization tasks such as dynamic scaling, live migration of resources, and the tuning of configuration parameters during run-times, thus leading to a potentially better Quality of Service. In this paper we conduct a comprehensive evaluation of different load prediction techniques for DSP jobs. We identify three use-cases and formulate requirements for making load predictions specific to DSP jobs. Automatically optimized classical and Deep Learning methods are being evaluated on nine different datasets from typical DSP domains, i.e. the IoT, Web 2.0, and cluster monitoring. We compare model performance with respect to overall accuracy and training duration. Our results show that the Deep Learning methods provide the most accurate load predictions for the majority of the evaluated datasets.
