Benchmarking Distributed Stream Data Processing Systems

Karimov, Jeyhun; Rabl, Tilmann; Katsifodimos, Asterios; Samarev, Roman; Heiskanen, Henri; Markl, Volker

doi:10.1109/icde.2018.00169

Cited by 162 publications

(134 citation statements)

References 20 publications

Supporting

Mentioning

133

Contrasting

“…For example, the number of deployed nodes alone can have a different impact on each of the considered frameworks. In our case, we kept a rather simple nodes topology, but for more complex topologies the results can be different (as previous research has shown, e.g., [37]). Another internal threat to validity is given by the implementation differences of the algorithm on each platform.…”

Section: Results (mentioning)

confidence: 98%

Big Data Platform for Smart Grids Power Consumption Anomaly Detection

Lipcak, Macák, Rossi

2019

Annals of Computer Science and Information Systems

Big data processing in the Smart Grid context has many large-scale applications that require real-time data analysis (e.g., intrusion and data injection attacks detection, electric device health monitoring). In this paper, we present a big data platform for anomaly detection of power consumption data. The platform is based on an ingestion layer with data densification options, Apache Flink as part of the speed layer and HDFS/KairosDB as data storage layers. We showcase the application of the platform to a scenario of power consumption anomaly detection, benchmarking different alternative frameworks used at the speed layer level (Flink, Storm, Spark).

“…Karimov et al [37] found Flink to have more than three times faster throughput than Spark and Storm for aggregations. Joins were more than two times faster for Flink than Spark.…”

Section: A Compared Framework (mentioning)

confidence: 99%

Big Data Platform for Smart Grids Power Consumption Anomaly Detection

Lipcak, Macák, Rossi

2019

Annals of Computer Science and Information Systems

“…Following table-1 shows the comparative analysis of distinct tools and techniques to handle the issues of latency and throughput. [4] Buffering mechanism DSP Engine [6] Queue of Events before processing Google Data Flow [5] Watermark & Trigger…”

Section: B Event Time Window (mentioning)

confidence: 99%

Experimental Analysis on Processing of Unbounded Data

Bhatt, Thakkar

2019

IJITEE

Processing of unordered and unbounded data is the prime requirement of the current businesses. Large amount of rapidly generated data demands the processing of the same without the storage and as per the timestamp associated with it. It is difficult to process these unbounded data with batch engine as the existing batch systems suffer from the delay intrinsic by accumulating entire incoming records in a group prior to process it. However windowing can be useful when dealing with unbounded data which pieces up a dataset into fixed chunks for processing with repeated runs of batch engine. Contrast to batch processing, stream handling system aims to process information that is gathered in a little timeframe. In this way, stream data processing ought to be coordinated with the flow of data. In the real world the event time is always skewed with the processing time which introduce issues of delay and completeness in incoming stream of data. In this paper, we presented the analysis on the watermark and trigger approach which can be used to manage these unconventional desires in the processing of unbounded data.

“…Several solutions are available to handle this problem [4]. Distributed computing is one possible solution [5], and become the most efficient and fault-tolerant method for companies to store and process massive amounts of data. Among this new group of tools, MapReduce and Spark are the most commonly used cluster computing tools.…”

Section: Introduction (mentioning)

confidence: 99%

A Comprehensive Performance Analysis of Apache Hadoop and Apache Spark for Large Scale Data Sets Using HiBench

Ahmed, Barczak, Sušnjak, et al.

2020

Preprint

In recent times Big Data analytics has got tremendous attention and it involves storing, processing, and analysing large scale datasets. The advent of distributed computing frameworks such as Hadoop and Spark offers an efficient solution to analyse vast amounts of data. Due to the availability of an application programming interface (API) and its performance, Spark become very popular, even more popular than the MapReduce framework. Both these frameworks have more than 150 parameters and the combination of these parameters have a huge impact on cluster performance. The system default parameters help the system administrator to deploy their system applications without much effort, and they can measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance? In this regard, our study investigates the most impacting parameters such as input splits and shuffling, in order to compare the performance between Hadoop and Spark, using a specific cluster implemented in our department. We used a trial-and-error approach for tuning these parameters based on a large number of experiments. In order to evaluate the frameworks comparison and analysis, we select two workloads: WordCount and TeraSort. The performance metrics are carried out based on three criteria, namely execution time, throughput, and speedup. Our experimental results revealed that both system performances heavily depends on input data size and correct parameter selection. The analysis results found, unsurprisingly, that Spark has better performance as compared to Hadoop, achieving up to 2 times speedup in WordCount workload and up to 14 times in TeraSort workloads when default parameters are replaced. Finally, we conclude that the system performance depends on different parameters configuration alternatives, and they depend on the data size.
