Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud

Henning, Sören; Hasselbring, Wilhelm

doi:10.1016/j.jss.2023.111879

Cited by 5 publications

(8 citation statements)

References 35 publications

“…Stream processing frameworks perform operations such as filterings, transformations, or aggregations in near-real time on continuous streams of data [19]. State-of-the-art frameworks are designed for high throughput and low-latency processing, while also scaling with massive amounts of data [9,16]. To address these requirements, they run in a distributed fashion on commodity hardware.…”

Section: Distributed Stream Processing (mentioning)

confidence: 99%

“…Table 1 provided an overview of these benchmarks and a comparison with ShuffleBench. For a systematic and comprehensive review of the literature on stream processing benchmarking, we refer to our recent studies [16,37].…”

Section: Benchmarking Stream Processing Framework (mentioning)

confidence: 99%

“…Sustainable throughput [21] is defined as the maximum load a system can sustain without violating performance goals. Such a performance goal can be a limit on the event latency [21,34] or the maximum tolerable increase in the number of queued messages [13,16,34]. In ShuffleBench, sustainable throughput is measured by running multiple independent experiments, in which the generated load is increased from experiment to experiment and performance goals are evaluated [12].…”

Section: Throughput (mentioning)

confidence: 99%
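
The measurement procedure quoted above (independent experiments at increasing load, each checked against a performance goal) can be illustrated with a minimal sketch. It is not the ShuffleBench implementation: the function names, the `run_experiment` callback, and the use of a least-squares slope of the queued-message count as the performance goal are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical callback type: runs one independent experiment at the given
# load and returns (timestamps in seconds, queued-message counts).
Experiment = Callable[[float], Tuple[Sequence[float], Sequence[float]]]


def lag_slope(timestamps: Sequence[float], queued: Sequence[float]) -> float:
    """Least-squares slope of queued messages over time (messages per second)."""
    n = len(timestamps)
    if n < 2 or n != len(queued):
        raise ValueError("need at least two paired samples")
    mean_t = sum(timestamps) / n
    mean_q = sum(queued) / n
    cov = sum((t - mean_t) * (q - mean_q) for t, q in zip(timestamps, queued))
    var = sum((t - mean_t) ** 2 for t in timestamps)
    if var == 0:
        raise ValueError("timestamps must not all be equal")
    return cov / var


def sustainable_throughput(
    loads: Sequence[float],
    run_experiment: Experiment,
    max_slope: float,
) -> Optional[float]:
    """Return the highest tested load whose queue growth stays within
    max_slope, or None if no tested load meets the goal."""
    best = None
    for load in sorted(loads):
        timestamps, queued = run_experiment(load)
        # Assumed performance goal: the queue must not grow faster than max_slope.
        if lag_slope(timestamps, queued) <= max_slope:
            best = load
    return best
```

In this sketch the result is limited to the tested load levels, so the resolution of the estimate depends on how finely the loads are spaced.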

“…For the ad-hoc throughput experiments, we generate 1 million records per second and monitor how many are processed per second by the frameworks. For the sustainable throughput experiments, we generate records with different frequencies and determine the maximum frequency at which the number of queued records in the Kafka input topic does not substantially increase over time (the performance goal, see our previous work for a detailed explanation of this method [16]). In the following, we first discuss the results of Flink, Kafka Streams, and Hazelcast as they allow for a straightforward interpretation, followed by a more detailed discussion of the results for Spark.…”

Section: Baseline Evaluation Of Throughput (mentioning)

confidence: 99%

“…We illustrate this use case inspired by requirements of a large cloud observability platform, where potentially thousands or millions of stateful black-box software components have to receive and process selected data records. The literature is currently missing a well-defined evaluation method for this use case and previous work found that the performance of stream processing frameworks highly depends on the use case [16].…”

Section: Introduction (mentioning)

confidence: 99%

ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

Henning, Vogel, Leichtfried et al. 2024

Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering

Self Cite

Distributed stream processing frameworks help building scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is considered as black-box software components. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform and takes up benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench comes as a ready-to-use open-source software utilizing existing Kubernetes tooling and providing implementations for four state-of-the-art frameworks. Therefore, we expect ShuffleBench to be a valuable contribution to both industrial practitioners building stream processing applications and researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
