2020 IEEE Third International Conference on Data Stream Mining & Processing (DSMP)
DOI: 10.1109/dsmp47368.2020.9204304

Spark Structured Streaming: Customizing Kafka Stream Processing

Cited by 9 publications (4 citation statements)
References 8 publications
“…A. Saraswathi et al. [32] also used Kafka and Spark to predict road traffic in real time. Y. Drohobytskiy et al. [33] developed a real-time multi-party data exchange that uses Apache Spark to obtain data from Apache Kafka, process it, and store it in HDFS.…”
Section: Data Process
Citation type: mentioning (confidence: 99%)
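The flow described in [33] maps directly onto the Structured Streaming API. Below is a minimal PySpark sketch of such a Kafka-to-HDFS pipeline; the broker address, topic name, and HDFS paths are illustrative assumptions (they are not given in the excerpt), and the spark-sql-kafka connector is assumed to be on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a Kafka topic as an unbounded streaming DataFrame.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
    .option("subscribe", "exchange-topic")             # assumed topic
    .load()
)

# Kafka delivers key/value as binary; cast the payload before processing.
events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")

# Persist micro-batches to HDFS as Parquet files.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/exchange")
    .option("checkpointLocation", "hdfs:///checkpoints/exchange")
    .start()
)
query.awaitTermination()

The checkpointLocation option is what gives the file sink restart safety and exactly-once output; Spark refuses to start a file-sink query without it.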
“…Drohobytskiy et al. [5] demonstrate customizing Kafka stream processing using Spark Structured Streaming. They show conditional monitoring procedures for processing irregular data streams efficiently.…”
Section: Literature Survey
Citation type: mentioning (confidence: 99%)
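The excerpt does not reproduce the paper's conditional monitoring procedure in detail. One hedged reading, sketched below in PySpark, gates per-micro-batch work on whether data actually arrived, so idle stretches of an irregular stream trigger no downstream writes; the topic, paths, and filtering predicate are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("conditional-monitoring").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sensor-topic")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
)

def process_batch(batch_df, batch_id):
    # Conditional step: skip empty micro-batches, since an irregular
    # stream often delivers nothing for long stretches.
    if batch_df.rdd.isEmpty():
        return
    # Keep only non-empty payloads and append them to storage.
    batch_df.filter(F.length("value") > 0) \
        .write.mode("append").parquet("hdfs:///data/monitored")

(
    stream.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "hdfs:///checkpoints/monitoring")
    .start()
    .awaitTermination()
)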
“…Apache Kafka is a distributed streaming platform designed for high-throughput, fault-tolerant, and scalable data streaming [10]. Kafka is widely used for building real-time data pipelines and streaming applications, such as log aggregation, event-driven architectures, and stream processing [22].…”
Section: Apache Kafka
Citation type: mentioning (confidence: 99%)
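For readers unfamiliar with Kafka's publish/subscribe model, a minimal producer/consumer round trip illustrates the log-aggregation use case mentioned above. The kafka-python client, broker address, and topic name are assumptions for illustration; the cited works do not prescribe a client library.

from kafka import KafkaProducer, KafkaConsumer

# Publish one log record to an assumed topic on an assumed local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("app-logs", b'{"level": "INFO", "msg": "service started"}')
producer.flush()

# Read the topic back from the beginning.
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value)
    break  # illustrative: stop after one message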
“…Additionally, the function configures the Spark object with the necessary packages and settings, such as the Kafka and PostgreSQL connectors, to allow seamless integration with these external components. It also ensures that the streaming process can be stopped gracefully when required [10]. Upon successful creation, the Spark object is returned for use in the data ingestion pipeline.…”
Section: B. Consuming Data From Kafka
Citation type: mentioning (confidence: 99%)
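A sketch of what such a session-builder function might look like in PySpark follows. The helper name, package coordinates, and configuration values are assumptions, not taken from the cited paper; spark.jars.packages is the standard Spark setting for fetching connector JARs at launch.

from pyspark.sql import SparkSession

def create_spark_session():
    # Hypothetical helper mirroring the function described above.
    return (
        SparkSession.builder
        .appName("kafka-postgres-ingestion")
        # Pull the Kafka source and the PostgreSQL JDBC driver at launch;
        # versions here are assumptions.
        .config(
            "spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,"
            "org.postgresql:postgresql:42.7.3",
        )
        # This flag applies to the legacy DStreams API; Structured
        # Streaming queries are usually stopped explicitly instead.
        .config("spark.streaming.stopGracefullyOnShutdown", "true")
        .getOrCreate()
    )

spark = create_spark_session()

In Structured Streaming the usual graceful-shutdown pattern is to call query.stop() from a shutdown hook rather than rely on a single configuration flag, which is consistent with the excerpt's note that the streaming process "can be stopped gracefully when required".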