Big data analytics on Apache Spark

Salloum, Salman; Dautov, Ruslan; Chen, Xiaojun; Peng, Patrick Xiaogang; Huang, Joshua Zhexue

doi:10.1007/s41060-016-0027-9

Cited by 327 publications

(152 citation statements)

References 58 publications

Supporting

Mentioning

148

Contrasting

Unclassified


“…In other words, new batches are created from input DStreams depending on the batch interval length, and those discrete streams are stored in memory as RDD sequences. The RDDs are then executed by generating Spark jobs. Figure shows the architectural overview of the Spark Streaming scheme.…”

Section: Real‐time Sentiment Prediction Framework (mentioning)

confidence: 99%

A spark‐based big data analysis framework for real‐time sentiment prediction on streaming data

Kılınç

2019

Softw Pract Exp


Summary: Many data sources produce large volumes of data. The nature of Big Data requires new distributed processing approaches to extract valuable information. Real‐time sentiment analysis is one of the most demanding research areas and requires powerful Big Data analytics tools such as Spark. Prior literature surveys have shown that, although there are many conventional sentiment analysis studies, only a few works realize sentiment analysis in real time. One major factor that affects the quality of real‐time sentiment analysis is confidence in the generated data. More precisely, it is a valuable research question to determine whether the owner who generates the sentiment is genuine. Since data generated by fake personalities may decrease the accuracy of the outcome, an intelligent service that can identify the source of the data is a key part of the analysis. In this context, we include a fake account detection service in the proposed framework. Both the sentiment analysis and fake account detection systems are trained and tested using the Naïve Bayes model from Apache Spark's machine learning library. The developed system consists of four integrated software components: (i) a machine learning and streaming service for sentiment prediction, (ii) a Twitter streaming service to retrieve tweets, (iii) a Twitter fake account detection service to assess the owner of the retrieved tweet, and (iv) a real‐time reporting and dashboard component to visualize the results of sentiment analysis. The sentiment classification performances of the system for offline and real‐time modes are 86.77% and 80.93%, respectively.
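The micro-batching described in the quoted excerpt for this entry (input grouped by batch interval, each batch then handed to a job) can be illustrated with a minimal plain-Python sketch. It does not use the Spark Streaming API; the names `discretize` and `run_job`, the sample events, and the word-count job are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Batch k covers the half-open window
    [k * batch_interval, (k + 1) * batch_interval).
    Returns (batch_start, values) pairs ordered by batch_start.
    Simplification: intervals with no events are skipped entirely.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    return [(index * batch_interval, batches[index]) for index in sorted(batches)]


def run_job(batch_values):
    """Hypothetical stand-in for the per-batch job: count each label."""
    counts = {}
    for label in batch_values:
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical stream of (arrival time in seconds, sentiment label) pairs.
events = [(0.2, "good"), (0.9, "bad"), (1.1, "good"), (1.5, "good"), (3.4, "bad")]

# One job per micro-batch, using a 1-second batch interval.
results = [
    (start, run_job(values))
    for start, values in discretize(events, batch_interval=1.0)
]
```

With a 1-second interval, the five events fall into three batches starting at 0.0, 1.0, and 3.0 seconds; a longer interval would produce fewer, larger batches.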


“…Spark is an open source framework for distributed computing [17]. It is a set of tools and software components structured according to a defined architecture.…”

Section: ) Only Suitable For Processing Data On Batch 2) No Real Tim (mentioning)

confidence: 99%

A New Architecture for Real Time Data Stream Processing

Ounacer

Talhaoui

Ardchir

et al. 2017

ijacsa


Abstract: Processing a data stream in real time is a crucial issue for several applications; however, processing a large amount of data from different sources, such as sensor networks, web traffic, social media, video streams, and other sources, represents a huge challenge. The main problem is that the big data system is based on Hadoop technology, especially MapReduce for processing. The latter is a highly scalable and fault-tolerant framework. It also processes a large amount of data in batches and provides insight into older data, but it can only process a limited set of data. MapReduce is not appropriate for real-time stream processing, yet it is very important to process data the moment they arrive, for a fast response and good decision making. Hence the need for a new architecture that allows real-time data processing with high speed and low latency. The major aim of this paper is to give a clear survey of the different open-source technologies that exist for real-time data stream processing, including their system architectures. We also provide a brand new architecture, which is mainly based on previous comparisons of real-time processing, powered with machine learning and Storm technology.


“…This is because MapReduce jobs need disk I/O operations to shuffle and sort the data during the Map and Reduce phases. Furthermore, Apache Spark provides rich APIs in several languages (Java, Scala, Python, and R) for developers to choose from in order to perform complex operations on distributed RDDs…”

Section: Introduction (mentioning)

confidence: 99%

“…While the Spark application is running, the driver program monitors the executors and sends tasks to them to run in multi‐threaded mode. The Spark application keeps running until the Spark context's stop method is invoked or the main function of the application finishes…”

Section: Introduction (mentioning)

confidence: 99%


Parallel particle swarm optimization classification algorithm variant implemented with Apache Spark

Al‐Sawwa

Ludwig

2019

Concurrency and Computation


Summary: With the rapid development of technologies such as the internet, the amount of data collected or generated in areas such as the agricultural, biomedical, and finance sectors poses challenges to the scientific community because of the volume and complexity of the data. Furthermore, the need for analysis tools that extract useful information for decision support has been receiving more attention, as researchers look for scalable alternatives to traditional algorithms. In this paper, we propose a scalable design and implementation of a particle swarm optimization classification (SCPSO) approach based on the Apache Spark framework. The main idea of the SCPSO algorithm is to find the optimal centroid for each target label using particle swarm optimization and then assign unlabeled data points to the closest centroid. Two variants of SCPSO, SCPSO‐F1 and SCPSO‐F2, were proposed based on different fitness functions and were tested on real data sets to evaluate their scalability and performance. The experimental results revealed that SCPSO‐F1 and SCPSO‐F2 scale very well with increasing data set sizes: the speedup of SCPSO‐F2 is almost identical to the linear speedup, while the speedup of SCPSO‐F1 is very close to it. Thus, SCPSO‐F1 and SCPSO‐F2 can be efficiently parallelized using the Apache Spark framework.
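The final step named in the summary above, assigning each unlabeled point to the closest centroid, might look like the following minimal sketch. It is not the authors' SCPSO implementation: the particle swarm search that finds the centroids is not shown, the centroid values are hard-coded hypothetical numbers, and Euclidean distance is an assumption (the summary does not name the distance measure).

```python
import math


def nearest_label(point, centroids):
    """Return the label whose centroid is closest to the point.

    Assumption: closeness is measured by Euclidean distance.
    """
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical centroids, one per target label. In SCPSO these would be
# found by particle swarm optimization; here they are fixed by hand.
centroids = {"a": (0.0, 0.0), "b": (5.0, 5.0)}

# Hypothetical unlabeled data points.
unlabeled = [(1.0, 0.5), (4.0, 6.0), (2.0, 2.0)]

predicted = [nearest_label(point, centroids) for point in unlabeled]
```

Because each point is classified independently of the others, this step is the kind of per-record work that can be mapped over a partitioned data set.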

