Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java

Omar, Hoger K.; Jumaa, Alaa Khalil

doi:10.24017/science.2019.1.2

Cited by 14 publications

(5 citation statements)

References 14 publications

“…Hadoop, Spark, NO-SQL, Sklearn and Weka libraries, Hive, Cloud, and Rapid Miner technologies are gaining popularity. These technologies are computer software tools for extracting, managing, and analyzing data from a massively complex and large data collection that traditional management tools would never be able to handle [29][30][31][32][33][34]. However, in such a setting, selecting among a variety of technologies may be time consuming and difficult.…”

Section: Big Data Technologies (mentioning)

confidence: 99%

Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem

Junaid, Ali, Siddiqui, et al. 2022. Wireless Pers Commun

Artificial intelligence, specifically machine learning, has been applied in a variety of ways by the research community to transform data sources into valuable facts and understanding, allowing for superior pattern identification. Running machine learning algorithms on huge and complicated data sets is computationally expensive, however, and requires hardware and logical resources such as storage space, CPU, and memory. As the amount of data created daily reaches quintillions of bytes, a complex big data infrastructure becomes more and more relevant. The Apache Spark machine learning library (ML-lib) is a well-known platform for big data analysis. It includes several useful features for machine learning applications, including regression, classification, dimension reduction, clustering, and feature extraction. In this contribution, we consider Apache Spark ML-lib as a computationally independent machine learning library that is open-source, distributed, scalable, and platform independent. To analyze the platform’s qualities, we have evaluated several ML algorithms and compared Apache Spark ML-lib against Rapid Miner and Sklearn, two additional big data and machine learning processing platforms. Four classification algorithms are compared across platforms: Logistic Classifier (LC), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC). In addition, we have tested general regression methods, namely Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR), on the SUSY and HIGGS datasets. Moreover, we have evaluated unsupervised learning methods such as K-means and Gaussian Mixture Models on the SUSY and HEPMASS datasets to determine the robustness of PySpark in comparison with the classification and regression models. We used the "SUSY", "HIGGS", "BANK", and "HEPMASS" datasets from the UCI data repository. We also discuss recent developments in research on Big Data machines and provide future research directions.

“…However, multi-threaded lightweight processes can run on Spark inside Java virtual machine (JVM). Spark can upload and download the data from Apache Hadoop by accessing Hadoop distributed file system (HDFS) since it works on top of the existing Hadoop cluster [18]. The management of several operations is quite simple with Apache Spark by providing a data pipeline method.…”

Section: Alternating Least Squares With Spark (mentioning)

confidence: 99%

Big data cloud-based recommendation system using NLP techniques with machine and deep learning

Omar, Frikha, Jumaa 2023. TELKOMNIKA

Recommendation systems (RS) are crucial for social networking sites. Without them, finding precise products is harder. However, existing systems lack adequate efficiency, especially with big data. This paper presents a prototype cloud-based recommendation system for processing big data. The proposed work is implemented using the matrix factorization method with three approaches. The first approach uses singular value decomposition (SVD), an old and traditional recommendation technique. The second approach is fine-tuned using the alternating least squares (ALS) algorithm with Apache Spark. Finally, the third approach uses a deep neural network (DNN) with TensorFlow. This study addresses the challenge of handling large-scale datasets in the collaborative filtering (CF) technique by tuning the algorithm parameters in the second approach, which uses machine learning, and in the third approach, which uses deep learning. Furthermore, these two approaches outperformed conventional techniques and achieved an acceptable computational time. The dataset is about 1.5 GB in size and was collected from the Goodreads website API. Moreover, the Hadoop distributed file system (HDFS) is used as cloud storage instead of the computer's local disk so that larger datasets can be handled in the future.
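
To make the alternating least squares idea in the abstract above concrete, the following is a minimal pure-Python sketch of rank-1 ALS on a toy rating set. It is not the cited paper's code and not Spark's ALS implementation; the function names, the regularization value, and the ratings are all assumptions made up for illustration. With a single latent factor, each alternating step has a closed-form solution, so no linear algebra library is needed.

```python
import random

# Toy observed ratings {(user, item): rating}; invented for illustration only.
RATINGS = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 2.0, (2, 2): 1.0}


def als_loss(ratings, u, v, reg):
    """Regularized squared error over the observed entries only."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))


def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20, seed=0):
    """Rank-1 ALS: rating(i, j) is approximated by u[i] * v[j]."""
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]
    history = []
    for _ in range(iters):
        # Fix item factors v; each user factor is a 1-D ridge regression.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = reg + sum(v[j] ** 2 for (a, j) in ratings if a == i)
            u[i] = num / den
        # Fix user factors u; solve each item factor the same way.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = reg + sum(u[i] ** 2 for (i, b) in ratings if b == j)
            v[j] = num / den
        history.append(als_loss(ratings, u, v, reg))
    return u, v, history


u, v, history = als_rank1(RATINGS, n_users=3, n_items=3)
```

Because every half-step minimizes the same objective exactly over one block of variables, the recorded loss in `history` never increases from one iteration to the next. A distributed implementation partitions these per-user and per-item solves across workers, which is the part this sketch leaves out.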

“…However, Spark supports four programming environments, which are Java, Scala, Python, and R [20], [21]. Using Scala with Spark speeds up the computation of the algorithms and completes them in less time than Java; furthermore, the advantage of Scala is noticeable in supervised ML algorithms such as regression and unsupervised ML algorithms such as clustering [22].…”

Section: Apache Spark (mentioning)

confidence: 99%

Distributed big data analysis using spark parallel data processing

Omar, Jumaa 2022. Bulletin EEI

Self Cite

Nowadays, the big data marketplace is growing rapidly. The main challenge is finding a system that can store and handle huge volumes of data and then process that data to mine hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It contains a fast big data processing engine using Apache Spark and a big data storage environment using Apache Hadoop. The system is tested on about 11 gigabytes of text data collected from multiple sources for sentiment analysis. The system uses three different machine learning (ML) algorithms, all of which are already supported by the Spark ML package. The system programs were written in the Java and Scala programming languages, and the constructed model combines the classification algorithms and the pre-processing steps in the form of an ML pipeline. The proposed system was implemented in both centralized and distributed data processing modes. Moreover, several dataset manipulation methods were applied in the system tests to check which one provides the best accuracy and time performance. The results showed that the system handles big data efficiently: it achieves excellent accuracy with fast execution time, especially on the distributed data nodes.
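
The idea of chaining pre-processing steps and a classifier into one ML pipeline, as the abstract above describes, can be sketched in a few lines of plain Python. This is an illustration only: it does not use Spark, the abstract does not name the three algorithms, and the naive Bayes classifier, the stop-word list, the class names, and the training sentences are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "this", "it", "was"}  # toy list (assumption)


def tokenize(text):
    """Pre-processing stage 1: lowercase and split on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Pre-processing stage 2: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                if t in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class Pipeline:
    """Runs each pre-processing stage in order, then the classifier."""

    def __init__(self, stages, classifier):
        self.stages = stages
        self.classifier = classifier

    def _transform(self, text):
        for stage in self.stages:
            text = stage(text)
        return text

    def fit(self, texts, labels):
        self.classifier.fit([self._transform(t) for t in texts], labels)
        return self

    def predict(self, text):
        return self.classifier.predict(self._transform(text))


# Invented training sentences, for illustration only.
TRAIN = [("great movie loved it", "pos"), ("wonderful great acting", "pos"),
         ("terrible boring film", "neg"), ("awful terrible plot", "neg")]
model = Pipeline([tokenize, remove_stop_words], NaiveBayes())
model.fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
```

Bundling the stages with the classifier means the same pre-processing is applied at training and prediction time, which is the main reason to express a model as a pipeline rather than as separate steps.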
