With the ongoing technology revolution, big data has become a defining phenomenon of the decade, and it has a significant impact on trends in applied science. Thorough exploration of big data tools is therefore a pressing need. Hadoop is a capable big data analysis technology, but it is slow because the result of each job phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed to be a practical model for analyzing big data, with an innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data processing, and more. This paper presents a comparison of the time performance of Scala and Java in Apache Spark MLlib. Many tests were run on supervised and unsupervised machine learning methods using large datasets. The datasets were loaded both from Hadoop HDFS and from the local disk to identify the pros and cons of each approach and to find the loading configuration that gives the best execution performance. The results showed that Scala performs about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with the more suitable programming language and, as a consequence, gain better performance.
CPU scheduling algorithms play a significant role in multiprogramming operating systems. When CPU scheduling is effective, a high rate of computation can be carried out correctly and the system remains in a stable state. CPU scheduling algorithms are also the main operating system service for maximizing CPU utilization. This paper compares the characteristics of CPU scheduling algorithms to determine which one is best for achieving higher CPU utilization. Ten scheduling algorithms are compared across different parameters, such as performance, algorithm complexity, known problems, average waiting time, advantages and disadvantages, and allocation method. The main purpose of the article is to analyze the CPU scheduler in a way that suits the scheduling goals, and to identify which type of algorithm is most suitable for a particular situation by presenting its full properties.
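To make the "average waiting time" parameter concrete, the following is a minimal sketch and is not taken from the paper. It compares first-come, first-served (FCFS) with non-preemptive shortest-job-first (SJF), under the simplifying assumption that all jobs arrive at time 0. The function names and the burst times are illustrative only.

```python
def average_waiting_time(burst_times):
    """Average waiting time when jobs run back to back in the given order.

    Assumes every job arrives at time 0 and runs to completion once started.
    """
    total_wait = 0  # waiting time summed over all jobs
    clock = 0       # time at which the next job can start
    for burst in burst_times:
        total_wait += clock  # this job waited until the CPU became free
        clock += burst       # the CPU is busy for the job's burst time
    return total_wait / len(burst_times)


def fcfs(burst_times):
    """First-come, first-served: run jobs in submission order."""
    return average_waiting_time(burst_times)


def sjf(burst_times):
    """Non-preemptive shortest-job-first: run the shortest bursts first."""
    return average_waiting_time(sorted(burst_times))


# Hypothetical workload: one long job submitted ahead of two short ones.
bursts = [24, 3, 3]
print("FCFS average wait:", fcfs(bursts))  # 17.0
print("SJF average wait:", sjf(bursts))    # 3.0
```

With this workload the long job at the head of the queue makes the short jobs wait under FCFS, while SJF runs the short jobs first. Handling arrival times, preemption, or priorities would require a fuller simulation.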
The big data marketplace is growing rapidly. The major challenge is finding a system that can store and handle huge volumes of data and then process that data to mine the hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It contains a fast big data processing engine based on Apache Spark and a big data storage environment based on Apache Hadoop. The system is tested on about 11 gigabytes of text data collected from multiple sources for sentiment analysis. Three different machine learning (ML) algorithms, all already supported by the Spark ML package, are used in this system. The system programs were written in Java and Scala, and the constructed model combines the classification algorithms with the pre-processing steps in the form of an ML pipeline. The proposed system was implemented for both centralized and distributed data processing. In addition, several dataset manipulation approaches were applied in the system tests to check which one provides the best accuracy and time performance. The results showed that the system handles big data efficiently: it achieves excellent accuracy with fast execution time, especially on the distributed data nodes.
Recommendation systems (RS) are crucial for social networking sites. Without them, finding the right products is harder. However, existing systems lack adequate efficiency, especially with big data. This paper presents a prototype cloud-based recommendation system for processing big data. The proposed work is implemented using the matrix factorization method with three approaches. The first approach uses singular value decomposition (SVD), a long-established recommendation technique. The second approach is fine-tuned using the alternating least squares (ALS) algorithm with Apache Spark. The third approach uses a deep neural network (DNN) with TensorFlow. This study addresses the challenge of handling large-scale datasets in the collaborative filtering (CF) technique by tuning the parameters of the algorithms in the second approach, which uses machine learning, and in the third approach, which uses deep learning. The results of these two approaches outperformed the conventional technique and achieved an acceptable computational time. The dataset is about 1.5 GB in size and was collected from the Goodreads website API. In addition, the Hadoop distributed file system (HDFS) is used as cloud storage instead of the computer's local disk so that larger datasets can be handled in the future.
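The idea behind alternating least squares can be illustrated with a small sketch. It is not the paper's implementation and not Spark's ALS. It assumes a rank-1 factorization (one latent factor per user and per item), so each update has a simple closed form. The function names, the toy ratings, and the `reg` and `sweeps` values are all hypothetical.

```python
def loss(ratings, u, v, reg):
    """Regularized squared error over the observed ratings only."""
    fit = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
    return fit + penalty


def als_rank1(ratings, n_users, n_items, reg=0.1, sweeps=20):
    """Rank-1 ALS: a predicted rating is u[user] * v[item].

    ratings maps (user, item) pairs to observed ratings.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(sweeps):
        # Hold item factors fixed; each user factor has a closed-form optimum.
        for i in range(n_users):
            seen = [(j, r) for (a, j), r in ratings.items() if a == i]
            num = sum(r * v[j] for j, r in seen)
            den = sum(v[j] ** 2 for j, _ in seen) + reg
            u[i] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for j in range(n_items):
            seen = [(i, r) for (i, b), r in ratings.items() if b == j]
            num = sum(r * u[i] for i, r in seen)
            den = sum(u[i] ** 2 for i, _ in seen) + reg
            v[j] = num / den
    return u, v


# Toy data: 3 users, 3 items, some ratings unobserved.
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 2): 2.0,
           (2, 1): 2.0, (2, 2): 1.0}
u, v = als_rank1(ratings, n_users=3, n_items=3)
print("final loss:", loss(ratings, u, v, 0.1))
print("predicted rating for user 2, item 0:", u[2] * v[0])
```

Because each half-step exactly minimizes the regularized loss with the other side held fixed, the loss never increases from one sweep to the next. A production ALS uses several latent factors per user and item and solves a small linear system per row, which is the part a distributed engine can parallelize.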