A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance

Rahman, Md. Armanur; Hossen, J.; Ho, CK; Geok, Tan Kim; Sultana, Aziza; Jesmeen, M. Z. H.; Hossain, Ferdous

doi:10.11591/ijece.v8i3.pp1854-1862

Cited by 9 publications

(8 citation statements)

References 14 publications

“…The genetic algorithm [14] is a popular algorithm in parameter optimization. MSET [9] is a typical parameter optimization algorithm based on the genetic algorithm.…”

Section: Related Work (mentioning)

confidence: 99%

“…At present, the methods of configuration parameter optimization for MapReduce mainly include the combination of configuration parameters, and parameter optimization methods based on simulators, experience principles, and machine learning [10], [11], [14], [16]. In the process of parameter optimization, these methods take all parameter into account.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Configuration Tuning Method Based on Feature Selection for Hadoop MapReduce

Liu, Tang, et al. 2020. IEEE Access

Configuration parameter optimization is an important means of improving the performance of the MapReduce model. The existing parameter tuning methods usually optimize all configuration parameters in MapReduce. However, it is exceedingly challenging to tune all the parameters for the MapReduce model because there are massive configuration parameters in MapReduce. In this paper, a novel configuration parameter tuning method based on a feature selection algorithm is proposed, and it is composed of the feature selection objective function and feature selection process. The objective function is based on the kernel clustering algorithm, in which anisotropic Gaussian kernel is adopted instead of the traditional Gaussian kernel to accurately judge the importance of each parameter in MapReduce. Then, the relationship between the configuration parameters in MapReduce and the features in the feature selection algorithm is defined. Moreover, the importance of each parameter is reflected by the kernel width of anisotropic Gaussian kernels. At the same time, the method of gradient descent is introduced to update the kernel width and control the feature selection process of the iterative algorithm. Finally, experimental results show that the proposed algorithm performs suitably for the MapReduce model.

“…The traditional computing system cannot offer the necessary efficiency and performance. Therefore, the big data industries have seen various platforms such ad Spark [4], Haddoo [5,6] and Strom [7] to entertain the demands of a large amount of big data processing. Apache spark is one of the most widespread frameworks among the prevailing distributes framework, due to its great capability to sustenance heavy applications and for complex data processing performance [2,4].…”

Section: Introduction (mentioning)

confidence: 99%

Towards machine learning-based self-tuning of Hadoop-Spark system

Rahman, Hossen, et al. 2019. IJEECS (self-citation)

Apache Spark is an open source distributed platform which uses the concept of distributed memory for processing big data. Spark has more than 180 predominant configuration parameter. Configuration settings directly control the efficiency of Apache spark while processing big data, to get the best outcome yet a challenging task as it has many configuration parameters. Currently, these predominant parameters are tuned manually by trial and error. To overcome this manual tuning problem in this paper proposed and developed a self-tuning approach using machine learning. This approach can tune the parameter value when it’s required. The approach was implemented on Dell server and experiment was done on five different sizes of the dataset and parameter. A comparison is provided to highlight the experimented result of the proposed approach with default Spark configuration system. The results demonstrate that the execution is speeded-up by about 33% (on an average) compared to the default configuration.

“…Multinode cluster‐based Hadoop framework structure distributed locally or in remote locations works efficiently to the storage and processing of the big data (ie, on the client‐server architecture). Figure shows the basic block diagram of Hadoop ecosystem representing 2 remote locations where the data nodes are situated and a third remote location with name node, for controlling the Hadoop multinode cluster …”

Section: Introduction (mentioning)

confidence: 99%

Hadoop‐based analytic framework for cyber forensics

Chhabra, Singh. 2018. Int J Communication

With an exponential increase in the data size and complexity of various documents to be investigated, existing methods of network forensics are found not much efficient with respect to accuracy and detection ratio. The existing techniques for network forensic analysis exhibit inherent limitations while processing a huge volume, variety, and velocity of data. It makes network forensic a time-consuming and resource-consuming task. To balance time taken and output delivered, these existing techniques put a limit on the amount of data under analysis, which results in a polynomial time complexity of these solutions. So to mitigate these issues, in this paper, we propose an effective framework to overcome the limitation to handle large volume, variety, and velocity of data. An architectural setup that consists of MapReduce framework on top of Hadoop Distributed File System environment is proposed in this paper. The proposed framework demonstrates its capability to handle issues of storage and processing of big data using cloud computing. Also, in the proposed framework, supervised machine learning (random forest-based decision tree) algorithm has been implemented to demonstrate better sensitivity. To train and validate the model, online available data set from CAIDA is taken and university network traffic samples, with increasing size, has been taken for experiment. Results thus obtained confirm the superiority of the proposed framework in network forensics, with an average accuracy of 99.34% (malicious and nonmalicious traffic).
