MapReduce‐based parallel GEP algorithm for efficient function mining in big data applications

Liu, Yang; Ma, Chen-Xiao; Xu, Lixiong; Shen, Xiaodong; Li, Maozhen; Li, Pengcheng

doi:10.1002/cpe.4379

Cited by 5 publications

(6 citation statements)

References 21 publications

“…The differences between Spark and Hadoop in intermediate data buffer result in high performance of iterative applications and interactive data mining with Spark. 17,25 Dharanipragada et al proposed Generate-Map-Reduce (GMR), which was an extension to MapReduce, to support iterative jobs and a distributed communication model by using shared data structures. GMR captured recursive computations by modeling iterative applications, such as simulated annealing and A* search.…”

Section: 2 (mentioning)

confidence: 99%

Performance enhancement for iterative data computing with in‐memory concurrent processing

Wen, Chen, Chiu et al. 2019

Concurrency and Computation

Summary: The big data era has resulted in the development of several data analysis tools. Spark is an in‐memory processing tool suited to iterative and interactive data mining. This tool possesses higher data‐processing performance than MapReduce, which is an offline storage mechanism. However, some disadvantages of in‐memory processing, such as massive in‐memory data requirements, cause cross‐node data transfer that results in a long computation time. The performance of the process can be improved if the in‐memory process is executed with fewer shuffle instructions. Therefore, this study aims to enhance the performance of iterative applications through instruction replacement. Three empirical research cases with diverse datasets and iterations are used to modify the program. We adopt a strategy of downloading a small resilient distributed dataset and replacing the shuffle‐included instructions to shorten the processing time with an automated code replacement by using exhaustive code matching. The experimental results reveal an improvement of up to 39% in the execution time compared with the existing in‐memory processing programs with various dataset sizes.

Section: 2 (mentioning)

confidence: 99%

“…Li and Shen evaluated the handling platform between local and remote file systems for a given application. 4,25 Samadi et al compared the performance according to the criteria execution time, throughput, and speedup. 6 They had evaluated the performance observed by Spark is higher than Hadoop.…”

Section: Related Work Comparisons (mentioning)

confidence: 99%

Performance enhancement for iterative data computing with in‐memory concurrent processing

Wen, Chen, Chiu et al. 2019

Concurrency and Computation

“…However, the authors still report that low efficiency issue occurs when the algorithms are dealing with the large-volume load data due to the algorithm overhead. As a result, Liu et al (2016), Liu et al (2017), and finally introduce the distributed computing to improve the efficiency of the large-scale load data classification. The authors report that because of the difficulties in the algorithm decoupling, the ensemble learning technology is a necessary tool to implement algorithm parallelization.…”

Section: Introduction (mentioning)

confidence: 99%

An improved selective ensemble learning approach in enabling load classification considering base classifier redundancy and class imbalance

Wang, Ding, Hua et al. 2022

Front. Energy Res.

Self Cite

In modern power systems, analyzing the behaviors of the end users can help to improve the system’s security, stability, and economy. Load classification provides an efficient way to implement awareness of the user’s behaviors. However, due to the development of data collection, transmission, and storage technologies, the volumes of the load data keep increasing. Meanwhile, the structure and knowledge hidden in the data become ever more complicated. Therefore, the parallelized ensemble learning method has been widely employed in recent load classification research. Although the positive performance of ensemble learning has been proven, two critical issues remain: class imbalance and base classifier redundancy. These issues raise challenges of improving the classification accuracy and saving computational resources. Therefore, to solve the issues, this article presents an improved selective ensemble learning approach to enable load classification considering base classifier redundancy and class imbalance. First, a Gaussian SMOTE based on density clustering (GSDC) is introduced to handle the class imbalance, which aims to achieve higher classification accuracy. Second, the classifier pruning strategy and the optimization strategy of the ensemble learning are further introduced to handle the base classifier redundancy. The experimental results indicate that when combined with the popular classifiers, the presented approach shows effectiveness for serving the load classification tasks.

“…There have been many attempts to improve its performance, especially for data characterized by massive amount. For example Liu et al 2 considered the parallelizing GEP algorithm to enable large-scale classification, using majority-voting to combine a number of GEP-based classifiers obtained for separate data chunks.…”

Section: Introduction (mentioning)

confidence: 99%

Implementing Gene Expression Programming in the Parallel Environment for Big Datasets’ Classification

Jędrzejowicz, Jędrzejowicz, Wierzbowska 2019

Vietnam J. Comp. Sci.

The paper investigates a Gene Expression Programming (GEP)-based ensemble classifier constructed using the stacked generalization concept. The classifier has been implemented with a view to enable parallel processing with the use of Spark and SWIM — an open source genetic programming library. The classifier has been validated in computational experiments carried out on benchmark datasets. Also, it has been investigated how the results are influenced by some settings. The paper is an extension of a previous paper of the authors.
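
The citation statement above mentions combining classifiers trained on separate data chunks by majority voting. A minimal Python sketch of that general idea follows. It is an illustration under stated assumptions, not the method of either paper: the function names are hypothetical, the toy data is made up, and a simple one-feature threshold rule stands in for a GEP-evolved classifier.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Labeled = Tuple[Sample, int]
Classifier = Callable[[Sample], int]


def split_into_chunks(data: List[Labeled], n_chunks: int) -> List[List[Labeled]]:
    """Deal the rows round-robin into n_chunks separate lists."""
    return [data[i::n_chunks] for i in range(n_chunks)]


def train_threshold_classifier(chunk: List[Labeled]) -> Classifier:
    """Hypothetical stand-in for GEP training on one chunk.

    Assumes two classes (0 and 1), both present in the chunk, with class 1
    having the larger mean on feature 0. The decision threshold is the
    midpoint between the two class means.
    """
    zeros = [x[0] for x, y in chunk if y == 0]
    ones = [x[0] for x, y in chunk if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x[0] > threshold else 0


def majority_vote(classifiers: List[Classifier], x: Sample) -> int:
    """Return the label predicted by the most classifiers for sample x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): class 0 clusters low, class 1 high.
data: List[Labeled] = (
    [([float(i)], 0) for i in range(0, 30)]
    + [([float(i)], 1) for i in range(70, 100)]
)

# One classifier per chunk; an odd number of voters avoids ties for two classes.
ensemble = [train_threshold_classifier(c) for c in split_into_chunks(data, 3)]

print(majority_vote(ensemble, [10.0]))
print(majority_vote(ensemble, [90.0]))
```

Because each chunk is trained independently, the per-chunk training step is the part that a MapReduce or Spark job could distribute, with the vote applied afterwards.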
