2021
DOI: 10.1016/j.knosys.2020.106582
A fast parallel attribute reduction algorithm using Apache Spark

Cited by 15 publications (3 citation statements) | References 24 publications
“…Spark is a distributed computing framework based on in-memory computing, with faster computation and better iterative performance than Hadoop MapReduce [31]. Yin et al. [32] designed a new parallel attribute reduction algorithm based on Spark to resolve the limitations of MapReduce. Luo et al. [33], [34] proposed a novel Spark-based parallel attribute reduction algorithm built on a rough hypercuboid model, which employs two parallel strategies: vertical partitioning and horizontal partitioning.…”
Section: A Novel Spark-based Attribute Reduction and Neighborhood Cla...mentioning
confidence: 99%
“…In this section, we use speedup, scaleup, and sizeup [30] to evaluate the parallel performance of the proposed algorithm and compare it with the standard Spark-MLRF algorithm.…”
Section: Parallel Performance Evaluationmentioning
confidence: 99%
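The three metrics named in the excerpt have standard definitions: speedup fixes the data size and grows the cluster, scaleup grows the data and the cluster together (ideal value 1), and sizeup fixes the cluster and grows the data. A minimal sketch of how they are computed from measured running times (the functions are illustrative, not the paper's code):

```python
def speedup(t_one_node, t_m_nodes):
    """Speedup(m) = T(1 node, data D) / T(m nodes, data D)."""
    return t_one_node / t_m_nodes

def scaleup(t_one_node_base, t_m_nodes_scaled):
    """Scaleup(m) = T(1 node, D) / T(m nodes, m*D); ideally stays near 1."""
    return t_one_node_base / t_m_nodes_scaled

def sizeup(t_base, t_scaled):
    """Sizeup(m) = T(fixed nodes, m*D) / T(fixed nodes, D)."""
    return t_scaled / t_base
```

For example, if a job takes 100 s on one node and 25 s on four nodes (made-up timings, not results from the paper), the speedup is 4, i.e. linear.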
“…Attribute reduction [4,30,34,55] is an essential idea in rough set theory: it selects the attributes or features with the highest discriminative or predictive power from the original data, reducing redundancy and noise [20,36,54]. In the era of big data, attribute reduction can eliminate irrelevant or unimportant attributes from high-dimensional datasets [7,56], improve data quality, uncover the potential value and knowledge in data, reduce storage and computation costs, and speed up data processing [8,57].…”
Section: Introductionmentioning
confidence: 99%
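In the classical rough-set formulation this selection is driven by the dependency degree γ_B(D) = |POS_B(D)| / |U|: the fraction of objects whose equivalence class under an attribute subset B is consistent with the decision. A reduct is a minimal B that preserves the dependency degree of the full attribute set. A minimal sketch of the computation (helper names and the toy table are illustrative, not from the cited works):

```python
from collections import defaultdict

def dependency_degree(table, attrs, decision_idx):
    """gamma_B(D) = |POS_B(D)| / |U|: fraction of objects whose
    equivalence class under attribute subset `attrs` has a single
    decision value (i.e. lies in the positive region)."""
    # Group decision values by equivalence class under `attrs`.
    classes = defaultdict(set)
    for row in table:
        key = tuple(row[a] for a in attrs)
        classes[key].add(row[decision_idx])
    # Count objects whose class is consistent with the decision.
    pos = sum(1 for row in table
              if len(classes[tuple(row[a] for a in attrs)]) == 1)
    return pos / len(table)

table = [
    (1, 0, 'yes'),
    (1, 1, 'no'),
    (0, 1, 'yes'),
    (0, 0, 'yes'),
]
```

Here attribute 0 alone leaves the first two objects indiscernible, so its dependency degree is below 1, while the pair {0, 1} discriminates every object; a greedy reducer would therefore keep both.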