Improving instance selection methods for big data classification

Malhat, Mohamed; Menshawy, Mohamed El; Mousa, Hamdy M.; Sisi, Ashraf El

doi:10.1109/icenco.2017.8289790

Cited by 3 publications

(11 citation statements)

References 20 publications

“…A systematic method of sampling that mediates the tensions between resource constraints, data characteristics, and the learning algorithms accuracy is needed [20]. Intricate methods to subset the big data, such as instance selection [21] and inverse sampling [22], are computationally expensive [23,24] because of the inefficient multiple preprocessing steps. Furthermore, newly added data points change data statistical measures and require re-sampling.…”

Section: Techniques For Data Reduction (mentioning)

confidence: 99%

Bayes Classification and Entropy Discretization of Large Datasets using Multi-Resolution Data Aggregation

Alwajidi, Yang

2020

Adv. sci. technol. eng. syst. j.

Big data analysis has important applications in many areas such as sensor networks and connected healthcare. High volume and velocity of big data bring many challenges to data analysis. One possible solution is to summarize the data and provide a manageable data structure to hold a scalable summarization of data for efficient and effective analysis. This research extends our previous work on developing an effective technique to create, organize, access, and maintain summarization of big data and develops algorithms for Bayes classification and entropy discretization of large data sets using the multi-resolution data summarization structure. Bayes classification and data discretization play essential roles in many learning algorithms such as decision tree and nearest neighbor search. The proposed method can handle streaming data efficiently and, for entropy discretization, provide sufficient information to find the optimal split variable and the optimal split value.

“…The continuous growth of data size makes the traditional IS methods unable to process training dataset in a single machine, due to memory limitations [9]. Therefore, new approaches are proposed that partition the training dataset into subsets and apply IS methods to each subset separately [10][11][12]. The approach in [10] uses random partitioning to partition a given training dataset into a group of manageable subsets.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, the performance of the applied IS method to the partitioned subsets is degraded, especially for class-imbalanced datasets. In order to overcome this limitation, the approaches in [11,12] use stratification partitioning to ensure the equal distribution of data classes into subsets, while the instances of the same class are assigned randomly to subsets. The common feature of these approaches [10][11][12] is the random partitioning of the instances, which leads to a random representation of the instances in the partitioned subsets.…”

Section: Introduction (mentioning)

confidence: 99%

“…In order to overcome this limitation, the approaches in [11,12] use stratification partitioning to ensure the equal distribution of data classes into subsets, while the instances of the same class are assigned randomly to subsets. The common feature of these approaches [10][11][12] is the random partitioning of the instances, which leads to a random representation of the instances in the partitioned subsets. This representation is insufficient for the employed IS method to get acceptable results, especially when highly scales up the number of subsets.…”

Section: Introduction (mentioning)

confidence: 99%

“…In order to assess the importance of overlapping, we develop a non-overlapped version from our approach called Class-balance Distance-based Partitioning (CDP). We compare the OCDP approach with the stratification partitioning used in [11,12] and the developed CDP approach in terms of 1) reduction rate, classification accuracy, and effectiveness, and 2) scalability aspect. Our experimental results prove that the OCDP approach has a better reduction rate and effectiveness results than the stratification and CDP approaches.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Scalable and Effective Partitioning Approach for Big Data Reduction

Malhat, Elmenshawy, Mousa, et al.

2019

IJCI. International Journal of Computers and Information

Self Cite

The continuous increase in data size makes the traditional instance selection methods ineffective at reducing big training datasets on a single machine. Recent approaches to solving this technical problem partition the training dataset into subsets before applying the instance selection methods to each subset separately. However, the performance of the instance selection methods applied to the subsets is negatively affected, especially when the number of partitioned subsets is increased. In this work, we propose a novel scalable and effective automated partitioning approach, called overlapped distance-based class-balance partitioning. This approach distributes the training dataset instances to the partitioned subsets based on a given distance metric and ensures the equal representation of data classes in the partitioned subsets. Moreover, the instances might be assigned to two subsets once they satisfy the dynamic threshold. We implement and test empirically the scalability and effectiveness of the proposed approach using the condensed nearest neighbor method over eight standard datasets. The proposed approach is compared empirically and analytically with the stratification partitioning approach and a non-overlapped version of our approach with respect to 1) the reduction rate, classification accuracy, and effectiveness metrics, and 2) the scalability aspect, where the number of subsets is increased. The comparison results demonstrate that our approach is more scalable and effective than other partitioning approaches with respect to these standard datasets.
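
The stratification partitioning that this abstract uses as a baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stratified_partition`, its parameters, and the round-robin dealing are assumptions made for illustration. The sketch only shows the general idea of keeping class proportions roughly equal across subsets while assigning instances of the same class at random.

```python
import random
from collections import defaultdict


def stratified_partition(labels, n_subsets, seed=0):
    """Split instance indices into n_subsets with near-equal class proportions.

    Hypothetical helper for illustration only: the indices of each class are
    shuffled, then dealt round-robin across the subsets.
    """
    rng = random.Random(seed)

    # Group instance indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    subsets = [[] for _ in range(n_subsets)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within a class
        for pos, idx in enumerate(indices):
            subsets[pos % n_subsets].append(idx)
    return subsets


labels = ["a"] * 6 + ["b"] * 3
parts = stratified_partition(labels, n_subsets=3)
for part in parts:
    counts = {c: sum(labels[i] == c for i in part) for c in ("a", "b")}
    print(sorted(part), counts)
```

An instance selection method, such as the condensed nearest neighbor method named in the abstract, would then be applied to each subset separately.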