An instance selection method for large datasets based on Markov Geometric Diffusion

Silva, Duílio A.N.S.; Souza, Leandro Carlos de; Motta, Gustavo H. M. B.

doi:10.1016/j.datak.2015.11.002

Cited by 13 publications

(4 citation statements)

References 29 publications


“…Due to this problem, minority class samples are less attended to, affecting incorrect classification results [18]. The classification error of an unbalanced data set is exacerbated by the limited number of samples and a large number of features [20, 21]. Therefore, it is necessary to consider selecting an appropriate analysis model based on such unbalanced data in computer modeling.…”

Section: Discussion (mentioning)

confidence: 99%

Prediction of lymphedema occurrence in patients with breast cancer using the optimized combination of ensemble learning algorithm and feature selection

Notash, Omidi, et al. 2022

BMC Med Inform Decis Mak


Background: Breast cancer-related lymphedema is one of the most important complications that adversely affect patients' quality of life. Lymphedema can be managed if its risk factors are known and can be modified. This study aimed to select an appropriate model to predict the risk of lymphedema and determine the factors affecting lymphedema.
Method: This study was conducted on data of 970 breast cancer patients with lymphedema referred to a lymphedema clinic. This study was designed in two phases: developing an appropriate model to predict the risk of lymphedema and identifying the risk factors. The first phase included data preprocessing, optimizing feature selection for each base learner by the Genetic algorithm, optimizing the combined ensemble learning method, and estimating fitness function for evaluating an appropriate model. In the second phase, the influential variables were assessed and introduced based on the average number of variables in the output of the proposed algorithm.
Result: Once the sensitivity and accuracy of the algorithms were evaluated and compared, the Support Vector Machine algorithm showed the highest sensitivity and was found to be the superior model for predicting lymphedema. Meanwhile, the combined method had an accuracy coefficient of 91%. The extracted significant features in the proposed model were the number of lymph nodes to the number of removed lymph nodes ratio (68%), feeling of heaviness (67%), limited range of motion in the affected limb (65%), the number of the removed lymph nodes (64%), receiving radiotherapy (63%), misalignment of the dominant and the involved limb (62%), presence of fibrotic tissue (62%), type of surgery (62%), tingling sensation (62%), the number of the involved lymph nodes (61%), body mass index (61%), the number of chemotherapy sessions (60%), age (58%), limb injury (53%), chemotherapy regimen (53%), and occupation (50%).
Conclusion: Applying a combination of ensemble learning approach with the selected classification algorithms, feature selection, and optimization by Genetic algorithm, Lymphedema can be predicted with appropriate accuracy. Developing applications by effective variables to determine the risk of lymphedema can help lymphedema clinics choose the proper preventive and therapeutic method.


“…Data reduction may be in terms of the number of rows (instances) or in terms of the number of columns (features) (Aggarwal, 2015). In this sense, three main approaches have been proposed: (1) feature selection (Cheng, Cai, Zhang, Xu, & Su, 2015; Ganapathi & Duraivelu, 2015; Xia, Fang, & Zhang, 2014), (2) instance selection (García, Luengo, & Herrera, 2015; de Oliveira Moura, de Freitas, Cardoso, & Cavalcanti, 2014; Silva, Souza, & Motta, 2016) and (3) hybrid, where feature selection and instance selection are combined (Chen, Zhang, Jin, & Kim, 2014).…”

Section: Related Work (mentioning)

confidence: 99%

A data reduction strategy and its application on scan and backscatter detection using rule-based classifiers

Herrera-Semenets, Pérez-García, Hernández-León, et al. 2018

Expert Systems with Applications


“…This problem causes underestimation of the minority class examples and produces bias and inaccurate classification results toward the majority class examples [1]. Classification of an imbalanced data set becomes more severe with limited number samples and a huge number of features [3, 4].…”

Section: Introduction (mentioning)

confidence: 99%

Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm

Sharifai, Zainol 2020

Genes


The training machine learning algorithm from an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but with a massive number of features (high dimensionality). The high dimensional and imbalanced data set has posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers investigated either imbalanced class or high dimensional data sets and came up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problem due to their complicated interactions. Lately, feature selection has become a well-known technique that has been used to overcome this problem by selecting discriminative features that represent minority and majority class. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA has employed an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to construct the feature selection process as an optimisation problem to select the best (near-optimal) combination of features from the majority and minority class. The obtained results, supported by the proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced datasets in terms of G-mean and the Area Under the Curve (AUC) performance metrics.
