Evaluating feature selection strategies for high dimensional, small sample size datasets

Golugula, Abhishek; Lee, George; Madabhushi, Anant

doi:10.1109/iembs.2011.6090214

Cited by 24 publications

(20 citation statements)

References 12 publications

“…Stepwise feature selection (27) was used with multiple linear regression analysis to select the subset of features for the classification task by using the Wilks λ criterion. To reduce database bias, stepwise feature selection and classification were conducted concurrently within a leave-one-case-out cross-validation manner.…”

Section: Classifier Development (mentioning)

confidence: 99%

Digital Mammography in Breast Cancer: Additive Value of Radiomics of Breast Parenchyma

Mendel et al. 2019, Radiology

Background: Previous studies have suggested that breast parenchymal texture features may reflect the biologic risk factors associated with breast cancer development. Therefore, combining the characteristics of normal parenchyma from the contralateral breast with radiomic features of breast tumors may improve the accuracy of digital mammography in the diagnosis of breast cancer. Purpose: To determine whether the addition of radiomic analysis of contralateral breast parenchyma to the characterization of breast lesions with digital mammography improves lesion classification over that with radiomic tumor features alone. Materials and Methods: This HIPAA-compliant, retrospective study included 182 patients (age range, 25-90 years; mean age, 55.9 years ± 14.9) who underwent mammography between June 2002 and July 2009. There were 106 malignant and 76 benign lesions. Automatic lesion segmentation and radiomic analysis were performed for each breast lesion. Radiomic texture analysis was applied in the normal regions of interest in the contralateral breast parenchyma to assess the mammographic parenchymal patterns. The classification performance of both individual features and the output from a Bayesian artificial neural network classifier was evaluated with the leave-one-patient-out method by using the area under the receiver operating characteristic curve (AUC) as the figure of merit in the task of differentiating between malignant and benign lesions. Results: The performance of the combined lesion and parenchyma classifier in the differentiation between malignant and benign mammographic lesions was better than that with the lesion features alone (AUC = 0.84 ± 0.03 vs 0.79 ± 0.03, respectively; P = .047). Overall, six radiomic features (spiculation, margin sharpness, size, circularity from the tumor feature set, and skewness and power law beta from the parenchymal feature set) were selected more than 50% of the time during the feature selection process on the combined feature set. Conclusion: Combining quantitative radiomic data from tumors with contralateral parenchyma characterizations may improve diagnostic accuracy for breast cancer.
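The abstract reports features that were selected in more than 50% of the feature selection runs. A minimal sketch of how such a selection frequency could be tallied across leave-one-out folds follows. It is not the authors' method: a simple class-mean-difference filter stands in for stepwise selection, and the synthetic data, the function names, and `k=5` are assumptions made for illustration.

```python
import numpy as np

def top_k_by_mean_difference(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def selection_frequency(X, y, k):
    """Fraction of leave-one-out folds in which each feature is selected."""
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for held_out in range(n_samples):
        train = np.arange(n_samples) != held_out  # boolean mask without one case
        counts[top_k_by_mean_difference(X[train], y[train], k)] += 1
    return counts / n_samples

# Synthetic data (assumption): 30 cases, 50 features, two of them informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 15)
X = rng.standard_normal((30, 50))
X[:, 0] += 3.0 * y
X[:, 1] -= 3.0 * y

freq = selection_frequency(X, y, k=5)
stable = np.flatnonzero(freq > 0.5)  # features chosen in more than half of the folds
print("Features selected in more than 50% of folds:", stable)
```

Features that clear the 50% threshold are the ones whose selection does not hinge on any single held-out case.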

“…Selecting a feature subset of low cardinality and high discrimination power has been a centre-stage quest since the dawn of pattern recognition [1,2,3,4]. Feature selection from high-dimensional data has been extensively studied [5,6,7,8,9,10,11,12]. In many cases, feature selection is sought as the end goal of the data analysis.…”

Section: Introduction (mentioning)

confidence: 99%

On feature selection protocols for very low-sample-size data

Kuncheva, Rodríguez 2018, Pattern Recognition

High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely-used feature selection protocol for such type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For the lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) compared to that returned by the currently favoured protocol.
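The two protocols contrasted in this abstract can be sketched on pure-noise data, where the true accuracy is chance level. This is an illustrative sketch, not the experiment from the paper: the mean-difference filter, the nearest-centroid classifier, the data sizes, and all function names are assumptions.

```python
import numpy as np

def select_top_k(X, y, k):
    """Rank features by absolute difference of class means (a simple filter)."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k, n_folds, select_inside, rng):
    """Cross-validated accuracy with selection inside or outside the loop."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    # Two-step protocol: features are chosen once, using every sample.
    global_feats = None if select_inside else select_top_k(X, y, k)
    correct = 0
    for test in folds:
        train = np.setdiff1d(idx, test)
        # Single-loop protocol: features are chosen from the training fold only.
        feats = select_top_k(X[train], y[train], k) if select_inside else global_feats
        pred = nearest_centroid_predict(X[train][:, feats], y[train], X[test][:, feats])
        correct += int((pred == y[test]).sum())
    return correct / len(y)

# Pure noise (assumption): 40 samples, 2000 features, labels unrelated to X.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2000))
y = np.repeat([0, 1], 20)

biased_acc = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False, rng=rng)
proper_acc = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True, rng=rng)
print(f"selection outside CV: {biased_acc:.2f}, selection inside CV: {proper_acc:.2f}")
```

Because the labels carry no signal, any accuracy well above 0.5 under the outside-the-loop variant comes from the test samples having already influenced which features were kept.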

“…In this study, a range of methods were used. These methods are outlined below; they have been reported on extensively previously [30-32,37,39,41-59] and detailed methods are also included as Appendix S1.…”

Section: Methods (mentioning)

confidence: 99%

Development of a Gaussian Process – feature selection model to characterise (poly)dimethylsiloxane (Silastic®) membrane permeation

Sun, Hewitt, Wilkinson et al. 2020, Journal of Pharmacy and Pharmacology

Objectives: The current study aims to determine the effect of physicochemical descriptor selection on models of polydimethylsiloxane permeation. Methods: A total of 2942 descriptors were calculated for a data set of 77 chemicals. Data were processed to remove redundancy, single values, imbalanced and highly correlated data, yielding 1363 relevant descriptors. For four independent test sets, feature selection methods were applied and modelled via a variety of Machine Learning methods. Key findings: Two sets of molecular descriptors which can provide improved predictions, compared to existing models, have been identified. Best permeation predictions were found with Gaussian Process methods. The molecular descriptors describe lipophilicity, partial charge and hydrogen bonding as key determinants of PDMS permeation. Conclusions: This study highlights important considerations in the development of relevant models and in the construction and use of the data sets used in such studies, particularly that highly correlated descriptors should be removed from data sets. Predictive models are improved by the methodology adopted in this study, notably the systematic evaluation of descriptors, rather than simply using any and all available descriptors, often based empirically on in vitro experiments. Such findings also have clear relevance to a number of other fields.
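The pre-processing step described in this abstract (dropping single-valued and highly correlated descriptors) might look roughly like the sketch below. The abstract does not state the correlation cut-off or the order in which descriptors were compared, so the `corr_threshold=0.95` default, the keep-the-first-seen rule, and the function name are assumptions; the "imbalanced" filter is not covered.

```python
import numpy as np

def prefilter_descriptors(X, corr_threshold=0.95):
    """Return indices of columns kept after dropping constant and highly correlated ones."""
    # Drop single-valued columns: they carry no information and break correlations.
    varying = [j for j in range(X.shape[1]) if X[:, j].max() > X[:, j].min()]
    if not varying:
        return []
    corr = np.atleast_2d(np.abs(np.corrcoef(X[:, varying], rowvar=False)))
    kept_positions = []
    for pos in range(len(varying)):
        # Keep a column only if it is not too correlated with any column already kept.
        if all(corr[pos, prev] < corr_threshold for prev in kept_positions):
            kept_positions.append(pos)
    return [varying[pos] for pos in kept_positions]

# Toy descriptor matrix (assumption): column 1 duplicates column 0 up to scale,
# column 2 is constant, column 3 is independent.
rng = np.random.default_rng(0)
a = rng.standard_normal(30)
d = rng.standard_normal(30)
X = np.column_stack([a, 2.0 * a, np.full(30, 7.0), d])

kept = prefilter_descriptors(X)
print("Descriptors kept:", kept)
```

A greedy pass like this depends on column order, so a real pipeline would need a stated rule for which member of a correlated pair to retain.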
