2009
DOI: 10.1089/cmb.2008.0037
Use of Wrapper Algorithms Coupled with a Random Forests Classifier for Variable Selection in Large-Scale Genomic Association Studies

Abstract: Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a “wrapper” strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting …
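The abstract describes a wrapper loop: a search procedure proposes variable subsets, and a Random Forests classifier scores each subset. SIZEFIT itself is not reproduced here; the following is a minimal sketch of that general idea in Python (scikit-learn), assuming a numeric feature matrix X and labels y, with a simple greedy forward search standing in for the paper's local search/optimization algorithms.

# Minimal sketch of a wrapper-style variable selection loop built around a
# Random Forests classifier. This is NOT the SIZEFIT implementation from the
# paper; X, y and the greedy forward-selection heuristic are illustrative
# assumptions only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def wrapper_select(X, y, max_vars=20, cv=5, random_state=0):
    """Greedy forward selection: add the variable that most improves the
    cross-validated accuracy of the RF classifier; stop when no gain."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining and len(selected) < max_vars:
        scores = []
        for j in remaining:
            cols = selected + [j]
            clf = RandomForestClassifier(n_estimators=200,
                                         random_state=random_state)
            s = cross_val_score(clf, X[:, cols], y, cv=cv).mean()
            scores.append((s, j))
        s, j = max(scores)
        if s <= best_score:          # no improvement: local optimum reached
            break
        best_score = s
        selected.append(j)
        remaining.remove(j)
    return selected, best_score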

Cited by 22 publications (15 citation statements). References 30 publications. Citing publications span 2011–2022.

Citation statements, ordered by relevance:
“…Subsequently, a set of features producing the highest accuracy by cross-validation was identified as the optimal feature subset. Many previous studies preferred to select SVM as the learning scheme due to its superiority compared to the other classifiers [12,38], but the RF classifier has also recently been used [39]. Since RF and SVM classifiers were employed as the classification techniques tested in this study (see Section 2.4), we tested two wrapper methods, and the learning schemes were set to RF and SVM classifiers, respectively, to achieve the best possible classification performance for feature selection.…”
Section: (3) SVM Recursive Feature Elimination (SVM-RFE), mentioning
confidence: 99%
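The recursive elimination named in the section heading works by repeatedly refitting the learner, ranking features (by the magnitude of the weight vector for a linear SVM, by impurity-based importance for RF), and dropping the weakest until cross-validated accuracy peaks. Below is a minimal sketch of both wrapper variants, assuming scikit-learn and a synthetic dataset in place of the study's data.

# Sketch of the two wrapper variants discussed above: recursive feature
# elimination driven by a linear SVM (SVM-RFE) and by a Random Forests
# classifier. Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# SVM-RFE: rank features by |w| of a linear SVM, drop the weakest each round;
# RFECV picks the subset size with the best cross-validated accuracy.
svm_rfe = RFECV(SVC(kernel="linear"), step=1, cv=5).fit(X, y)

# Same wrapper with RF: ranking uses impurity-based feature importances.
rf_rfe = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
               step=1, cv=5).fit(X, y)

print("SVM-RFE kept", svm_rfe.n_features_, "features;",
      "RF-RFE kept", rf_rfe.n_features_)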
“…Díaz-Uriarte and Alvarez de Andrés [2006] suggested removing the bottom 10% and re-running until prediction decreased. Rodin et al [2009] devised a method for selecting variables based on specification of optimal model size. Goldstein et al [2010] examined the scree plots of the VI measures and used the "elbow" as the cut-off.…”
Section: Determining Important Variables, mentioning
confidence: 99%
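The Díaz-Uriarte-style rule quoted above is easy to state as a loop: fit a forest, drop the bottom 10% of variables by importance, refit, and keep the subset with the best predictive score. Here is a sketch under assumed data X, y, using out-of-bag accuracy as the selection criterion; the 10% fraction comes from the text, while the exact bookkeeping is an illustrative choice.

# Sketch of the iterative scheme attributed above to Díaz-Uriarte and
# Alvarez de Andrés: repeatedly drop the bottom 10% of variables by RF
# importance and keep the subset with the best out-of-bag (OOB) accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iterative_elimination(X, y, drop_frac=0.10, random_state=0):
    cols = np.arange(X.shape[1])
    best_cols, best_oob = cols, -np.inf
    while len(cols) > 1:
        rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    random_state=random_state)
        rf.fit(X[:, cols], y)
        if rf.oob_score_ > best_oob:                 # best subset so far
            best_oob, best_cols = rf.oob_score_, cols
        k = max(1, int(len(cols) * drop_frac))       # number to drop this round
        order = np.argsort(rf.feature_importances_)  # ascending importance
        cols = cols[order[k:]]                       # keep all but the bottom k
    return best_cols, best_oob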
“…Random Forests, in particular, is a randomized decision tree ensemble that has attractive scalability properties (proportional to the square root of the number of variables) in the approximately 500,000–1 million variables range, which makes it very appealing to GWAS and similar analyses ([26,30,31], see also [32] for a recent overview). Numerous software implementations specifically aimed at the genomic data exist.…”
Section: Machine Learning Methods, mentioning
confidence: 99%
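The square-root scaling mentioned in the quote reflects how a forest chooses split variables: each node inspects only a random subset of roughly sqrt(p) of the p variables, so the per-split cost grows with sqrt(p) rather than p. A one-line illustration follows; the forest size and variable count are assumed values, not figures from the cited work.

# Each split considers about sqrt(p) randomly chosen candidate variables,
# which is the source of the sqrt(p) scalability noted above.
from sklearn.ensemble import RandomForestClassifier

p = 500_000                                  # e.g., a GWAS-scale variable count
rf = RandomForestClassifier(n_estimators=1000,
                            max_features="sqrt")  # ~707 variables per split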