Modern biomedical data mining requires feature selection methods that can (1) be applied to large-scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We conduct a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
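For readers unfamiliar with the Relief family, the core idea can be illustrated with a minimal sketch of the original two-class Relief update. This is not the ReBATE implementation and not MultiSURF. The function name `relief_scores`, the Manhattan distance, and the requirement that features are numeric and pre-scaled to [0, 1] are assumptions made for illustration only.

```python
import random


def relief_scores(X, y, n_iter=None, seed=0):
    """Sketch of original two-class Relief (hypothetical helper, not ReBATE).

    X: list of feature vectors with values assumed to lie in [0, 1].
    y: list of class labels; each class needs at least two instances.
    Returns one weight per feature; higher suggests more relevance.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = n if n_iter is None else n_iter
    weights = [0.0] * p

    def distance(a, b):
        # Manhattan distance is an assumption chosen for simplicity.
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(m):
        i = rng.randrange(n)
        same = [k for k in range(n) if k != i and y[k] == y[i]]
        other = [k for k in range(n) if y[k] != y[i]]
        hit = min(same, key=lambda k: distance(X[i], X[k]))    # nearest hit
        miss = min(other, key=lambda k: distance(X[i], X[k]))  # nearest miss
        for a in range(p):
            # Penalize features that differ from the nearest same-class
            # neighbor; reward features that differ from the nearest
            # other-class neighbor.
            weights[a] -= abs(X[i][a] - X[hit][a]) / m
            weights[a] += abs(X[i][a] - X[miss][a]) / m
    return weights
```

Because the score of each feature depends on neighbors chosen using all features at once, this style of update can respond to interactions that a purely univariate filter would miss, which is the property the abstract emphasizes.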
Background: A combination of biomarkers in a multivariate model may predict disease with greater accuracy than a single biomarker employed alone. We developed a non-linear method of multivariate analysis, weighted digital analysis (WDA), and evaluated its ability to predict lung cancer employing volatile biomarkers in the breath.
We sought biomarkers of breast cancer in the breath because the disease is accompanied by increased oxidative stress and induction of cytochrome P450 enzymes, both of which generate volatile organic compounds (VOCs) that are excreted in breath. We analyzed breath VOCs in 54 women with biopsy-proven breast cancer and 204 cancer-free controls, using gas chromatography/mass spectroscopy. Chromatograms were converted into a series of data points by segmenting them into 900 time slices (8 s duration, 4 s overlap) and determining their alveolar gradients (abundance in breath minus abundance in ambient room air). Monte Carlo simulations identified time slices with better than random accuracy as biomarkers of breast cancer by excluding random identifiers. Patients were randomly allocated to training sets or test sets in 2:1 data splits. In the training sets, time slices were ranked according to their C-statistic values (area under the receiver operating characteristic curve), and the top ten time slices were combined in multivariate algorithms that were cross-validated in the test sets. Monte Carlo simulations identified an excess of correct over random time slices, consistent with non-random biomarkers of breast cancer in the breath. The outcomes of ten random data splits (mean (standard deviation)) in the training sets were sensitivity = 78.5% (6.14), specificity = 88.3% (5.47), C-statistic = 0.89 (0.03) and in the test sets, sensitivity = 75.3% (7.22), specificity = 84.8% (9.97), C-statistic = 0.83 (0.06). A breath test identified women with breast cancer, employing a combination of volatile biomarkers in a multivariate algorithm.
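The preprocessing and ranking steps described above can be sketched roughly as follows. This is an illustrative outline, not the authors' code: all function names are hypothetical, the slice start times assume that slices begin every 4 s (8 s duration minus 4 s overlap), and the C-statistic is computed by direct pairwise comparison. The multivariate combination of the top ten slices and the Monte Carlo step are not shown.

```python
import random


def slice_bounds(k, duration=8, overlap=4):
    """Start and end time (s) of time slice k, assuming a fixed stride."""
    start = k * (duration - overlap)
    return start, start + duration


def alveolar_gradient(breath_abundance, room_air_abundance):
    """Abundance in breath minus abundance in ambient room air."""
    return breath_abundance - room_air_abundance


def split_two_to_one(n_subjects, seed=0):
    """Randomly allocate subject indices to training and test sets (2:1)."""
    idx = list(range(n_subjects))
    random.Random(seed).shuffle(idx)
    cut = (2 * n_subjects) // 3
    return idx[:cut], idx[cut:]


def c_statistic(cases, controls):
    """Area under the ROC curve: the fraction of case/control pairs in
    which the case value is higher, with ties counted as one half."""
    wins = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def rank_time_slices(rows, labels, top_n=10):
    """Rank slices by C-statistic in the training set.

    rows: one list of alveolar gradients per subject (one value per slice).
    labels: 1 for cancer, 0 for control.
    Returns the indices of the top_n slices, best first.
    """
    scored = []
    for j in range(len(rows[0])):
        cases = [r[j] for r, lab in zip(rows, labels) if lab == 1]
        controls = [r[j] for r, lab in zip(rows, labels) if lab == 0]
        scored.append((c_statistic(cases, controls), j))
    # Highest C-statistic first; ties broken by slice index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:top_n]]
```

In use, `rank_time_slices` would be called only on the training subjects returned by `split_two_to_one`, so that the test subjects play no part in selecting the slices.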