2013
DOI: 10.1186/1471-2164-14-s1-s14

Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis

Abstract: One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that may be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods developed to prevent overfitting and improve prediction performance…
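To make the abstract's setting concrete (many more features than samples, with the Lasso's L1 penalty driving most coefficients exactly to zero), here is a minimal pure-Python sketch of the Lasso fitted by cyclic coordinate descent on synthetic data. This is an illustrative sketch, not the paper's implementation; the function names, data, and the penalty level `alpha` are our own choices.

```python
import random

def soft_threshold(rho, lam):
    # Proximal operator of the L1 penalty: shrinks toward zero,
    # and clips to exactly zero inside [-lam, lam].
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, alpha, n_iter=200):
    """Lasso via cyclic coordinate descent.
    Minimises (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                pred = sum(X[i][k] * w[k] for k in range(p))
                # Partial residual: add back feature j's own contribution.
                r_ij = y[i] - pred + X[i][j] * w[j]
                rho += X[i][j] * r_ij
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

random.seed(0)
n, p = 10, 30  # deliberately more features than samples
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Only features 0 and 1 carry signal; the other 28 are pure noise.
y = [3 * X[i][0] - 2 * X[i][1] + random.gauss(0, 0.1) for i in range(n)]
w = lasso_cd(X, y, alpha=0.2)
selected = [j for j, wj in enumerate(w) if abs(wj) > 1e-8]
print(selected)
```

The L1 penalty keeps the two informative features while zeroing out most of the noise columns, which is exactly the sparsity property the paper builds on.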

Cited by 26 publications (21 citation statements) | References 14 publications
“…While the C50 package uses a heuristic approach to select the best set of features, its default arguments do not result in optimal performance when too many features are provided. The solutions include: 1) using a Bayesian network to determine the relationships of the modules with each other and with the type of hematological malignancy (Additional file 1: Note S3) [31], 2) using a feature scoring algorithm such as FeaLect [80], and 3) adjusting the C50 parameters, for example, enforcing the number of samples in each node to be at least 10%. The first and the third solutions are implemented in the Pigengene package through the bnNum argument of the one.step.pigengene() function and the minPerLeaf argument of the make.decision.tree() function, respectively.…”
Section: Methods
confidence: 99%
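The third workaround quoted above — requiring each node to hold at least 10% of the samples — can be sketched outside of C50/Pigengene as a minimum-leaf-size constraint on split search. This is a hypothetical single-feature analogue for illustration, not the R packages' actual code; `best_split` and `min_frac` are our own names.

```python
import math

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(xs, ys, min_frac=0.10):
    """Best threshold on one feature, rejecting any split that
    would leave a leaf with fewer than min_frac of the samples."""
    n = len(xs)
    min_leaf = max(1, math.ceil(min_frac * n))
    order = sorted(range(n), key=lambda i: xs[i])
    best = None  # (weighted impurity, threshold)
    # Only consider cuts that leave >= min_leaf samples on each side.
    for cut in range(min_leaf, n - min_leaf + 1):
        left = [ys[order[i]] for i in range(cut)]
        right = [ys[order[i]] for i in range(cut, n)]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        thr = (xs[order[cut - 1]] + xs[order[cut]]) / 2
        if best is None or score < best[0]:
            best = (score, thr)
    return best

# 20 samples, perfectly separable at x = 9.5; with min_frac = 0.10
# no leaf may hold fewer than 2 of the 20 samples.
xs = list(range(10)) + list(range(10, 19)) + [100]
ys = [0] * 10 + [1] * 10
best = best_split(xs, ys, min_frac=0.10)
print(best)
```

Restricting the candidate cut positions up front is what prevents the tree from carving out tiny, overfit leaves — the same intent as `minPerLeaf` in the quoted excerpt.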
“…In 2013, the FeaLect algorithm, an improvement over the Bolasso algorithm, was developed based on the combinatorial analysis of regression coefficients estimated using LARS [20]. FeaLect considers the full regularization path and computes the feature importance using a combinatorial scoring method, as opposed to simply taking the intersection as Bolasso does.…”
Section: Wrapper Methods
confidence: 99%
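The contrast drawn in this excerpt — Bolasso's hard intersection versus a FeaLect-style soft score — can be sketched with hypothetical selection sets from a few bootstrap Lasso fits. The sets below are made up for illustration, and the scoring shown (raw selection frequency) is a simplification of FeaLect's combinatorial score.

```python
from collections import Counter

# Hypothetical outcomes of 5 bootstrap Lasso fits: each set lists
# the features with nonzero coefficients in that run.
runs = [
    {0, 1, 4},
    {0, 1},
    {0, 1, 7},
    {0, 4},
    {0, 1, 9},
]

# Bolasso-style: keep only features selected in *every* run.
bolasso = set.intersection(*runs)

# FeaLect-style (simplified): score each feature by how often it is
# selected, then rank — a soft score instead of a hard cut.
score = Counter(f for run in runs for f in run)
ranking = sorted(score, key=lambda f: -score[f])
print(bolasso, ranking)
```

Feature 1 is selected in 4 of 5 runs, so the strict intersection discards it while the frequency score still ranks it near the top — the failure mode of hard intersection that motivates scoring over the full set of runs.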
“…In this paper, we review some feature selection techniques applied to the stress hotspot prediction problem in hexagonal close packed materials, and compare them with respect to future data prediction. We focus on two commonly used techniques from each method: (1) Filter Methods: Correlation based feature selection (CFS) [18] and Pearson Correlation [19]; (2) Wrapper Methods: FeaLect [20] and Recursive feature elimination (RFE) [13]; and (3) Embedded Methods: Random Forest Permutation accuracy importance (RF-PAI) [21] and Least Absolute Shrinkage and Selection Operator (LASSO) [22]. The main contribution of this article is to raise awareness in the materials data science community about how different feature selection techniques can lead to misguided model interpretations and how to avoid them.…”
confidence: 99%
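As a minimal example of the filter-method family this excerpt lists, the following sketch ranks features by absolute Pearson correlation with the target and keeps the top k. The data and function names are illustrative, not from the cited paper.

```python
import math

def pearson(x, y):
    # Sample Pearson correlation coefficient of two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_filter(X, y, k):
    """Rank features by |Pearson r| with the target; keep the top k."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    r = [abs(pearson(col, y)) for col in cols]
    return sorted(range(p), key=lambda j: -r[j])[:k]

# Toy data: feature 0 tracks y, feature 1 is anti-correlated with y,
# feature 2 is essentially noise.
X = [[1, 5, 2], [2, 4, 9], [3, 3, 1], [4, 2, 8], [5, 1, 3]]
y = [2, 4, 6, 8, 10]
top2 = correlation_filter(X, y, k=2)
print(top2)
```

Filters like this score each feature independently of the model, which makes them cheap but blind to feature interactions — one reason the excerpt compares them against wrapper and embedded methods.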
“…, clustering [28], probability binning [29], combinatorial gating [7], [10], and cluster matching [30], [31]); 3) supervised analysis ( e.g. , single-variate [32]–[34] and multi-variate models [35]); 4) characterization [8], [36] and visualization [16], [37], [38]. Not only do these components vary in their individual performance across different biological applications, their interactions with each other in a large data analysis pipeline also further complicate the choice of appropriate methods [39]. The establishment of objective benchmarks like the one reported here enables further analysis in which different components are examined subject to a wide range of technical and biological variations to identify the combination with optimal performance.…”
Section: Discussion
confidence: 99%