2013
DOI: 10.1186/1471-2164-14-s1-s14

Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis

Abstract: One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that may be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods developed to prevent overfitting and improve prediction performance…
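To make the abstract's setting concrete (many more features than samples, with the Lasso's L1 penalty driving most coefficients exactly to zero), here is a minimal pure-Python sketch of the Lasso fitted by cyclic coordinate descent on synthetic data. This is an illustrative sketch, not the paper's implementation; the function names, data, and the penalty level `alpha` are our own choices.

```python
import random

def soft_threshold(rho, lam):
    # Proximal operator of the L1 penalty: shrinks toward zero,
    # and clips to exactly zero inside [-lam, lam].
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, alpha, n_iter=200):
    """Lasso via cyclic coordinate descent.
    Minimises (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                pred = sum(X[i][k] * w[k] for k in range(p))
                # Partial residual: add back feature j's own contribution.
                r_ij = y[i] - pred + X[i][j] * w[j]
                rho += X[i][j] * r_ij
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

random.seed(0)
n, p = 10, 30  # deliberately more features than samples
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Only features 0 and 1 carry signal; the other 28 are pure noise.
y = [3 * X[i][0] - 2 * X[i][1] + random.gauss(0, 0.1) for i in range(n)]
w = lasso_cd(X, y, alpha=0.2)
selected = [j for j, wj in enumerate(w) if abs(wj) > 1e-8]
print(selected)
```

The L1 penalty keeps the two informative features while zeroing out most of the noise columns, which is exactly the sparsity property the paper builds on.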

Cited by 26 publications (21 citation statements) | References 14 publications
“…While the C50 package uses a heuristic approach to select the best set of features, its default arguments do not result in optimal performance when too many features are provided. The solutions include: 1) using a Bayesian network to determine the relationships of the modules with each other and with the type of hematological malignancy (Additional file 1: Note S3) [31], 2) using a feature scoring algorithm such as FeaLect [80], and 3) adjusting the C50 parameters, for example, enforcing the number of samples in each node to be at least 10%. The first and the third solutions are implemented in the Pigengene package through the bnNum argument of the one.step.pigengene() function and the minPerLeaf argument of the make.decision.tree() function, respectively.…”
Section: Methods
confidence: 99%
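The third workaround quoted above — requiring each node to hold at least 10% of the samples — can be sketched outside of C50/Pigengene as a minimum-leaf-size constraint on split search. This is a hypothetical single-feature analogue for illustration, not the R packages' actual code; `best_split` and `min_frac` are our own names.

```python
import math

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(xs, ys, min_frac=0.10):
    """Best threshold on one feature, rejecting any split that
    would leave a leaf with fewer than min_frac of the samples."""
    n = len(xs)
    min_leaf = max(1, math.ceil(min_frac * n))
    order = sorted(range(n), key=lambda i: xs[i])
    best = None  # (weighted impurity, threshold)
    # Only consider cuts that leave >= min_leaf samples on each side.
    for cut in range(min_leaf, n - min_leaf + 1):
        left = [ys[order[i]] for i in range(cut)]
        right = [ys[order[i]] for i in range(cut, n)]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        thr = (xs[order[cut - 1]] + xs[order[cut]]) / 2
        if best is None or score < best[0]:
            best = (score, thr)
    return best

# 20 samples, perfectly separable at x = 9.5; with min_frac = 0.10
# no leaf may hold fewer than 2 of the 20 samples.
xs = list(range(10)) + list(range(10, 19)) + [100]
ys = [0] * 10 + [1] * 10
best = best_split(xs, ys, min_frac=0.10)
print(best)
```

Restricting the candidate cut positions up front is what prevents the tree from carving out tiny, overfit leaves — the same intent as `minPerLeaf` in the quoted excerpt.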
“…In 2013, the FeaLect algorithm, an improvement over the Bolasso algorithm, was developed based on the combinatorial analysis of regression coefficients estimated using LARS [20]. FeaLect considers the full regularization path and computes the feature importance using a combinatorial scoring method, as opposed to simply taking the intersection as Bolasso does.…”
Section: Wrapper Methods
confidence: 99%
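The contrast drawn in this excerpt — Bolasso's hard intersection versus a FeaLect-style soft score — can be sketched with hypothetical selection sets from a few bootstrap Lasso fits. The sets below are made up for illustration, and the scoring shown (raw selection frequency) is a simplification of FeaLect's combinatorial score.

```python
from collections import Counter

# Hypothetical outcomes of 5 bootstrap Lasso fits: each set lists
# the features with nonzero coefficients in that run.
runs = [
    {0, 1, 4},
    {0, 1},
    {0, 1, 7},
    {0, 4},
    {0, 1, 9},
]

# Bolasso-style: keep only features selected in *every* run.
bolasso = set.intersection(*runs)

# FeaLect-style (simplified): score each feature by how often it is
# selected, then rank — a soft score instead of a hard cut.
score = Counter(f for run in runs for f in run)
ranking = sorted(score, key=lambda f: -score[f])
print(bolasso, ranking)
```

Feature 1 is selected in 4 of 5 runs, so the strict intersection discards it while the frequency score still ranks it near the top — the failure mode of hard intersection that motivates scoring over the full set of runs.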
“…In this paper, we review some feature selection techniques applied to the stress hotspot prediction problem in hexagonal close packed materials, and compare them with respect to future data prediction. We focus on two commonly used techniques from each method: (1) Filter Methods: Correlation based feature selection (CFS) [18] and Pearson Correlation [19]; (2) Wrapper Methods: FeaLect [20] and Recursive feature elimination (RFE) [13]; and (3) Embedded Methods: Random Forest Permutation accuracy importance (RF-PAI) [21] and Least Absolute Shrinkage and Selection Operator (LASSO) [22]. The main contribution of this article is to raise awareness in the materials data science community about how different feature selection techniques can lead to misguided model interpretations and how to avoid them.…”
confidence: 99%
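As a minimal example of the filter-method family this excerpt lists, the following sketch ranks features by absolute Pearson correlation with the target and keeps the top k. The data and function names are illustrative, not from the cited paper.

```python
import math

def pearson(x, y):
    # Sample Pearson correlation coefficient of two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_filter(X, y, k):
    """Rank features by |Pearson r| with the target; keep the top k."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    r = [abs(pearson(col, y)) for col in cols]
    return sorted(range(p), key=lambda j: -r[j])[:k]

# Toy data: feature 0 tracks y, feature 1 is anti-correlated with y,
# feature 2 is essentially noise.
X = [[1, 5, 2], [2, 4, 9], [3, 3, 1], [4, 2, 8], [5, 1, 3]]
y = [2, 4, 6, 8, 10]
top2 = correlation_filter(X, y, k=2)
print(top2)
```

Filters like this score each feature independently of the model, which makes them cheap but blind to feature interactions — one reason the excerpt compares them against wrapper and embedded methods.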
“…, clustering [28], probability binning [29], combinatorial gating [7], [10], and cluster matching [30], [31]); 3) supervised analysis ( e.g. , single-variate [32]–[34] and multi-variate models [35]); 4) characterization [8], [36] and visualization [16], [37], [38]. Not only do these components vary in their individual performance across different biological applications, their interactions with each other in a large data analysis pipeline also further complicate the choice of appropriate methods [39]. The establishment of objective benchmarks like the one reported here enables further analysis in which different components are examined subject to a wide range of technical and biological variations to identify the combination with optimal performance.…”
Section: Discussion
confidence: 99%