2012
DOI: 10.1093/bib/bbs034

Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Abstract: In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g.…

Cited by 346 publications
(292 citation statements)
References 106 publications
“…31 In our analysis, we considered an OTU to be highly predictive if its importance score was at least 0.001. 32 ( Figure S2). …”
Section: Results (mentioning)
confidence: 99%
“…The RF algorithm builds thousands of decision trees with bootstrapped positive and negative samples and randomly selected characteristics in the input feature matrix (Breiman, 2001). This strategy can robustly reduce the influence from noise (the mislabeled positive or negative samples) and outliers (extremely high or low feature values) (Touw et al, 2013). The feature matrix submitted to the RF classifier included 12 characteristics of absolute expression values of a gene at six time points in control and stress situations, 12 characteristics of within-condition expression variations of a gene measured as z-scores at six time points in control and stress situations, six characteristics of between-condition expression changes of a gene measured as fold changes at six time points involving stress versus the control, and two characteristics of the coefficient of variation (CV) in stress and control situations (see Methods).…”
Section: ML-based Preselection of "Informative" Genes for GCN Constru… (mentioning)
confidence: 99%
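
The mechanism described in the statement above (many trees, each trained on a bootstrap sample with a random subset of features, combined by voting) can be illustrated with a toy sketch. This is not the cited authors' pipeline: it assumes one-split "stumps" in place of full decision trees, and the function names, the tiny two-feature data set, and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Return the (feature, threshold, left_label, right_label) split with
    the fewest training errors among the candidate features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            # Each side predicts its majority class; an empty side falls
            # back to the majority class of the whole sample.
            left_label = Counter(left or labels).most_common(1)[0][0]
            right_label = Counter(right or labels).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the rows and a
    random subset of n_features columns."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps in the forest."""
    votes = Counter(left if row[f] <= t else right
                    for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 0 has low values, class 1 has high values.
rows = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Because each stump sees a different resample and a different feature, a single odd bootstrap sample affects only one vote, which is the intuition behind the robustness to noise and outliers that the statement attributes to RF.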
“…To address the above-mentioned problems, we used the RF classifier [55]. RF is a non-parametric ensemble learning classifier [55], successfully implemented in different application domains, including remote sensing [56][57][58] and data mining in life sciences [59]. For a detailed evaluation of the effectiveness of the RF classifier in the remote sensing domain, the readers might refer to [60].…”
Section: Feature Selection-rejecting Irrelevant Features and Ranking (mentioning)
confidence: 99%