2020
DOI: 10.1093/bib/bbz158

Toward a gold standard for benchmarking gene set enrichment analysis

Abstract: Motivation: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. Results: We develop an extensible framework for reproducibl…

Cited by 105 publications (113 citation statements)
References 74 publications
“…We demonstrate the accuracy, ease of use, and power of SIMON on five different biomedical datasets and build predictive models for arboviral infection severity (SISA), 42 the identification of the cellular immune signature associated with a high-level of physical activity (Cyclists), 43 the determination of the humoral responses that mediate protection against Salmonella Typhi infection (VAST), 44 early stage detection of colorectal cancer from microbiome data (Zeller), 45, 46 and the detection of liver hepatocellular carcinoma cells (LIHC) 47 (Figure 1B–1E; Supplemental Information, Videos S1 and S6). To build models using the SISA dataset containing clinical parameters (described in the Experimental Procedures and available as Table S2), 11 ML algorithms were used, 5 from the original publication 42 (treebag, k nearest neighbors, random forest, stochastic generalized boosting model, and neural network) and, in addition, “sda,” shrinkage discriminant analysis; “hdda,” high-dimensional discriminant analysis; “svmLinear2,” support vector machine with linear kernel; “pcaNNet,” neural networks with feature extraction; “LogitBoost,” boosted logistic regression, and naive Bayes.…”
Section: Results (mentioning)
confidence: 99%
“…In addition to demonstrating the challenges of current approaches to hit selection, these studies also provide datasets that can be used to test whether alternative hit selection methods can improve enrichment and error correction. Various attempts have been made at developing benchmarking or synthetic datasets to evaluate the accuracy, sensitivity, and specificity of different hit selection approaches (Geistlinger et al, 2020; Mathur et al, 2018; Nguyen et al, 2019; Roder et al, 2019). The identification of a gold standard dataset by which different prioritization methods can be compared remains one of the critical challenges in bioinformatic analysis of high-throughput data (Khatri et al, 2012; Mathur et al, 2018; Mitrea et al, 2013).…”
Section: Results (mentioning)
confidence: 99%
“…The gene set enrichment analysis highlighted cytokine, interleukin, and toll-like receptor signalling pathways that are involved in regulating various aspects of innate and adaptive immune responses (90). Such results may be compromised when other pathway databases such as KEGG (91) and WikiPathways (92) are employed, as the relevant pathways and molecular interactions in the pathways are different from Reactome (93,94). To circumvent this issue, one possibility could be to extract the overlapped networks between the different databases of pathways; however, this is not always possible due to the differences in annotation of genes and…”
Section: Discussion (mentioning)
confidence: 99%