Motivation: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. Results: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. Availability: http://bioconductor.org/packages/GSEABenchmarkeR Contact: ludwig.geistlinger@sph.cuny.edu
Background: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. However, such evaluation is typically incomplete and biased towards the goals of individual investigations. Results: We present a general framework for standardized and structured benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization, and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 different human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods on the benchmark compendium, identifying significant differences in (i) runtime and applicability to RNA-seq data, (ii) fraction of enriched gene sets depending on the type of null hypothesis tested, and (iii) recovery of the a priori defined relevance rankings. Based on these findings, we make practical recommendations on (i) how methods originally developed for microarray data can efficiently be applied to RNA-seq data, (ii) how to interpret results depending on the type of gene set test conducted, and (iii) which methods are best suited to effectively prioritize gene sets with high relevance for the phenotype investigated. Conclusion: We carried out a systematic assessment of existing enrichment methods and identified the best-performing methods as well as general shortcomings in how gene set analysis is currently conducted.
We provide a directly executable benchmark system for straightforward assessment of additional enrichment methods. Availability: http://bioconductor.org/packages/GSEABenchmarkeR
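To make two of the evaluation criteria named above more concrete (the fraction of enriched gene sets and the recovery of a predefined relevance ranking), the following is a minimal Python sketch. It is not the GSEABenchmarkeR implementation, and it does not use that package's API. The function names `fraction_enriched` and `relevance_recovery`, the significance threshold of 0.05, and the rank-weighted scoring scheme are all assumptions chosen for illustration; the scoring actually used by the framework may differ.

```python
from typing import Dict, List


def fraction_enriched(p_values: Dict[str, float], alpha: float = 0.05) -> float:
    """Fraction of tested gene sets whose p-value falls below alpha.

    `alpha` is an illustrative threshold, not one prescribed by the source.
    """
    if not p_values:
        return 0.0
    hits = sum(1 for p in p_values.values() if p < alpha)
    return hits / len(p_values)


def relevance_recovery(ranking: List[str], relevance: Dict[str, float]) -> float:
    """Hypothetical score for how well a method's gene set ranking recovers
    a precompiled relevance ranking.

    Each gene set's relevance is weighted by its relative position in the
    method's ranking (top of the list weighs most), and the sum is divided
    by the value obtained from the ideal ordering of the same gene sets.
    Gene sets without a relevance entry are assumed to have relevance 0.
    """
    n = len(ranking)
    if n == 0:
        return 0.0

    def weighted(order: List[str]) -> float:
        return sum(relevance.get(gs, 0.0) * (1 - i / n) for i, gs in enumerate(order))

    ideal_order = sorted(ranking, key=lambda gs: relevance.get(gs, 0.0), reverse=True)
    ideal = weighted(ideal_order)
    return weighted(ranking) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy inputs invented for this example; not taken from the compendium.
    toy_p_values = {"set_a": 0.01, "set_b": 0.20, "set_c": 0.04, "set_d": 0.50}
    toy_relevance = {"set_a": 3.0, "set_b": 1.0, "set_c": 0.0}
    toy_ranking = sorted(toy_p_values, key=toy_p_values.get)

    print(fraction_enriched(toy_p_values))
    print(relevance_recovery(toy_ranking, toy_relevance))
```

Under this scheme a score of 1 means the method ranked gene sets exactly in order of their precompiled relevance, and lower values mean relevant gene sets were placed further down the list.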