Statistical concerns about the GSEA procedure

Damian, Doris; Gorfine, Malka

doi:10.1038/ng0704-663a

Cited by 111 publications

(87 citation statements)

References 1 publication


“…At the same time, the lack of enough experimental repeats can lead to the failure of sample permutation. Moreover, the cumulative value of statistic from the ranked gene list can cause false positives with those gene sets having large size [13].…”

Section: Introduction (mentioning)

confidence: 99%

Notice of Retraction: Finding Significant Gene Sets with Weighted Distribution of Gene Expression

Wang

2011

2011 5th International Conference on Bioinformatics and Biomedical Engineering


Gene set analysis shows great advantages in finding significant gene categories whose genes are involved in related biological processes or share similar functions. Available tools for gene set analysis are limited in the analysis of microarray experiments with few repeats and also tend to generate false positives for gene sets containing a large number of genes. We present a new method named SGS for finding significant gene sets in which genes are differentially expressed. The methodology is based on the view that genes that are more differentially expressed play more important roles in the gene expression profile. Therefore, a weighted distribution of gene expression is included to calculate the extent of up-regulation and down-regulation of the gene set. Two kinds of cutoffs are introduced to determine the gene sets that are both biologically reasonable and statistically significant. Our method can effectively decrease the false positive predictions caused by large gene set size. To suit the analysis of microarray data with various experimental designs, including few repeats or multiple conditions, three models are proposed in SGS. Gene expression data from microarray experiments on type II diabetes were analyzed to test the performance of SGS. In a comparison with GSEA, one of the most widely used gene set analysis tools, SGS finds more gene sets related to oxidative phosphorylation and the ribosome, and excludes gene sets unrelated to these two categories. The assessment indicates that the new tool performs with higher accuracy and a lower false positive rate.


“…There are a few remarks to be made. Most of these have to do with the competitive nature of the competitive null, which pits each gene set against its complement in what Allison et al (2006) called a 'zero-sum game' (see also Damian and Gorfine, 2004).…”

Section: Competitive Versus Self-contained Tests (mentioning)

confidence: 99%

Analyzing gene expression data in terms of gene sets: methodological issues

2007


Motivation: Many statistical tests have been proposed in recent years for analyzing gene expression data in terms of gene sets, usually from Gene Ontology. These methods are based on widely different methodological assumptions. Some approaches test differential expression of each gene set against differential expression of the rest of the genes, whereas others test each gene set on its own. Also, some methods are based on a model in which the genes are the sampling units, whereas others treat the subjects as the sampling units. This article aims to clarify the assumptions behind different approaches and to indicate a preferential methodology of gene set testing. Results: We identify some crucial assumptions which are needed by the majority of methods. P-values derived from methods that use a model which takes the genes as the sampling unit are easily misinterpreted, as they are based on a statistical model that does not resemble the biological experiment actually performed. Furthermore, because these models are based on a crucial and unrealistic independence assumption between genes, the P-values derived from such methods can be wildly anti-conservative, as a simulation experiment shows. We also argue that methods that competitively test each gene set against the rest of the genes create an unnecessary rift between single gene testing and gene set testing. Contact: j.j.goeman@lumc.nl


“…By considering the distribution of the gene ranks belonging to each gene set over the entire list, this method is a clear improvement over previous ones. However, the effect of the gene-set size and the influence of other gene sets not under consideration can be counterintuitive in some instances (14). Its normalization and permutation procedures also may lead to inaccurate assessment of statistical significance.…”

mentioning

confidence: 99%

Discovering statistically significant pathways in expression profiling studies

Tian, Greenberg, Kong, et al.

2005

Proc. Natl. Acad. Sci. U.S.A.



Accurate and rapid identification of perturbed pathways through the analysis of genome-wide expression profiles facilitates the generation of biological hypotheses. We propose a statistical framework for determining whether a specified group of genes for a pathway has a coordinated association with a phenotype of interest. Several issues on proper hypothesis-testing procedures are clarified. In particular, it is shown that the differences in the correlation structure of each set of genes can lead to a biased comparison among gene sets unless a normalization procedure is applied. We propose statistical tests for two important but different aspects of association for each group of genes. This approach has more statistical power than currently available methods and can result in the discovery of statistically significant pathways that are not detected by other methods. This method is applied to data sets involving diabetes, inflammatory myopathies, and Alzheimer's disease, using gene sets we compiled from various public databases. In the case of inflammatory myopathies, we have correctly identified the known cytotoxic T lymphocyte-mediated autoimmunity in inclusion body myositis. Furthermore, we predicted the presence of dendritic cells in inclusion body myositis and of an IFN-α/β response in dermatomyositis, neither of which was previously described. These predictions have been subsequently corroborated by immunohistochemistry.

Keywords: microarrays | gene ontology | normalization | correlated data | inflammatory myopathies
