Large sets of candidate genes derived from high-throughput biological experiments can be characterized by functional enrichment analysis. The analysis compares the functions annotated to a gene set of interest against those of a background gene set; functions associated with significantly more genes in the set than expected from the background are taken to be relevant. Web tools offering disease enrichment analysis on gene sets are often based on gene-disease associations from manually curated or experimental data, which are accurate but do not cover all diseases discussed in the literature. Using associations automatically derived from literature data could be a cost-effective way to improve disease coverage for enrichment analysis at comparable levels of accuracy.
We have implemented a method named Gene set to Diseases (GS2D) as a web tool that performs disease enrichment analysis on sets of human protein-coding genes. It uses an automatically built dataset of more than 63 thousand gene-disease associations, defined as statistically significant co-occurrences of genes and diseases in annotations of biomedical citations from PubMed. The dataset covers more diseases for enrichment analysis than the largest comparable curated database, the Comparative Toxicogenomics Database, and its performance compared favourably to similar approaches based on manually curated or experimental data. Graphical and programmatic interfaces are available at http://cbdm.uni-mainz.de/geneset2diseases.
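To make the idea of enrichment analysis concrete, the following minimal sketch tests whether a disease is associated with more genes in a query set than expected from a background set. It assumes a one-sided hypergeometric test, which is a common choice for this kind of analysis but is not stated here to be the test used by GS2D. The function names, gene identifiers, and disease labels are hypothetical and serve only as illustration.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the annotation."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def enrich(gene_set, background, annotations):
    """Return a p-value per term for over-representation in gene_set.

    annotations maps each term (here, a disease) to the set of genes
    associated with it. Genes outside the background are ignored.
    """
    background = set(background)
    gene_set = set(gene_set) & background
    p_values = {}
    for term, genes in annotations.items():
        annotated = set(genes) & background      # annotated genes in the background
        hits = len(annotated & gene_set)         # annotated genes in the query set
        p_values[term] = hypergeom_tail(
            hits, len(background), len(annotated), len(gene_set)
        )
    return p_values


if __name__ == "__main__":
    # Toy data (hypothetical): 10 background genes, two disease terms.
    background = {f"g{i}" for i in range(10)}
    annotations = {
        "disease_A": {"g0", "g1", "g2"},
        "disease_B": {"g7", "g8"},
    }
    for term, p in sorted(enrich({"g0", "g1", "g2"}, background, annotations).items()):
        print(term, p)
```

In practice many diseases are tested at once, so the resulting p-values would normally be corrected for multiple testing before any disease is reported as enriched.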