The EGAN software is a functional implementation of a simple yet powerful paradigm for the exploration of large empirical data sets downstream from computational analysis. By focusing on systems-level analysis via enrichment statistics, EGAN enables a human domain expert to transform high-throughput analysis results into hypergraph visualizations: concept maps that leverage the expert's semantic understanding of metadata and relationships to produce insight.

Keywords: visualization, enrichment, metadata, big data, organic intelligence, data integration, multivariate statistics, cloud computing
BACKGROUND

Sets

Humans organize things in their environment into semantically meaningful sets. Natural language is a great example: an adjective is an annotation label that can be associated with one or more nouns; every noun X associated with adjective Y is an element of set Y. Nouns can also be sets themselves: the phrase "X is a Z" can be transformed into the logical concept "noun X is an element of set Z". These natural language principles reflect an aspect of human cognition that has persisted across millennia. In today's computational age, this process of entity-to-set association has exploded into a universe of data.

Consider a social network where entities are people in the network. Potential person-sets could be: hometown, current location, alma mater, current employer, first name, last name, movies or other media people like, product advertisements people have clicked on, games people play, posts people have commented on, hashtags people have used, and social contacts of one or more people, just to scratch the surface. The more broadly one expands the definition of person-sets, the richer the data describing each person-entity. This same concept applies to genomic research, where tens of thousands of genes have been annotated with tens of thousands of Gene Ontology terms hundreds of thousands of times [1]; to media libraries, where media items can be grouped and categorized (e.g. this paper has metadata tags as well as n-grams); to retail products; to companies listed on a stock exchange; to fantasy football results; etc.

The actual data warehouses that store all this information may be arranged into loose, almost unstructured schemata or into complex thousand-table relational database systems. The paradigm explored in this paper transforms all of these models into a simple schema: 1) there are entities that are the focus of domain-specific research (e.g. people, genes, media items), 2) there are potential network connections between those entities (e.g. personal relationships, protein-protein interactions, nearest-neighbor media, hyperlinks), and 3) there are sets of entities, partitioned into set-categories (e.g. San Francisco, California as a set of people-entities is in the location set-category, and UCSF as a set of people-entities is in the alma mater set-category; there may also exist a different set UCSF in the employer set-category).

This schema is essentially a simple form of topic map [2], where entities in this paper are e...
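The three-part schema described above can be sketched as a minimal data model. This is an illustrative sketch with hypothetical class and method names and toy social-network data, not EGAN's actual implementation:

```python
from collections import defaultdict


class EntitySetModel:
    """Simple schema: entities, pairwise connections between entities,
    and named sets of entities partitioned into set-categories."""

    def __init__(self):
        self.entities = set()
        self.connections = set()       # frozenset pairs of entities
        self.sets = defaultdict(dict)  # category -> set name -> member entities

    def add_entity(self, entity):
        self.entities.add(entity)

    def connect(self, a, b):
        # Undirected connection, e.g. a personal relationship or
        # a protein-protein interaction.
        self.connections.add(frozenset((a, b)))

    def annotate(self, category, set_name, entity):
        # The same set name may denote different sets in different
        # categories (e.g. "UCSF" as alma mater vs. "UCSF" as employer).
        self.sets[category].setdefault(set_name, set()).add(entity)

    def members(self, category, set_name):
        return self.sets[category].get(set_name, set())


# Toy social-network example
model = EntitySetModel()
for person in ("alice", "bob", "carol"):
    model.add_entity(person)
model.connect("alice", "bob")
model.annotate("alma mater", "UCSF", "alice")
model.annotate("employer", "UCSF", "bob")

# "UCSF" in the alma-mater category is a distinct set from
# "UCSF" in the employer category:
print(model.members("alma mater", "UCSF"))  # {'alice'}
print(model.members("employer", "UCSF"))    # {'bob'}
```

Keying sets by (category, name) rather than by name alone is what allows the two distinct UCSF sets from the example in the text to coexist.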