GO2Vec: transforming GO terms and proteins to vector representations via graph embeddings

Zhong, Xiaoshi; Kaalia, Rama; Rajapakse, Jagath C.

doi:10.1186/s12864-019-6272-2

Cited by 24 publications

(23 citation statements)

References 34 publications


“…This tool enables the comparison of new GO-based semantic similarity measures against previously published ones considering their relation to sequence, Pfam (34) and Enzyme Commission (EC) (35) number similarity. CESSM was released in 2009 and updated in 2014, and since then it has been widely used by the community, being adopted to evaluate over 25 novel semantic similarity measures developed through different methods, with more recent ones focusing on common information content (IC)-based metrics (36) but also based on vector representations/graph embeddings (37). CESSM was built as a web-based tool to support the automatic comparison against the benchmark data.…”

Section: Related Work

mentioning

confidence: 99%

A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain

Cardoso

Sousa

Köhler

et al. 2020

Database


The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein–protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein–protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.


“…The edge prediction task is applied to the PPI prediction to find new protein interaction relationships. They also provide a basis for calculating protein similarity based on GO, such as GO2vec (Zhong et al, 2019), which used the Node2vec algorithm to compute the functional similarity between proteins.…”

Section: Introduction

mentioning

confidence: 99%

Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding

Zhang

Wang

et al. 2021

Front. Genet.


The study of protein–protein interaction and the determination of protein functions are important parts of proteomics. Computational methods are used to study the similarity between proteins based on Gene Ontology (GO) to explore their functions and possible interactions. GO is a series of standardized terms that describe gene products from molecular functions, biological processes, and cell components. Previous studies on assessing the similarity of GO terms were primarily based on Information Content (IC) between GO terms to measure the similarity of proteins. However, these methods tend to ignore the structural information between GO terms. Therefore, considering the structural information of GO terms, we systematically analyze the performance of the GO graph and GO Annotation (GOA) graph in calculating the similarity of proteins using different graph embedding methods. When applied to the actual Human and Yeast datasets, the feature vectors of GO terms and proteins are learned based on different graph embedding methods. To measure the similarity of the proteins annotated by different GO numbers, we used Dynamic Time Warping (DTW) and cosine to calculate protein similarity in GO graph and GOA graph, respectively. Link prediction experiments were then performed to evaluate the reliability of protein similarity networks constructed by different methods. It is shown that graph embedding methods have obvious advantages over the traditional IC-based methods. We found that random walk graph embedding methods, in particular, showed excellent performance in calculating the similarity of proteins. By comparing link prediction experiment results from GO(DTW) and GOA(cosine) methods, it is shown that GO(DTW) features provide highly effective information for analyzing the similarity among proteins.
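The abstract above mentions cosine as one way to compare learned protein vectors. A minimal sketch of that comparison follows, assuming each protein already has a fixed-length embedding; the protein names and vector values are made up for illustration and do not come from the paper, and the zero-vector convention is a choice made here.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional protein embeddings (illustrative values only).
embeddings = {
    "P1": [1.0, 0.0, 1.0],
    "P2": [1.0, 0.0, 1.0],
    "P3": [0.0, 1.0, 0.0],
}

same = cosine_similarity(embeddings["P1"], embeddings["P2"])
orthogonal = cosine_similarity(embeddings["P1"], embeddings["P3"])
```

Identical vectors score close to 1 and orthogonal vectors score 0, so the score can be used to rank candidate protein pairs.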


“…Recently, several researchers have proposed word embeddings (e.g., word2vec [34] and GloVe [35]), which have been developed in the area of natural language processing, to learn vector representations of GO terms and proteins and then used learned vectors for the PPI prediction [36][37][38][39]. These methods mainly use the word2vec model [34] to learn vectors for each word from the corpus derived from descriptive axioms of GO terms and proteins; the descriptive axiom of a GO term is its textual description, for example, the descriptive axiom of the GO term "GO:0036388" is "pre-replicative complex assembly."…”

mentioning

confidence: 99%

“…Finally, the vectors of proteins are used to predict the protein interactions. We have earlier proposed GO2Vec [39] that convert the GO graph into a vector space to represent genes for predicting their similarity.…”

mentioning

confidence: 99%

“…Extending our previous work [39,40], in this paper, we propose to derive graph embeddings to transform GO annotation (GOA) graph into their vector representations in order to predict missing and spurious PPI. Specifically, using GOA, our method first combines term-term relations between GO terms and term-protein annotations between GO terms and proteins, and then constructs an undirected and unweighted graph; this constructed graph is called the GOA graph.…”

mentioning

confidence: 99%


Graph embeddings on gene ontology annotations for protein–protein interaction prediction

Zhong

Rajapakse

2020

BMC Bioinformatics

Self Cite


Background Protein–protein interaction (PPI) prediction is an important task towards the understanding of many bioinformatics functions and applications, such as predicting protein functions, gene-disease associations and disease-drug associations. However, many previous PPI prediction researches do not consider missing and spurious interactions inherent in PPI networks. To address these two issues, we define two corresponding tasks, namely missing PPI prediction and spurious PPI prediction, and propose a method that employs graph embeddings that learn vector representations from constructed Gene Ontology Annotation (GOA) graphs and then use embedded vectors to achieve the two tasks. Our method leverages on information from both term–term relations among GO terms and term-protein annotations between GO terms and proteins, and preserves properties of both local and global structural information of the GO annotation graph. Results We compare our method with those methods that are based on information content (IC) and one method that is based on word embeddings, with experiments on three PPI datasets from STRING database. Experimental results demonstrate that our method is more effective than those compared methods. Conclusion Our experimental results demonstrate the effectiveness of using graph embeddings to learn vector representations from undirected GOA graphs for our defined missing and spurious PPI tasks.
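The abstract above describes building an undirected GOA graph from term–term relations and term–protein annotations. A minimal sketch of that construction follows, assuming both inputs arrive as plain pairs of identifiers; the function name `build_goa_graph` and the placeholder identifiers are hypothetical, and this is not the authors' implementation.

```python
from collections import defaultdict


def build_goa_graph(term_term_relations, term_protein_annotations):
    """Merge both edge lists into one undirected, unweighted adjacency map."""
    graph = defaultdict(set)
    for a, b in list(term_term_relations) + list(term_protein_annotations):
        if a == b:
            continue  # skip self-loops
        # Store each edge in both directions so the graph is undirected.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Placeholder identifiers for illustration only; not real GO terms or proteins.
term_term = [("GO:A", "GO:B"), ("GO:B", "GO:C")]
annotations = [("GO:A", "P1"), ("GO:C", "P2"), ("GO:A", "P1")]

goa_graph = build_goa_graph(term_term, annotations)
```

Using sets for neighbours collapses repeated annotations into a single edge, which keeps the graph unweighted. A graph embedding method would then take this adjacency map as input to learn one vector per node.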

