High-quality gene/disease embedding in a multi-relational heterogeneous graph after a joint matrix/tensor decomposition

Zhou, Kaiyin; Zhang, Sheng; Wang, Yuxing; Cohen, K. Bretonnel; Kim, Jin-Dong; Luo, Qi; Yao, Xinzhi; Zhou, Xingyu; Xia, Jingbo

doi:10.1016/j.jbi.2021.103973

Cited by 7 publications

(4 citation statements)

References 34 publications

“…Notably, this area also features a different school of approaches that involve node embedding in heterogeneous networks spanning genes, diseases, chemicals, etc. Notable among these is the work by Zhou et al (2022), which provides ∼11.2 two million putative disease–gene connections. A similar work by Yang et al (2019) used disease-related (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Literature mining discerns latent disease–gene relationships

Rai, Jain, Kumar et al. 2024, Bioinformatics

Motivation: Dysregulation of a gene’s function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily due to genetic pleiotropy and lack of suitable computational approaches. With the advent of high throughput genomics platforms and community scale initiatives such as the Human Cell Landscape (HCL) project (Han et al., 2020), researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our finger-tip when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates.

Results: To circumvent this, we mined ∼18 million PubMed abstracts published till May 2019 and automatically selected ∼4.5 million of them that describe roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained Bidirectional Encoder Representations from Transformers (BERT) for language modeling from the domain of Natural Language Processing (NLP) to learn vector representation of entities such as genes, diseases, tissues, cell-types etc., in a way such that their relationship is preserved in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in-silico synthesis of hypotheses linking different biological entities such as genes and conditions.

Availability and Implementation:
PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model
BioSentVec based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model
Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap

Supplementary information: Supplementary data are available at Bioinformatics online.

“…Node2vec [10] also converts nodes into embedding vectors, demonstrating that it is better to learn continuous feature vectors rather than constant feature vectors. Similarly, in bioinformatics, gene embedding has been used to represent genes [11–16]. Most of these methods have been developed for entity relationship prediction; e.g., protein–protein, protein–drug, drug–disease, and drug-side-effect interactions [14–16].…”

Section: Introduction (mentioning)

confidence: 99%

“…Similarly, in bioinformatics, gene embedding has been used to represent genes [11–16]. Most of these methods have been developed for entity relationship prediction; e.g., protein–protein, protein–drug, drug–disease, and drug-side-effect interactions [14–16]. Because these entity relationship prediction tasks do not predict sample-specific information (e.g., drug response and survival time prediction), sample-specific datasets such as gene expression data were not used yet.…”

Section: Introduction (mentioning)

confidence: 99%

Molecular data representation based on gene embeddings for cancer drug response prediction

Park, Lee 2023, Sci Rep

Cancer drug response prediction is a crucial task in precision medicine, but existing models have limitations in effectively representing molecular profiles of cancer cells. Specifically, when these models represent molecular omics data such as gene expression, they employ a one-hot encoding-based approach, where a fixed gene set is selected for all samples and omics data values are assigned to specific positions in a vector. However, this approach restricts the utilization of embedding-vector-based methods, such as attention-based models, and limits the flexibility of gene selection. To address these issues, our study proposes gene embedding-based fully connected neural networks (GEN) that utilizes gene embedding vectors as input data for cancer drug response prediction. The GEN allows for the use of embedding-vector-based architectures and different gene sets for each sample, providing enhanced flexibility. To validate the efficacy of GEN, we conducted experiments on three cancer drug response datasets. Our results demonstrate that GEN outperforms other recently developed methods in cancer drug prediction tasks and offers improved gene representation capabilities. All source codes are available at https://github.com/DMCB-GIST/GEN/.

“…Heterogeneous data interaction was also modelled by Huang et al [19] for predicting transcription factor interactions with their target genes. Wang [20] and Zhou [21] instead modelled gene-disease interactions through a heterogeneous-network model. Other recent problems solved through heterogeneous graph-based models include Gene Ontology Representation Learning [22], Gene prioritization for rare diseases [23] and drug repurposing.…”

Section: Introduction (mentioning)

confidence: 99%

DeepReGraph co-clusters temporal gene expression and cis-regulatory elements through heterogeneous graph representation learning

et al. 2022

This work presents DeepReGraph, a novel method for co-clustering genes and cis-regulatory elements (CREs) into candidate regulatory networks. Gene expression data, as well as data from three CRE activity markers from a publicly available dataset of mouse fetal heart tissue, were used for DeepReGraph concept proofing. In this study we used open chromatin accessibility from ATAC-seq experiments, as well as H3K27ac and H3K27me3 histone marks as CREs activity markers. However, this method can be executed with other sets of markers. We modelled all data sources as a heterogeneous graph and adapted a state-of-the-art representation learning algorithm to produce a low-dimensional and easy-to-cluster embedding of genes and CREs. Deep graph auto-encoders and an adaptive-sparsity generative model are the algorithmic core of DeepReGraph. The main contribution of our work is the design of proper combination rules for the heterogeneous gene expression and CRE activity data and the computational encoding of well-known gene expression regulatory mechanisms into a suitable objective function for graph embedding. We showed that the co-clusters of genes and CREs in the final embedding shed light on developmental regulatory mechanisms in mouse fetal-heart tissue. Such clustering could not be achieved by using only gene expression data. Function enrichment analysis proves that the genes in the co-clusters are involved in distinct biological processes. The enriched transcription factor binding sites in CREs prioritize the candidate transcript factors which drive the temporal changes in gene expression. Consequently, we conclude that DeepReGraph could foster hypothesis-driven tissue development research from high-throughput expression and epigenomic data. Full source code and data are available on the DeepReGraph GitHub project.
