2021
DOI: 10.1038/s42256-020-00291-x
Improving representations of genomic sequence motifs in convolutional networks with exponential activations

Abstract: Deep convolutional neural networks (CNNs) trained on regulatory genomic sequences tend to build representations in a distributed manner, making it a challenge to extract learned features that are biologically meaningful, such as sequence motifs. Here we perform a comprehensive analysis on synthetic sequences to investigate the role that CNN activations play in model interpretability. We show that employing an exponential activation in first-layer filters consistently leads to interpretable and robust representations […]
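The abstract's key idea lends itself to a short illustration. Below is a minimal sketch, assuming a standard motif-scanning CNN over one-hot DNA (the MotifCNN name, filter count, filter width, and pooling choices are illustrative assumptions, not the authors' exact architecture): the only change from a conventional design is that the first-layer pre-activations pass through exp instead of ReLU.

```python
import torch
import torch.nn as nn

class MotifCNN(nn.Module):
    """Sketch of a genomics CNN whose first layer uses an exponential
    activation, the modification the paper studies. All architecture
    details here are assumptions for illustration."""

    def __init__(self, num_filters=32, filter_size=19, seq_len=200):
        super().__init__()
        # One-hot DNA input: 4 channels (A, C, G, T).
        self.conv1 = nn.Conv1d(4, num_filters, kernel_size=filter_size,
                               padding="same")
        self.pool = nn.MaxPool1d(seq_len)    # global max pool per filter
        self.fc = nn.Linear(num_filters, 1)  # single binary output

    def forward(self, x):
        # exp flattens weak (background) responses toward zero and
        # amplifies strong matches, encouraging each first-layer filter
        # to represent a whole motif rather than a distributed part.
        z = torch.exp(self.conv1(x))
        z = self.pool(z).squeeze(-1)
        return torch.sigmoid(self.fc(z))

# Toy forward pass on stand-in data (not real genomic sequences):
x = torch.randn(8, 4, 200).softmax(dim=1)  # simplex per position, one-hot-like
print(MotifCNN()(x).shape)  # torch.Size([8, 1])
```

One practical caveat: because exp grows quickly, training can be less stable than with ReLU, so weight initialization and learning-rate choices become more delicate.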

Cited by 63 publications (37 citation statements: 1 supporting, 36 mentioning, 0 contrasting) | References 44 publications
“…Another potential trend is building DNNs using biophysical (Tareen and Kinney, 2019) or physicochemical properties (Yang et al, 2017;Liu et al, 2020), as deep models trained on these features might uncover novel patterns in data and lead to improved understanding of the physicochemical principles of protein-nucleic acid regulatory interactions, as well as aid model interpretability. Other novel approaches include: 1) modifying DNN properties to improve recovery of biologically meaningful motif representations (Koo and Ploenzke, 2021), 2) transformer networks (Devlin et al, 2018) and attention mechanisms (Vaswani et al, 2017), widely used in protein sequence modeling (Jurtz et al, 2017;Rao et al, 2019;Vig et al, 2020;Repecka et al, 2021), 3) graph convolutional neural networks, a class of DNNs that can work directly on graphs and take advantage of their structural information, with the potential to give us great insights if we can reframe genomics problems as graphs (Cranmer et al, 2020;Strokach et al, 2020), and 4) generative modeling (Foster, 2019), which may help exploit current knowledge in designing synthetic sequences with desired properties (Killoran et al, 2017;Wang Y. et al, 2020). With the latter, unsupervised training is used with approaches including: 1) autoencoders, which learn efficient representations of the training data, typically for dimensionality reduction (Way and Greene, 2018) or feature selection (Xie et al, 2017), 2) generative adversarial networks, which learn to generate new data with the same statistics as the training set (Wang Y. et al, 2020;Repecka et al, 2021), and 3) deep belief networks, which learn to probabilistically reconstruct their inputs, acting as feature detectors, and can be further trained with supervision to build efficient classifiers (Bu et al, 2017).…”
Section: Advantages (mentioning)
confidence: 99%
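As one concrete example of the unsupervised approaches enumerated in the statement above, here is a minimal autoencoder sketch for dimensionality reduction of one-hot sequences; all layer sizes and names are illustrative assumptions rather than any cited paper's model.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """Autoencoder over flattened one-hot sequences; the encoder output z
    is the learned low-dimensional representation. Sizes are illustrative."""

    def __init__(self, seq_len=200, alphabet=4, latent_dim=32):
        super().__init__()
        d = seq_len * alphabet
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x):                   # x: (N, alphabet, seq_len)
        z = self.encoder(x)                 # compressed embedding
        recon = self.decoder(z).view_as(x)  # reconstruct the input
        return recon, z

x = torch.rand(16, 4, 200)                 # stand-in batch, not real data
recon, z = SeqAutoencoder()(x)
loss = nn.functional.mse_loss(recon, x)    # typical reconstruction objective
```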
“…Inferring promoter motifs from convolutional kernels. We inferred promoter motifs learned by each trained model by examining the 256 kernels in the first convolutional layer, which capture such information [23]. For each kernel x, denoted by Conv1d_x, we generated a feature map F_x of dimension N × 5 × 1000 as the output of processing all N unique one-hot-encoded 1000-bp promoter sequences (P_1, …”
Section: Interpreting the Convolutional Kernels (mentioning)
confidence: 99%
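The statement above maps directly onto a first-layer forward pass. The sketch below assumes a trained torch.nn.Conv1d first layer and (N, 5, 1000) one-hot promoters, matching the stated dimensions; the alignment-and-average step for turning a kernel's strongest activations into a motif matrix follows common practice and is an assumption, not that paper's exact procedure.

```python
import torch

def first_layer_feature_maps(conv1: torch.nn.Conv1d,
                             sequences: torch.Tensor) -> torch.Tensor:
    """sequences: (N, 5, 1000) one-hot promoters (alphabet of 5, e.g. ACGT+N,
    per the stated dimensions). Returns (N, 256, L_out) activations; the
    feature map for kernel x is out[:, x, :]."""
    with torch.no_grad():
        return conv1(sequences)

def kernel_pfm(feature_maps, sequences, x, filter_size, threshold=0.5):
    """Assumed follow-up step (standard practice, cf. ref. 23): align the
    subsequences that activate kernel x above a fraction of its maximum
    response and average their one-hot columns into a position frequency
    matrix. Assumes positive activations so the threshold is meaningful."""
    fmap = feature_maps[:, x, :]                      # (N, L_out)
    hits = (fmap > threshold * fmap.max()).nonzero()  # rows of (seq_idx, pos)
    windows = [sequences[i, :, j:j + filter_size]
               for i, j in hits.tolist()
               if j + filter_size <= sequences.shape[-1]]
    return torch.stack(windows).mean(dim=0)           # (5, filter_size)
```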
“…To gain insights into what DNN-based methods have learned, DLPRB visualizes filter representations while cDeepbind employs in silico mutagenesis. Filter representations are sensitive to network design choices [29,30]; ResidualBind is not designed with the intention of learning interpretable filters. Hence, we opted to employ in silico mutagenesis, which systematically probes the effect size that each possible single nucleotide mutation in a given sequence has on model predictions.…”
Section: Going Beyond In Silico Mutagenesis With GIA (mentioning)
confidence: 99%
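For reference, in silico mutagenesis as described above reduces to an exhaustive substitution loop. This is a generic sketch under assumed shapes (an (A, L) one-hot sequence and a model returning one scalar per sequence), not cDeepbind's or ResidualBind's actual code.

```python
import torch

def in_silico_mutagenesis(model, seq_onehot):
    """seq_onehot: (A, L) one-hot tensor; model maps a (1, A, L) batch to a
    scalar prediction. Returns an (A, L) matrix whose entry [b, p] is the
    change in prediction when position p is substituted with base b."""
    A, L = seq_onehot.shape
    effects = torch.zeros(A, L)
    with torch.no_grad():
        ref = model(seq_onehot.unsqueeze(0)).item()  # wild-type prediction
        for pos in range(L):
            for base in range(A):
                if seq_onehot[base, pos] == 1:       # skip the reference base
                    continue
                mutant = seq_onehot.clone()
                mutant[:, pos] = 0
                mutant[base, pos] = 1                # one substitution at a time
                effects[base, pos] = model(mutant.unsqueeze(0)).item() - ref
    return effects
```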
“…For RBPs, this has been accomplished by visualizing first convolutional layer filters and via attribution methods [13,18,23,24]. First layer filters have been shown to capture motif-like representations, but their efficacy depends highly on choice of model architecture [29], activation function [30], and training procedure [31]. First-order attribution methods, including in silico mutagenesis [13,32] and other gradient-based methods [19, 33–36], are interpretability methods that identify the independent importance of single nucleotide variants in a given sequence toward model predictions, not the effect size of extended patterns such as sequence motifs.…”
Section: Introduction (mentioning)
confidence: 99%
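For contrast with the mutagenesis loop sketched earlier, a first-order gradient-based attribution needs only one backward pass per sequence. The sketch below implements gradient × input, one common member of the family the statement cites; the model interface and shapes are assumptions.

```python
import torch

def gradient_x_input(model, seq_onehot):
    """seq_onehot: (A, L) one-hot tensor; model maps a (1, A, L) batch to a
    scalar. Returns per-position importance scores of length L: the gradient
    at each observed nucleotide, a first-order estimate of its effect size."""
    x = seq_onehot.unsqueeze(0).clone().requires_grad_(True)
    model(x).sum().backward()   # one backward pass scores every position
    return (x.grad * x.detach()).squeeze(0).sum(dim=0)  # (L,)
```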