Cross-species regulatory sequence activity prediction

Kelley, David R.

doi:10.1101/660563

Cited by 19 publications

(38 citation statements)

References 42 publications

“…The value of deep learning models for prioritizing rare pathogenic variants has been questioned in a recent analysis focusing on Human Gene Mutation Database (HGMD) variants 40, meriting further investigation. Second, our analyses of allelic-effect annotations are restricted to unsigned analyses, but signed analyses have also proven valuable in linking deep learning annotations to molecular traits and complex disease 16,41,42; however, genome-wide signed relationships are unlikely to hold for the regulatory marks (DNase and histone marks) that we focus on here, which do not correspond to specific genes or pathways. Third, we focused here on deep learning models trained to predict specific regulatory marks, but deep learning models have also been used to predict a broader set of regulatory features, including gene expression levels and cryptic splicing 15,16,39, that may be informative for complex disease.…”

Section: Discussion (mentioning)

confidence: 99%

“…Third, we focused here on deep learning models trained to predict specific regulatory marks, but deep learning models have also been used to predict a broader set of regulatory features, including gene expression levels and cryptic splicing 15,16,39, that may be informative for complex disease. We have also not considered the application of deep learning models to TFBS, CAGE and ATAC-seq data 16,42, which is a promising future research direction. Fourth, we focused here on deep learning models trained using human data, but models trained using data from other species may also be informative for human disease 43,42.…”

Section: Discussion (mentioning)

confidence: 99%

“…We have also not considered the application of deep learning models to TFBS, CAGE and ATAC-seq data 16,42, which is a promising future research direction. Fourth, we focused here on deep learning models trained using human data, but models trained using data from other species may also be informative for human disease 43,42. Fifth, the forward stepwise elimination procedure that we use to identify jointly significant annotations 19 is a heuristic procedure whose choice of prioritized annotations may be close to arbitrary in the case of highly correlated annotations; nonetheless, our framework does impose rigorous criteria for conditional informativeness.…”

Section: Discussion (mentioning)

confidence: 99%

Evaluating the informativeness of deep learning annotations for human complex diseases

Dey, Geijn, Kim, et al. 2019

Preprint

Self Cite

Deep learning models have shown great promise in predicting genome-wide regulatory effects from DNA sequence, but their informativeness for human complex diseases and traits is not fully understood. Here, we evaluate the disease informativeness of two types of deep learning annotations: (1) variant-level annotations (based on the reference allele), assessing whether they are more informative for complex disease than the underlying experimental data used to train the predictive models; and (2) allelic-effect annotations (absolute value of the predicted difference between reference and variant alleles), which have been a major focus of recent work. In each case, we primarily consider annotations constructed using two previously trained deep learning models, DeepSEA and Basenji. We apply stratified LD score regression (S-LDSC) to 41 independent diseases and complex traits (average N = 320K) to evaluate each annotation's informativeness for disease heritability conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model and other sources; as a secondary metric, we also evaluate the accuracy of models that incorporate deep learning annotations in predicting disease-associated or fine-mapped SNPs. We aggregated annotations across all tissues (resp. blood cell types or brain tissues) in meta-analyses across all 41 traits (resp. 11 blood-related traits or 8 brain-related traits). Variant-level annotations, despite being highly enriched for disease heritability, produced no conditionally significant results in meta-analyses across all 41 traits or 11 blood-related traits, but brain-specific DeepSEA-H3K4me3 and Basenji-H3K27ac annotations were conditionally significant in meta-analyses across 8 brain-related traits; a sequence motif analysis suggests that these annotations could be capturing unique information about nucleosome occupancy. Allelic-effect annotations were also highly enriched for disease heritability, and produced conditionally significant results for Basenji-H3K4me3 in meta-analyses across all 41 traits and brain-specific Basenji-H3K4me3 in meta-analyses across 8 brain-related traits. We conclude that deep learning models are informative for disease, but their informativeness cannot be inferred from metrics based on their accuracy in predicting regulatory annotations.

“…Second, we applied a 'head' that transforms the 1D representations to 2D for Hi-C prediction. We implemented the model using the Basenji software 16,17, which is written in Tensorflow 40 and Keras 41.…”

Section: Model Architecture (mentioning)

confidence: 99%

“…The Akita architecture consists of a 'trunk' based on the Basenji 16,17 architecture to obtain 1D representations of genomic sequence, followed by a 'head' to transform to 2D maps of genome folding (Fig. 1a, Methods).…”

mentioning

confidence: 99%

Predicting 3D genome folding from DNA sequence

Fudenberg, Kelley, Pollard 2019

Preprint

Self Cite

In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Here we present a deep convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of CTCF and reveal a complex grammar underlying genome folding. Akita enables rapid in silico predictions for sequence mutagenesis, genome folding across species, and genetic variants.

Main text

Recent research has advanced our understanding of the proteins driving and the sequences underpinning 3D genome folding in mammalian interphase, including the interplay between CTCF and cohesin 1, and their roles in development and disease 2. Still, while disruptions of single bases can alter genome folding, in other cases genome folding is surprisingly resilient to large-scale deletions and structural variants 3,4. As follows, predicting the consequences of perturbing any individual CTCF site, or other regulatory element, on local genome folding remains a challenge.

Previous machine learning approaches have either: (1) relied on epigenomic information as inputs 5-7, which does not readily allow for predicting effects of DNA variants, or (2) predicted derived features of genome folding (e.g. peaks 8,9), which depend heavily on minor algorithmic differences 10. Making quantitative predictions from sequence poses a substantial challenge: base pair information must be propagated to megabase scales where locus-specific patterns become salient in chromosome contact maps.

Convolutional neural networks (CNNs) have emerged as powerful tools for modelling genomic data as a function of DNA sequence, directly learning DNA sequence features from the data. CNNs now make state-of-the-art predictions for transcription factor binding, DNA accessibility, transcription, and RNA-binding 11-14. DNA sequence features learned by CNNs can be subsequently post-processed into interpretable forms 15. Recently, Basenji 16 demonstrated that CNNs can process very long sequences (~131kb) to learn distal regulatory element influences, suggesting that genome folding could be tractable with CNNs.

Here we present Akita, a deep CNN to transform input DNA sequence into predicted locus-specific genome folding. Akita takes in ~1Mb (2^20 bp) of DNA sequence and predicts contact

Machine learning approaches to identify core and dispensable genes in pangenomes

Yocca, Edger 2021

The Plant Genome

A gene in a given taxonomic group is either present in every individual (core) or absent in at least a single individual (dispensable). Previous pangenomic studies have identified certain functional differences between core and dispensable genes. However, identifying if a gene belongs to the core or dispensable portion of the genome requires the construction of a pangenome, which involves sequencing the genomes of many individuals. Here we aim to leverage the previously characterized core and dispensable gene content for two grass species [Brachypodium distachyon (L.) P. Beauv. and Oryza sativa L.] to construct a machine learning model capable of accurately classifying genes as core or dispensable using only a single annotated reference genome. Such a model may mitigate the need for pangenome construction, an expensive hurdle especially in orphan crops, which often lack the adequate genomic resources.
