In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Here we present a deep convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of CTCF and reveal a complex grammar underlying genome folding. Akita enables rapid in silico predictions for sequence mutagenesis, genome folding across species, and genetic variants.
Main text

Recent research has advanced our understanding of the proteins driving and the sequences underpinning 3D genome folding in mammalian interphase, including the interplay between CTCF and cohesin [1], and their roles in development and disease [2]. Still, while disruptions of single bases can alter genome folding, in other cases genome folding is surprisingly resilient to large-scale deletions and structural variants [3,4]. As such, predicting the consequences of perturbing any individual CTCF site, or other regulatory element, on local genome folding remains a challenge.

Previous machine learning approaches have either: (1) relied on epigenomic information as inputs [5-7], which does not readily allow for predicting effects of DNA variants, or (2) predicted derived features of genome folding (e.g. peaks [8,9]), which depend heavily on minor algorithmic differences [10]. Making quantitative predictions from sequence poses a substantial challenge: base pair information must be propagated to megabase scales, where locus-specific patterns become salient in chromosome contact maps.

Convolutional neural networks (CNNs) have emerged as powerful tools for modelling genomic data as a function of DNA sequence, directly learning DNA sequence features from the data. CNNs now make state-of-the-art predictions for transcription factor binding, DNA accessibility, transcription, and RNA-binding [11-14]. DNA sequence features learned by CNNs can be subsequently post-processed into interpretable forms [15]. Recently, Basenji [16] demonstrated that CNNs can process very long sequences (~131 kb) to learn distal regulatory element influences, suggesting that genome folding could be tractable with CNNs.

Here we present Akita, a deep CNN to transform input DNA sequence into predicted locus-specific genome folding. Akita takes in ~1 Mb (2^20 bp) of DNA sequence and predicts contact