Probing molecular specificity with deep sequencing and biophysically interpretable machine learning

H. Tomas Rube; Chaitanya Rastogi; Siqian Feng; J. F. Kribelbauer; A. Li; Basheer Becerra; L. A. N. Melo; B. V. Do; X. Li; H. H. Adam; N. H. Shah; R. S. Mann; H. J. Bussemaker

doi:10.1101/2021.06.30.450414

Cited by 3 publications

(4 citation statements)

References 75 publications

“…Traditionally, extracting affinities from in vivo assays has focused on discovering short motifs enriched within bound loci and then using these motif-based models to predict DNA binding with somewhat limited success (105)(106)(107). By contrast, modern machine learning methods such as deep neural networks can predict binding with high accuracy but are sometimes dismissed as overparameterized black box models with no way to extract biophysical information (108). To remedy this, some studies have suggested building stereotyped networks with fixed architectures, sacrificing flexibility in modeling and training to obtain nodes and weights that have explicit biophysical interpretations (109,110).…”

Section: Discussion (mentioning)

confidence: 99%

De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding

Alexandari, Horton, Shrikumar, et al. 2023

Preprint

Transcription factors (TFs) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
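The abstract's remark that biophysical models predict occupancy from TF concentration and affinity can be illustrated with the textbook single-site equilibrium binding relation. The sketch below is not the Affinity Distillation method; it assumes one TF binding one site without cooperativity or competition, and the function names, the temperature default, and the example concentrations are hypothetical choices made for illustration.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
R_KCAL = 1.987204e-3


def occupancy(tf_conc, kd):
    """Fraction of time a single site is bound at equilibrium.

    Assumes non-cooperative, single-site binding with tf_conc and kd
    expressed in the same concentration units.
    """
    return tf_conc / (tf_conc + kd)


def ddg(kd_variant, kd_reference, temperature=298.15):
    """Binding free-energy change (kcal/mol) of a variant relative to a reference.

    Positive values mean the variant binds more weakly than the reference.
    """
    return R_KCAL * temperature * math.log(kd_variant / kd_reference)


# Hypothetical numbers: a reference site and a 10-fold weaker variant.
kd_ref, kd_var, tf = 10.0, 100.0, 10.0
print(occupancy(tf, kd_ref), occupancy(tf, kd_var), ddg(kd_var, kd_ref))
```

Under these assumptions a site is half occupied when the TF concentration equals its dissociation constant, and each 10-fold loss of affinity costs roughly 1.4 kcal/mol near room temperature.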

“…In practice, one of two methods is used to overcome the difficulties that gauge freedoms present. One method, called "gauge fixing", removes gauge freedoms by introducing additional constraints on model parameters (2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18).…”

Section: Introduction (mentioning)

confidence: 99%

“…In practice, one of two methods is typically used to overcome the difficulties that such gauge freedoms can present. One method, called "gauge fixing", removes gauge freedoms by introducing additional constraints on model parameters (2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18). Another method limits the mathematical models that one uses to models that do not have any gauge freedoms (19)(20)(21)(22)(23)(24).…”

Section: Introduction (mentioning)

confidence: 99%

Symmetry, gauge freedoms, and the interpretability of sequence-function relationships

Posfai, McCandlish, Kinney 2024

Preprint

Quantitative models of sequence-function relationships, which describe how biological sequences encode functional activities, are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. Here we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when the transformations of model parameters that compensate for these symmetry transformations are described by redundant irreducible matrix representations. Based on this finding, we describe an "embedding distillation" procedure that enables analytic calculation of the dimension of the space of gauge freedoms, as well as efficient computation of a sparse basis for this space. Finally, we show that the ability to interpret model parameters as quantifying allelic effects places strong constraints on the form that models can take, and in particular show that all nontrivial equivariant models of allelic effects must exhibit gauge freedoms. Our work thus advances the understanding of the relationship between symmetries and gauge freedoms in quantitative models of sequence-function relationships.
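A gauge freedom, as defined in the abstract above (a direction in parameter space that does not affect model predictions), can be seen in the simplest additive sequence model. The sketch below is an illustration under stated assumptions, not code from the cited work: the one-hot additive model, the helper names, and the zero-sum convention used for gauge fixing are all choices made here for clarity.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA sequence as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for pos, base in enumerate(seq):
        x[pos, ALPHABET.index(base)] = 1.0
    return x


def predict(theta0, theta, seq):
    """Additive model: a constant term plus one parameter per (position, base)."""
    return theta0 + float(np.sum(theta * one_hot(seq)))


def move_along_gauge(theta0, theta, pos, shift):
    """Add `shift` to every base at one position and subtract it from the constant.

    Each sequence uses exactly one base per position, so predictions are unchanged.
    """
    shifted = theta.copy()
    shifted[pos, :] += shift
    return theta0 - shift, shifted


def fix_zero_sum_gauge(theta0, theta):
    """Gauge fixing by constraint: make each position's parameters sum to zero.

    The zero-sum convention is one common choice, assumed here for illustration.
    """
    row_means = theta.mean(axis=1, keepdims=True)
    return theta0 + float(row_means.sum()), theta - row_means


rng = np.random.default_rng(0)
theta0, theta = 0.5, rng.normal(size=(3, 4))
alt0, alt = move_along_gauge(theta0, theta, pos=1, shift=2.0)

# Different parameters, identical predictions.
print(predict(theta0, theta, "GAT"), predict(alt0, alt, "GAT"))

# After gauge fixing, both parameter sets collapse to the same representative.
fixed0, fixed = fix_zero_sum_gauge(theta0, theta)
fixed_alt0, fixed_alt = fix_zero_sum_gauge(alt0, alt)
print(np.allclose(fixed, fixed_alt), np.isclose(fixed0, fixed_alt0))
```

The point of the design is that individual parameter values are only interpretable once a gauge has been chosen, since infinitely many parameter sets give the same predictions.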

“…Additionally, ML techniques have been employed for protein fitness prediction [3]-[5], which enables the design of proteins with specific functions or properties. They are also used for forecasting protein-ligand binding affinity [6], [7], a critical aspect of drug discovery. Moreover, large language models have been pretrained on extensive protein sequence databases [8], [9], enabling them to capture intricate sequence-structure-function relationships.…”

Section: Introduction (mentioning)

confidence: 99%

Protein Design by Directed Evolution Guided by Large Language Models

Tran, 2023

Preprint

Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e., fitness) by a rigorous and resource-intensive process of screening or selecting among a vast range of mutations. By conducting an in silico screening of sequence properties, machine learning-guided directed evolution (MLDE) can expedite the optimization process and alleviate the experimental workload. In this work, we propose a general MLDE framework in which we apply recent advancements of Deep Learning in protein representation learning and protein property prediction to accelerate the searching and optimization processes. In particular, we introduce an optimization pipeline that utilizes Large Language Models (LLMs) to pinpoint the mutation hotspots in the sequence and then suggest replacements to improve the overall fitness. Our experiments have shown the superior efficiency and efficacy of our proposed framework in the conditional protein generation, in comparison with traditional searching algorithms, diffusion models, and other generative models. We expect this work will shed new light not only on protein engineering but also on solving combinatorial problems using data-driven methods. Our implementation is publicly available at https://github.com/HySonLab/Directed_Evolution.
