Deep Generative Models of Protein Structure Uncover Distant Relationships Across a Continuous Fold Space

Draizen, Eli J.; Veretnik, Stella; Mura, Cameron; Bourne, Philip E.

doi:10.1101/2022.07.29.501943

Cited by 7 publications

(21 citation statements)

References 92 publications


“…We first investigated the SH3-specific (CATH 2.30.30.100), DeepUrfold-derived VAE model. This model was trained using all energy-minimized domain structures from the SH3 superfamily, along with hand-crafted biophysical features, as described in [17]. We first attempted to subject representative SH3 domains through the SH3 model and calculated relevance scores during backpropagation.…”

Section: Results (mentioning)

confidence: 99%

“…In a recent paper that introduced a DeepUrfold framework, the authors developed: (i) a preprocessed dataset, based on CATH superfamilies, that includes biophysical properties for each atom along with energy-minimized domain structures; and (ii) superfamily-specific sparse 3D-CNN VAEs [17]. Energy-minimized domain structures from a single superfamily were voxelized using a kD-tree to map atoms into 1 Å³ voxels in an overall 264³ Å³ discretized volume; 3D structural models were rotated randomly by sampling the SO(3) group to train a VAE model, modified from [19, 20], yielding superfamily-specific models.…”

Section: Methods (mentioning)

confidence: 99%
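
The voxelization and random-rotation steps described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepUrfold code: it swaps the kD-tree lookup for direct floor-based binning, stores occupied voxels as a sparse set of indices, and draws rotations with a uniform random unit quaternion. The function names (`center`, `random_rotation_matrix`, `rotate`, `voxelize`) and the toy coordinates are hypothetical.

```python
import math
import random


def center(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    return [tuple(p[i] - centroid[i] for i in range(3)) for p in coords]


def random_rotation_matrix(rng):
    """Draw a rotation uniformly from SO(3) via a random unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    x = a * math.sin(2 * math.pi * u2)
    y = a * math.cos(2 * math.pi * u2)
    z = b * math.sin(2 * math.pi * u3)
    w = b * math.cos(2 * math.pi * u3)
    # Standard unit-quaternion to rotation-matrix conversion.
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(coords, rot):
    """Apply a 3x3 rotation matrix to every coordinate."""
    return [tuple(sum(rot[i][j] * p[j] for j in range(3)) for i in range(3))
            for p in coords]


def voxelize(coords, box=264, voxel=1.0):
    """Map centered coordinates (in Å) to occupied voxel indices in a box^3 grid.

    Assumption: atoms falling outside the grid are simply dropped.
    """
    half = box * voxel / 2.0
    occupied = set()
    for p in coords:
        idx = tuple(int(math.floor((c + half) / voxel)) for c in p)
        if all(0 <= i < box for i in idx):
            occupied.add(idx)
    return occupied


# Toy example: three made-up atom positions, one random orientation.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.8, 0.0)]
rng = random.Random(0)
voxels = voxelize(rotate(center(atoms), random_rotation_matrix(rng)))
```

Drawing a fresh rotation for each training pass is one simple way to keep a voxel-based model from depending on the arbitrary orientation of the input coordinates.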

“…For a given structure, residues containing any atom lying in the ≥75th percentile were extracted to create a set of 53,480 (dis-)continuous fragments. For each community identified with Stochastic Block Modelling of the bipartite graph formed from the all superfamilies × all domains approach [17], we used the foldseek code [25] to cluster all LRP structures from domains present in each community that were processed through all superfamilies represented by the community (using TM-Align for global alignment). We select the LRP structure cluster representative from the most populated cluster in each community, resulting in a final list of top-20 "potential urfolds".…”

Section: Cross-model Fragment Identification (mentioning)

confidence: 99%
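
The percentile-based residue selection in the statement above might look roughly like the sketch below. It assumes that the percentile is taken over per-atom relevance scores within a single structure and uses a linear-interpolation percentile; neither detail is confirmed by the quoted text. The helper names (`percentile`, `relevant_residues`, `contiguous_runs`) and the toy scores are hypothetical.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def relevant_residues(atom_residue_ids, atom_relevance, q=75.0):
    """Return residues that contain at least one atom at or above the q-th percentile."""
    cutoff = percentile(atom_relevance, q)
    return sorted({res for res, rel in zip(atom_residue_ids, atom_relevance)
                   if rel >= cutoff})


def contiguous_runs(residue_ids):
    """Group sorted residue numbers into (start, end) runs.

    More than one run means the extracted fragment is discontinuous.
    """
    runs = []
    for res in residue_ids:
        if runs and res == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], res)
        else:
            runs.append((res, res))
    return runs


# Toy example: eight atoms spread over four residues, with made-up relevance scores.
residue_ids = [1, 1, 2, 2, 3, 3, 4, 4]
relevance = [0.1, 0.2, 0.1, 0.9, 0.3, 0.2, 0.8, 0.1]
selected = relevant_residues(residue_ids, relevance)
fragment = contiguous_runs(selected)
```

In this toy case residues 2 and 4 are selected, which gives a two-piece, discontinuous fragment.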

“…All of those findings were consistent with the prediction that the SH3 and OB comprise a distinct urfold (in this case, the SBB). This paper explores, and seeks to begin explaining, the models from [17] in more depth, by applying an approach known as layer-wise relevance propagation (LRP). In principle, explainable AI techniques such as LRP can be used to understand which atoms in the input structure are 'important', based on their spatial locations and biophysical properties (and, really, any other sorts of features that one encodes in the model).…”

Section: Introduction (mentioning)

confidence: 99%

“…In a recent study that developed a deep generative approach to protein structural relationships, using the Urfold model of protein structure in a framework called DeepUrfold, 20 superfamily-specific, sparse 3D-CNN variational autoencoders (VAEs) were trained for 20 different, highly-populated CATH superfamilies [17]. These DeepUrfold-trained models were shown to be agnostic to topology, as architecturally-similar SH3/OB proteins with artificially-constructed loop permutations yielded similar evidence lower bound–based (ELBO) scores; most significantly, applying community-detection methods (as stochastic block models) to the patterns of ELBO similarities led to the SH3 and OB domains clustering into similar groupings (with some intermixing).…”

Section: Introduction (mentioning)

confidence: 99%
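
For readers unfamiliar with the ELBO scores mentioned in the statement above, here is a minimal sketch of the quantity for a VAE with a diagonal-Gaussian posterior and a standard-normal prior. The quoted text does not say which likelihood or prior the DeepUrfold models use, so the reconstruction log-likelihood is passed in as a plain number, and the function names (`gaussian_kl`, `elbo`) are hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def elbo(recon_log_likelihood, mu, log_var):
    """Evidence lower bound: reconstruction term minus the KL regularizer.

    Assumption: recon_log_likelihood is a single-sample estimate of
    E_q[log p(x | z)] computed elsewhere by the decoder.
    """
    return recon_log_likelihood - gaussian_kl(mu, log_var)


# Toy example: a 2-D latent code whose posterior mean is shifted by 1 in one dimension.
score = elbo(-10.0, mu=[1.0, 0.0], log_var=[0.0, 0.0])
```

A higher ELBO means that the model reconstructs the input well from a latent code that stays close to the prior. Under that reading, similar scores for a domain across models, or for loop-permuted variants, suggest that a model treats those inputs alike.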


Explainable Deep Generative Models, Ancestral Fragments, and Murky Regions of the Protein Structure Universe

Draizen

Mura

Bourne

2022

Preprint

Self Cite


Modern proteins did not arise abruptly, as singular events, but rather over the course of at least 3.5 billion years of evolution. Can machine learning teach us how this occurred? The molecular evolutionary processes that yielded the intricate three-dimensional (3D) structures of proteins involve duplication, recombination and mutation of genetic elements, corresponding to short peptide fragments. Identifying and elucidating these ancestral fragments is crucial to deciphering the interrelationships amongst proteins, as well as how evolution acts upon protein sequences, structures & functions. Traditionally, structural fragments have been found using sequence-based and 3D structural alignment approaches, but that becomes challenging when proteins have undergone extensive permutations, allowing two proteins to share a common architecture, though their topologies may drastically differ (a phenomenon termed the Urfold). We have designed a new framework to identify compact, potentially-discontinuous peptide fragments by combining (i) deep generative models of protein superfamilies with (ii) layer-wise relevance propagation (LRP) to identify atoms of great relevance in creating an embedding during an all superfamilies by all domains analysis. Our approach recapitulates known relationships amongst the evolutionarily ancient small beta-barrels (e.g. SH3 and OB folds) and amongst P-loop-containing proteins (e.g. Rossmann and P-loop NTPases), previously established via manual analysis. Because of the generality of our deep model's approach, we anticipate that it can enable the discovery of new ancestral peptides.
In a sense, our framework uses LRP as an 'explainable AI' approach, in conjunction with a recent deep generative model of protein structure (termed DeepUrfold), in order to leverage decades' worth of structural biology knowledge to decipher the underlying molecular bases for protein structural relationships, including those which are exceedingly remote, yet discoverable via deep learning.


How AlphaFold2 shaped the structural coverage of the human transmembrane proteome

Jambrich

Tusnady

Dobson

2023

Sci Rep


AlphaFold2 (AF2) provides a 3D structure for every known or predicted protein, opening up new prospects for virtually every field in structural biology. However, working with transmembrane protein molecules poses a notorious challenge for scientists, resulting in a limited number of experimentally determined structures. Consequently, algorithms trained on this finite training set also face difficulties. To address this issue, we recently launched the TmAlphaFold database, where predicted AlphaFold2 structures are embedded into the membrane plane and a quality assessment (plausibility of the membrane-embedded structure) is provided for each prediction using geometrical evaluation. In this paper, we analyze how AF2 has improved the structural coverage of membrane proteins compared to earlier years when only experimental structures were available, and high-throughput structure prediction was greatly limited. We also evaluate how AF2 can be used to search for (distant) homologs in highly diverse protein families. By combining quality assessment and homology search, we can pinpoint protein families where AF2 accuracy is still limited, and experimental structure determination would be desirable.


Prop3D: A Flexible, Python-based Platform for Machine Learning with Protein Structural Properties and Biophysical Data

Draizen

Murillo

Readey

et al.

2022

Preprint

Self Cite


Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be of utility, such datasets must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently far more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing. Here, we report Prop3D, a protein biophysical and evolutionary featurization and data-processing pipeline that we have developed and deployed—both in the cloud and on local HPC resources—in order to systematically and reproducibly create comprehensive datasets, using the Highly Scalable Data Service (HSDS). Prop3D and its associated 'Prop3D-20sf' dataset can be of broader utility, as a community-wide resource, for other structure-related workflows, particularly for tasks that arise at the intersection of deep learning and classical structural bioinformatics.
