Summary: Ever-increasing amounts of protein structure data, combined with advances in machine learning, have led to the rapid proliferation of methods for protein-sequence design. To use a design method effectively, it is important to understand the nuances of its performance and how it varies by design target. Here, we present PDBench, a set of proteins and a number of standard tests for assessing the performance of sequence-design methods. PDBench aims to maximise the structural diversity of the benchmark, compared with previous benchmarking sets, in order to provide useful biological insight into the behaviour of sequence-design methods, which is essential for evaluating their performance and practical utility. We believe that these tools are useful for guiding the development of novel sequence-design algorithms and will enable users to choose a method that best suits their design target.
Availability: https://github.com/wells-wood-research/PDBench
Supplementary information: Supplementary data are available at Bioinformatics online.
Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, forming the basis of high-performance materials, sensing and much more. While an incredible range of functionality has been sampled in nature, it accounts for a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties that we could apply to tackle the environmental and medical challenges facing humanity. This is the purpose of de novo protein design. Sequence design is an important aspect of de novo protein design, and many successful methods for it have been developed. Recently, deep-learning methods that frame sequence design as a classification problem have emerged as a powerful approach. Beyond their reported improvement in performance, their primary advantage over physics-based methods is that the computational burden is shifted from the user to the developers, thereby increasing accessibility to the design method. Despite this trend, the tools for assessment and comparison of such models remain quite generic. The goal of this paper is both to address the timely problem of evaluation and to shine a spotlight, within the machine learning community, on specific assessment criteria that will accelerate impact. We present a carefully curated benchmark set of proteins and propose a number of standard tests to assess the performance of deep-learning-based methods. Our robust benchmark provides biological insight into the behaviour of sequence-design methods, which is essential for evaluating their performance and practical utility. We compare five existing models with two novel models for sequence prediction. Finally, we test the designs produced by these models with AlphaFold2, a state-of-the-art structure-prediction algorithm, to determine whether they are likely to fold into the intended three-dimensional shapes.
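When sequence design is framed as classification, each residue position becomes a prediction over the amino-acid alphabet, so a design can be scored against the native sequence position by position. The following is a minimal sketch of two such scores, overall sequence recovery and macro-averaged recall. It is an illustration only: the function names are hypothetical, and these are not necessarily the tests that PDBench itself implements.

```python
from collections import defaultdict


def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


def macro_recall(native: str, designed: str) -> float:
    """Mean per-residue-type recall, so rare residue types count as much as common ones."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    totals = defaultdict(int)  # occurrences of each residue type in the native sequence
    hits = defaultdict(int)    # correctly recovered occurrences of each residue type
    for n, d in zip(native, designed):
        totals[n] += 1
        if n == d:
            hits[n] += 1
    return sum(hits[aa] / totals[aa] for aa in totals) / len(totals)


if __name__ == "__main__":
    native, designed = "AAAL", "AAAV"  # toy sequences, not real data
    print(sequence_recovery(native, designed))
    print(macro_recall(native, designed))
```

In this toy case the design recovers every alanine but misses the single leucine, so the macro-averaged recall is lower than the overall recovery. This is one reason a per-class score can be more informative than accuracy alone when residue frequencies are unbalanced.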
Background
Proteins are the molecules that perform almost all of the biochemical work in all living things. They have a staggering array of functionality, from incredibly performant materials, like silks and wools, to some of the most efficient catalysts, capable of accelerating complex chemical reactions (Alberts et al. 2002). Beyond their roles in nature, proteins
α-Helical coiled coils are common tertiary and quaternary elements of protein structure. In coiled coils, two or more α helices wrap around each other to form bundles. This apparently simple structural motif can generate many architectures and topologies. Understanding the variety of and limits on coiled-coil assemblies and their sequence-to-structure relationships impacts protein structure, design, and engineering. Coiled coil-forming sequences can be predicted from heptad repeats of hydrophobic and polar residues, hpphppp, although this is not always reliable. Alternatively, coiled-coil structures can be identified using the program SOCKET, which finds knobs-into-holes (KIH) packing between side chains of neighboring helices. SOCKET also classifies coiled-coil architecture and topology, thus allowing sequence-to-structure relationships to be garnered. In 2009, we used SOCKET to create a relational database of coiled-coil structures, CC+, from the RCSB Protein Data Bank (PDB). Here we report an update of CC+ following the recent explosion of structural data and the success of AlphaFold2 in predicting protein structures from genome sequences. With the most-stringent SOCKET parameters, CC+ contains ≈12,000 coiled-coil assemblies from experimentally determined structures, and ≈120,000 potential coiled-coil structures within single-chain models predicted by AlphaFold2 across 48 proteomes. CC+ allows these and other less-stringently defined coiled coils to be searched at various levels of structure, sequence, and side-chain interactions. The identified coiled coils can be viewed directly from CC+ using the Socket2 application, and their associated data can be downloaded for further analyses. CC+ is available freely at http://coiledcoils.chm.bris.ac.uk/CCPlus/Home.html. It will be regularly updated automatically.
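The heptad-repeat idea mentioned above can be sketched in a few lines: reduce a sequence to hydrophobic (h) and polar (p) labels, then ask which of the seven possible registers best matches hpphppp. This is a naive illustration, not SOCKET and not any published coiled-coil predictor. The hydrophobic residue set and the function names are assumptions made for the example, and, as the text notes, pattern matching of this kind is not always reliable.

```python
# Assumed hydrophobic set for illustration; real predictors use richer scoring.
HYDROPHOBIC = set("AILMFVWY")
HEPTAD = "hpphppp"


def to_hp(sequence: str) -> str:
    """Reduce an amino-acid sequence to hydrophobic (h) / polar (p) labels."""
    return "".join("h" if res in HYDROPHOBIC else "p" for res in sequence.upper())


def best_heptad_register(sequence: str) -> tuple[int, float]:
    """Return (offset, score) for the register that best matches hpphppp.

    offset is the heptad position (0-6) assigned to the first residue;
    score is the fraction of residues whose h/p label agrees with the pattern.
    """
    hp = to_hp(sequence)
    if not hp:
        raise ValueError("sequence must be non-empty")
    best_offset, best_score = 0, -1.0
    for offset in range(7):
        matches = sum(
            label == HEPTAD[(i + offset) % 7] for i, label in enumerate(hp)
        )
        score = matches / len(hp)
        if score > best_score:  # ties keep the lowest offset
            best_offset, best_score = offset, score
    return best_offset, best_score


if __name__ == "__main__":
    # Toy idealised repeat, not a sequence taken from CC+.
    print(best_heptad_register("LKKLEEK" * 3))
```

A high score only says the sequence is compatible with a heptad pattern. Confirming an actual coiled coil requires structural evidence such as the knobs-into-holes packing that SOCKET detects.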