Molecular dynamics (MD) simulations are an important tool for studying protein aggregation processes, which play a central role in a number of diseases including Alzheimer's disease. However, MD simulations produce large amounts of data, requiring advanced methods to extract mechanistic insight into the process under study. Transition networks (TNs) provide an elegant method to identify (meta)stable states and the transitions between them from MD simulations. Here, we apply two different methods to generate TNs for protein aggregation: Markov state models (MSMs), which are based on kinetic clustering of the state space, and TNs using conformational clustering. The similarities and differences of both methods are elucidated for the aggregation of the fragment Aβ16–22 of the Alzheimer's amyloid-β peptide. In general, both methods perform excellently in identifying the main aggregation pathways. The strength of MSMs is that they provide a rather coarse and thus simple-to-interpret picture of the aggregation process. Conformation-sorting TNs, on the other hand, outperform MSMs in uncovering mechanistic details. We thus recommend applying both methods to MD data of protein aggregation in order to obtain a complete picture of this process. As part of this work, a Python script called ATRANET for automated TN generation based on a correlation analysis of the descriptors used for conformational sorting is made publicly available.
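The idea of a transition network can be illustrated with a minimal sketch, assuming each MD frame has already been assigned a discrete state label. This is not the ATRANET script or an MSM estimator; the function name `transition_network` and the example trajectory are hypothetical and only show how transitions between consecutive frames can be counted and row-normalized.

```python
from collections import Counter, defaultdict


def transition_network(states):
    """Count transitions between consecutive frames and row-normalize them.

    `states` is a sequence of hashable state labels, one per frame.
    Returns (counts, probs), both keyed by (source, destination).
    """
    # Pair every frame with its successor and count each (src, dst) pair.
    counts = Counter(zip(states[:-1], states[1:]))

    # Total number of transitions leaving each source state.
    totals = defaultdict(int)
    for (src, _dst), n in counts.items():
        totals[src] += n

    # Row-normalized transition frequencies (edge weights of the network).
    probs = {(src, dst): n / totals[src] for (src, dst), n in counts.items()}
    return counts, probs


# Hypothetical per-frame labels, e.g. the size of the largest oligomer.
trajectory = [1, 1, 2, 2, 2, 3, 2, 3, 3]
counts, probs = transition_network(trajectory)

for (src, dst), p in sorted(probs.items()):
    print(f"{src} -> {dst}: count={counts[(src, dst)]}, frequency={p:.2f}")
```

In practice the state labels would come from the clustering step (kinetic or conformational), and the lag between paired frames would be chosen deliberately rather than fixed at one frame.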
Data-driven strategies are gaining increased attention in protein engineering due to recent advances in access to large experimental databanks of proteins, next-generation sequencing (NGS), high-throughput screening (HTS) methods, and the development of artificial intelligence algorithms. However, the reliable prediction of beneficial amino acid substitutions, their combination, and the effect on functional properties remain the most significant challenges in protein engineering, which is applied to develop proteins and enzymes for biocatalysis, biomedicine, and life sciences. Here, we present a general-purpose framework (PyPEF: pythonic protein engineering framework) for performing data-driven protein engineering using machine learning methods combined with techniques from signal processing and statistical physics. PyPEF guides the identification and selection of beneficial proteins of a defined sequence space by systematically or randomly exploring the fitness of variants and by sampling random evolution pathways. The performance of PyPEF was evaluated with respect to its predictive accuracy and throughput on four public protein and enzyme data sets using common regression models. It was shown that the program could efficiently predict the fitness of protein sequences for different target properties (predictive models with coefficient of determination values ranging from 0.58 to 0.92). By combining machine learning and protein evolution, PyPEF enabled the screening of proteins with various functions, reaching a screening capacity of more than 500,000 protein sequence variants within only a few minutes on a personal computer.
PyPEF displayed significant accuracies on four public data sets (different proteins and properties) and underlined the potential of integrating data-driven technologies for covering different philosophies by either predicting the fitness of the variants to the highest accuracy accounting for epistatic effects or capturing the general trend of introduced mutations on the fitness in directed protein evolution campaigns. In essence, PyPEF can provide a powerful solution to current sequence exploration and combinatorial problems faced in protein engineering through exhaustive in silico screening of the sequence space.
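For readers unfamiliar with the coefficient of determination quoted above, the following is a minimal sketch of how R² is commonly computed from measured and predicted fitness values. It does not use PyPEF; the function name `r_squared` and the numbers are hypothetical illustration data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance of the measurements around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted fitness values for four variants.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(measured, predicted)
print(f"R^2 = {score:.2f}")
```

A value of 1 means the predictions match the measurements exactly, while a value of 0 means the model does no better than always predicting the mean.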
Protein disorder and aggregation play significant roles in the pathogenesis of numerous neurodegenerative diseases, such as Alzheimer's and Parkinson's disease. The end products of the aggregation process in these diseases are highly structured amyloid fibrils. In most cases, however, the small, soluble oligomers formed during amyloid aggregation are the toxic species. A full understanding of the physicochemical forces that drive protein aggregation is thus required if one aims for the rational design of drugs targeting the formation of amyloid oligomers. Among a multitude of biophysical and biochemical techniques that are employed for studying protein aggregation, molecular dynamics (MD) simulations at the atomic level provide the highest temporal and spatial resolution of this process, capturing key steps during the formation of amyloid oligomers. Here we provide a step-by-step guide for setting up, running, and analyzing MD simulations of aggregating peptides using GROMACS. For the analysis we provide the scripts that were developed in our lab, which allow one to determine the oligomer size and the inter-peptide contacts that drive the aggregation process. Moreover, we explain and provide the tools to derive Markov state models and transition networks from MD data of peptide aggregation.
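One common way to obtain oligomer sizes from inter-peptide contacts is to treat peptides as nodes and contacts as edges, then take connected components. The sketch below assumes a contact list for a single frame has already been computed; it is not the authors' script, and the function name `oligomer_sizes` and the example contacts are hypothetical.

```python
def oligomer_sizes(n_peptides, contacts):
    """Return oligomer sizes (largest first) for one frame.

    `contacts` is an iterable of (i, j) peptide index pairs that are in
    contact. Peptides connected directly or indirectly form one oligomer.
    """
    parent = list(range(n_peptides))  # union-find forest, one tree per oligomer

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the oligomers of every pair of peptides in contact.
    for a, b in contacts:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Count how many peptides share each root.
    sizes = {}
    for i in range(n_peptides):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Hypothetical frame: six peptides, with a trimer, a dimer, and a monomer.
frame_contacts = [(0, 1), (1, 2), (4, 5)]
sizes = oligomer_sizes(6, frame_contacts)
print(sizes)  # [3, 2, 1]
```

How a contact is defined (for example, a distance cutoff between atoms of two peptides) is a separate modelling choice that this sketch leaves open.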
Protein engineering through directed evolution and (semi)rational approaches has been applied successfully to optimize protein properties for broad applications in molecular biology, biotechnology, and biomedicine. The potential of protein engineering is not yet fully realized due to the limited screening throughput hampering the efficient exploration of the vast protein sequence space. Data-driven strategies have emerged as a powerful tool to leverage protein engineering by providing a model of the sequence-fitness landscape that can exhaustively be explored in silico and capitalize on the high diversity potential offered by nature. However, as both the quality and quantity of the input data determine the success of such approaches, the applicability of data-driven strategies is often limited due to sparse data. Here, we present a hybrid model that combines direct coupling analysis and machine learning techniques to enable data-driven protein engineering when only a few labeled sequences are available. Our method achieves high performance in predicting a protein's fitness based on its sequence regardless of the number of sequence-fitness pairs in the training dataset. Besides reducing the computational effort compared to state-of-the-art methods, it outperforms them for sparse data situations, i.e., 50-250 labeled sequences available for training. In essence, the developed method is promising for data-driven protein engineering, especially for protein engineers who only have access to a limited amount of data for sequence-fitness landscape modeling.