While deep learning methods exist to guide protein optimization, examples of novel proteins generated with these techniques require a priori mutational data. Here we report a 3D convolutional neural network that associates amino acids with neighboring chemical microenvironments at state-of-the-art accuracy. This algorithm enables identification of novel gain-of-function mutations, and subsequent experiments confirm substantive phenotypic improvements in stability-associated phenotypes in vivo across three diverse proteins.
Introduction
Protein engineering is a transformative approach in biotechnology and biomedicine commonly used to alter natural proteins to tolerate non-native environments1, modify substrate specificity2, and improve catalytic activity3. Underpinning these properties is a protein's ability to fold and adopt a stable active configuration. This property is currently engineered either from sequence4 or from energetic simulations5. Deep learning approaches have been reported; however, these models either predict empirically measured stability effects in biased datasets containing only thousands of annotated observations6 or require model training on the target protein7,8. Recently, a 3D-CNN was trained to associate local protein microenvironments with their central amino acid9. Given structural data, this model was able to predict wild-type amino acids at positions where destabilizing mutations had been experimentally introduced. We hypothesized that the converse might also be true: stabilizing, gain-of-function mutations could be introduced at positions where the wild-type residue is disfavored. Here, we use a deep learning algorithm to improve in vivo protein functionality severalfold by introducing mutations that better align proteins with amino acid-structure relationships gleaned from the entirety of the observed proteome.
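The hypothesis above implies a simple selection rule: score every position with the model, flag those where the wild-type residue receives low probability, and propose the model's preferred substitution there. The sketch below illustrates that rule under stated assumptions; the function name, the `(L, 20)` probability array as model output, and the 5% cutoff are all illustrative, not part of the published method.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical residues, alphabetical by one-letter code

def propose_mutations(probs, wt_sequence, wt_cutoff=0.05):
    """Rank candidate gain-of-function substitutions.

    probs: (L, 20) array of per-position amino-acid probabilities,
           assumed to come from a microenvironment classifier.
    wt_sequence: length-L string of wild-type residues.
    Returns (position, wt, proposed, wt_prob, proposed_prob) tuples
    for positions where the wild type is disfavored.
    """
    proposals = []
    for i, wt in enumerate(wt_sequence):
        wt_p = float(probs[i, AMINO_ACIDS.index(wt)])
        best = int(np.argmax(probs[i]))
        if wt_p < wt_cutoff and AMINO_ACIDS[best] != wt:
            proposals.append((i, wt, AMINO_ACIDS[best], wt_p, float(probs[i, best])))
    # Most confidently disfavored wild-type positions first
    proposals.sort(key=lambda p: p[3])
    return proposals
```

In practice each proposal would then be validated experimentally, as the candidate mutations in this study were.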
Results
To generate an algorithm that could identify unfavorable amino acid residues in virtually any protein structure, we trained a model to learn the correct association between an amino acid and its surrounding chemical environment, relying on the wealth of structures in the Protein Data Bank. We began by rebuilding the neural network architecture published by Torng and Altman with minor modifications (Fig. 1a, see Online Methods for details), replicating the reported classification accuracy of 41.2% (Fig. 1b) using the original training and testing sets (32,760 and 1,601 structures, respectively)9. To improve the model's performance, we made several discrete changes toward more explicit biophysical annotations, adding new atomic channels for hydrogen atoms and accounting for the partial charge and solvent accessibility of each atom, which increased accuracy to 43.4% and 52.4%, respectively.
The selection methodology for both protein structures and amino acid residues introduced several biases into the training data. The dataset contained multiple structures of closely related proteins