We present the results for CAPRI Round 50, the fourth joint CASP-CAPRI protein assembly prediction challenge. The Round comprised a total of twelve targets, including six dimers, three trimers, and three higher-order oligomers. Four of these were easy targets, for which good structural templates were available either for the full assembly, or for the main interfaces (of the higher-order oligomers). Eight were
Although data on protein sequence and structure are accumulating rapidly, the number of protein architectural types (or structural folds) is small and limited. This study addresses the following question: how well can one reveal underlying sequence–structure relationships and design protein sequences for an arbitrary, potentially novel, structural fold? In response, we have developed novel deep generative models, namely semisupervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Networks). To overcome training difficulties and improve design quality, we build our models on a conditional Wasserstein GAN (WGAN), which uses the Wasserstein distance in its loss function. Our major contributions include (1) constructing a low-dimensional and generalizable representation of the fold space for the conditional input, (2) developing an ultrafast sequence-to-fold predictor (or oracle) and incorporating its feedback into WGAN as a loss to guide model training, and (3) exploiting sequence data with and without paired structures to enable a semisupervised training strategy. Assessed by the oracle over 100 novel folds not in the training set, gcWGAN generates more successful designs and covers 3.5 times more target folds than a competing data-driven method (cVAE). Assessed by sequence- and structure-based predictors, gcWGAN designs are physically and biologically sound. Assessed by a structure predictor over representative novel folds, including one that is not even among the basis folds, gcWGAN designs have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE designs. The ultrafast data-driven model is further shown to boost the success of a principle-driven de novo method (RosettaDesign) by generating design seeds and tailoring the design space. In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning generalizable principles from current sequence–structure data.
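For orientation, the standard conditional WGAN objective that the abstract builds on can be written as follows. The notation is chosen here for illustration and does not come from the source: G is the generator, D the critic (constrained to be 1-Lipschitz), x a real sequence, y the fold representation used as the conditional input, and z the noise vector. The second line is only a schematic guess at how an oracle term with an assumed weight λ could enter the generator loss; it is not the authors' exact formulation.

$$
\min_{G}\ \max_{\|D\|_{L}\le 1}\ \mathbb{E}_{(x,y)\sim p_{\text{data}}}\big[D(x,y)\big]\;-\;\mathbb{E}_{z\sim p_{z},\,y\sim p_{y}}\big[D(G(z,y),y)\big]
$$

$$
\mathcal{L}_{G}\;=\;-\,\mathbb{E}_{z,\,y}\big[D(G(z,y),y)\big]\;+\;\lambda\,\mathcal{L}_{\text{oracle}}\big(G(z,y),\,y\big)
$$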
Data, source codes, and trained models are available at
We downloaded fold data (including the corresponding sequences and structure domains) from SCOPe (v. 2.07) and filtered sequences at the 100% identity level. Some uncommon sequences (<2%) contain non-standard amino acid letters, including 'b', 'z', 'x' and 'X'. The first three represent ambiguity among ['d', 'n'], ['q', 'e'] and all 20 amino acids, respectively, and 'X' represents a gap in the sequence. For simplicity, we assigned an explicit standard amino acid to each occurrence of 'b', 'z', and 'x' following a uniform prior and discarded sequences containing 'X'. To limit the computational cost under a limited budget, we further filtered the remaining sequences and retained those with lengths between 60 and 160. The length interval was chosen to balance the range (for sequence padding concerns) against the fold space coverage. Specifically, a tight range around the mode of the length distribution would yield sequences of similar lengths and reduce padding needs. Meanwhile, the range needs to be wide enough to cover the sequence and the fold space. In the end, the chosen range represents over 35% of the filtered sequences (Fig. S1) and over 63% of the folds (781 out of 1,232).
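The filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the fixed random seed, the inclusive length bounds, and the use of exact-duplicate removal as a stand-in for 100% identity filtering are all assumptions made here.

```python
import random

# Ambiguity codes from the text: 'b' -> d/n, 'z' -> q/e, 'x' -> any of the 20.
AMBIGUOUS = {
    "b": "dn",
    "z": "qe",
    "x": "acdefghiklmnpqrstvwy",
}


def resolve_ambiguous(seq, rng):
    """Replace each 'b', 'z', 'x' with a uniformly sampled standard residue."""
    return "".join(rng.choice(AMBIGUOUS[c]) if c in AMBIGUOUS else c for c in seq)


def preprocess(seqs, min_len=60, max_len=160, seed=0):
    """Deduplicate, drop gapped sequences, filter by length, resolve ambiguity.

    Assumptions: exact string match approximates 100% identity filtering,
    and the length bounds are inclusive.
    """
    rng = random.Random(seed)
    kept = []
    for s in dict.fromkeys(seqs):  # removes exact duplicates, keeps order
        if "X" in s:  # 'X' marks a gap; such sequences are discarded
            continue
        if not (min_len <= len(s) <= max_len):
            continue
        kept.append(resolve_ambiguous(s, rng))
    return kept
```

Sampling with a seeded `random.Random` keeps the ambiguity resolution reproducible across runs, which is a design choice of this sketch rather than something stated in the source.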
The power grid is an important infrastructure that ensures the healthy development of the economy and society, and accurate and reasonable prediction of power grid investment demand has always been a focus of power planning departments and power grid enterprises. In view of the complex nonlinear and nonstationary characteristics of the power grid investment demand sequence, a novel hybrid EMD-GASVM-RBFNN forecasting model is proposed, based on the empirical mode decomposition (EMD) method, a support vector machine optimized by a genetic algorithm (GA-SVM), and a radial basis function neural network (RBFNN). First, the EMD method is used to decompose the original power grid investment data sequence into a series of intrinsic mode function (IMF) components and a residual component, which have stronger regularity than the original data. Then, according to the different characteristics of each subsequence, the GA-SVM and RBFNN models are used to forecast different subsequences. Next, the prediction results of the different subsequences are aggregated to obtain the final prediction of power grid investment. Finally, this paper dynamically simulates China’s power grid investment from 2018 to 2020 based on the EMD-GASVM-RBFNN hybrid forecasting model and the Monte Carlo method.
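The decompose, forecast, and aggregate structure described above can be illustrated with a deliberately simplified sketch. It does not implement EMD, GA-SVM, or RBFNN: a trailing moving average stands in for the decomposition, and a least-squares line and a constant mean stand in for the two per-component forecasters. All function names and the window size are hypothetical and chosen only for this example.

```python
def moving_average(series, window):
    """Trailing moving average; early points use the samples available so far."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def decompose(series, window=3):
    """Stand-in for EMD: split into a residual and a smooth trend.

    The two components sum back to the original series.
    """
    trend = moving_average(series, window)
    residual = [x - t for x, t in zip(series, trend)]
    return residual, trend


def forecast_linear(component, steps):
    """Stand-in forecaster 1: extrapolate a least-squares line (needs >= 2 points)."""
    n = len(component)
    mean_x = (n - 1) / 2
    mean_y = sum(component) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(component))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + k) for k in range(steps)]


def forecast_mean(component, steps):
    """Stand-in forecaster 2: repeat the component's historical mean."""
    mean = sum(component) / len(component)
    return [mean] * steps


def hybrid_forecast(series, steps, window=3):
    """Forecast each component with its own model, then sum the forecasts."""
    residual, trend = decompose(series, window)
    parts = [forecast_mean(residual, steps), forecast_linear(trend, steps)]
    return [sum(values) for values in zip(*parts)]
```

The point of the sketch is the control flow: each subsequence gets the model suited to its character, and the final forecast is the sum of the component forecasts.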