The importance of 3D protein structure in proteolytic processing is well known. However, despite the plethora of existing methods for predicting proteolytic sites, only a few of them utilize the structural features of potential substrates as predictors. Moreover, to our knowledge, there is currently no method available for predicting the structural susceptibility of protein regions to proteolysis. We developed such a method using data from CutDB, a database that contains experimentally verified proteolytic events. For prediction, we utilized structural features that have been shown to influence proteolysis in earlier studies, such as solvent accessibility, secondary structure, and temperature factor. Additionally, we introduced new structural features, including length of protruded loops and flexibility of protein termini. To maximize the prediction quality of the method, we carefully curated the training set, selected an appropriate machine learning method, and sampled negative examples to determine the optimal positive-to-negative class size ratio. We demonstrated that combining our method with models of protease primary specificity can outperform existing bioinformatics methods for the prediction of proteolytic sites. We also discussed the possibility of utilizing this method for bioinformatics prediction of other post-translational modifications.
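The negative-sampling step mentioned above can be illustrated with a minimal sketch. It assumes the positives are experimentally verified cleavage sites and the negatives are candidate non-cleaved positions. The function name `build_training_set` and the parameter `neg_per_pos` are hypothetical and do not come from the original work, which does not state how its sampling was implemented.

```python
import random


def build_training_set(positives, negatives, neg_per_pos, seed=0):
    """Combine all positives with a random subset of negatives.

    neg_per_pos is the requested number of negatives per positive
    (an illustrative parameter, not one taken from the paper).
    Returns a list of (example, label) pairs with label 1 or 0.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    # Never request more negatives than are available.
    n_neg = min(len(negatives), round(len(positives) * neg_per_pos))
    sampled = rng.sample(negatives, n_neg)
    return [(x, 1) for x in positives] + [(x, 0) for x in sampled]
```

To search for a suitable class size ratio, one would call this function for several values of `neg_per_pos`, train the chosen classifier on each resulting set, and compare the scores on held-out data.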
Abstract: An algorithm for virtual database screening has been developed to detect proteins of practical significance for the pharmaceutical and biotechnological industries. The algorithm was written in Python v. 3.6.5 using the Notepad++ editor. The UniProt database served as the source of information on the structure of the proteins that make up the bovine and pig lung proteomes, and the open DrugBank database was used for the subsequent search for matches in protein structures. The virtual screening detected more than 5,500 proteins present in the bovine and pig lung proteomes; for 99% of them no assessment of practical significance was available, although a manual search of DrugBank showed that some of them are components of drugs. The algorithm also identified drug target proteins in the human lung proteome that are similar to proteins in the bovine (46) and pig (84) lung proteomes. Pairwise alignment of amino acid sequences was used to compare the human and animal target proteins. Overall, the algorithm provided a first-approximation identification of proteins of practical significance that are present, to varying degrees, in the lung proteomes of farm animals. In the future, more detailed screening will be possible through optimization of the algorithm and the use of closed databases, which will provide more complete information about proteins of practical value for biotechnology and medicine.
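The pairwise sequence comparison mentioned above can be illustrated with a minimal sketch of global (Needleman-Wunsch) alignment. The abstract does not say which alignment algorithm or scoring scheme was used, so the simple scores here (match +1, mismatch -1, gap -1) and the names `global_align` and `percent_identity` are assumptions made for illustration only.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). The scoring values are
    illustrative defaults, not those of the original study.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a, aligned_b):
    """Fraction of alignment columns holding the same residue in both rows."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return same / len(aligned_a) if aligned_a else 0.0
```

For real protein comparisons, a substitution matrix such as BLOSUM62 and affine gap penalties would normally replace these toy scores; the sketch only shows the shape of the computation.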
proteome, database, DrugBank, UniProt, virtual screening, Python, lungs
The work was carried out with financial support from the Russian Foundation for Fundamental Research and the administration of the Volgograd region within the framework of scientific project No. 18-44-343003.