Diabetes mellitus is a chronic metabolic disease that disrupts blood glucose homeostasis and can lead to severe complications. As the number of people with diabetes grows, there is an urgent need to develop new antidiabetic drugs. Advances in artificial intelligence provide a powerful tool for accelerating their discovery. This work establishes a predictor, iPADD, for discovering potential antidiabetic drugs. In the predictor, we encoded the drugs with four kinds of molecular fingerprints and their combinations, and then applied minimum-redundancy maximum-relevance (mRMR) combined with an incremental feature selection strategy to screen for optimal features. Based on the optimal feature subset, eight machine learning algorithms were used to train models with 5-fold cross-validation. The best model achieved an accuracy (Acc) of 0.983 and an area under the receiver operating characteristic curve (auROC) of 0.989 on an independent test set. To further validate iPADD, we selected 65 natural products for case analysis: 13 natural products in clinical trials as positive samples and 52 natural products as negative samples. The model predicted all of them correctly except abscisic acid. Molecular docking showed that quercetin and resveratrol bind stably to the diabetes target NR1I2. These results are consistent with the predictions of iPADD, indicating that the machine learning model has strong generalization ability. The source code of iPADD is available at https://github.com/llllxw/iPADD.
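The mRMR ranking followed by incremental feature selection can be illustrated with a minimal sketch. It is not the iPADD implementation: it assumes binary fingerprint bits, uses the difference form of mRMR (relevance minus mean redundancy), and substitutes a leave-one-out 1-nearest-neighbour accuracy for the 5-fold cross-validated models described above. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def mrmr_rank(X, y):
    """Greedy mRMR ranking: maximise relevance to y minus mean redundancy
    with the features already selected (difference form, an assumption)."""
    columns = [[row[j] for row in X] for j in range(len(X[0]))]
    relevance = [mutual_information(col, y) for col in columns]
    selected, remaining = [], list(range(len(columns)))
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def loo_1nn_accuracy(X, y):
    """Stand-in evaluator (assumption): leave-one-out 1-NN with Hamming distance."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum(a != b for a, b in zip(xi, X[j])),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranking, evaluate):
    """Evaluate the top-1, top-2, ... ranked features and keep the best prefix."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranking) + 1):
        subset = ranking[:k]
        score = evaluate([[row[j] for j in subset] for row in X], y)
        if score > best_score:  # strict: ties keep the smaller subset
            best_k, best_score = k, score
    return ranking[:best_k], best_score


# Toy fingerprints (hypothetical): bit 0 matches the label, bit 1 is a
# near-duplicate of bit 0, bits 2 and 3 carry no label information.
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]
X_toy = [
    [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
]
ranking = mrmr_rank(X_toy, y_toy)
subset, score = incremental_feature_selection(X_toy, y_toy, ranking, loo_1nn_accuracy)
print(ranking, subset, score)
```

In practice the evaluator would be replaced by cross-validated accuracy of the chosen classifier, and the curve of score against subset size would be inspected rather than only its maximum.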
Background: Lipocalins belong to the calycin family, and their sequences are generally between 165 and 200 residues long. They are mainly stable, multifunctional extracellular proteins. Lipocalins play an important role in several stress responses and allergic inflammations. Because accurate identification of lipocalins could provide significant evidence for the study of their function, it is necessary to develop a machine learning-based model to recognize them. Methods: In this study, we constructed a prediction model to identify lipocalins. Their sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized using feature selection techniques. A random forest (RF) classifier was trained on the optimal features. Results: In 10-fold cross-validation, our computational model classified lipocalins with an accuracy of 95.03% and an area under the curve of 0.987. On the independent dataset, our computational model achieved an accuracy of 89.90%, which was 4.17% higher than that of the existing model. Conclusions: In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the features that produced the maximum accuracy. On the basis of this best feature subset, the RF-based classifier obtained the best prediction results.
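Two of the six descriptors, AAC and CKSAAP, are simple enough to sketch directly. The code below is an illustrative sketch rather than the authors' implementation; the function names, the maximum gap `k_max`, and the handling of non-standard residues are assumptions, since the abstract does not specify them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Non-standard residues count towards the length but get no feature.
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def cksaap(seq, k_max=2):
    """Composition of k-spaced amino acid pairs for gaps k = 0..k_max.

    For each gap k, residue pairs (seq[i], seq[i + k + 1]) are counted and
    divided by the number of such pairs, giving 400 * (k_max + 1) values.
    k_max=2 is an arbitrary default chosen for illustration.
    """
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


# Concatenate descriptors into one feature vector for a hypothetical sequence.
vector = aac("MKTLLLAV") + cksaap("MKTLLLAV", k_max=1)
print(len(vector))
```

Vectors built this way for each sequence would then be passed to feature selection and a classifier such as random forest.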