Computer-aided drug design has become an essential part of drug development. In this context, QSAR models capable of predicting biological activity, toxicity, and pharmacokinetic properties are widely used to search chemical databases for bioactive lead compounds. The preparation of the dataset used to build these models strongly influences their quality, and sampling requires that the original dataset be divided into a training set (used for model training) and a test set (used to further evaluate the predictive ability of the model). This sampling can be done randomly or rationally, but rational (purposeful) division is superior to the random method. In this paper, we present the MASSA Algorithm ("Molecular dAta Set SAmpling Algorithm"), an open-source, user-friendly Python tool that automatically samples molecular datasets into training and test sets by exploring the biological, physicochemical, and structural spaces of the molecules using Principal Component Analysis, Hierarchical Clustering Analysis, and K-modes clustering. The proposed algorithm is particularly useful when the variables for QSAR model generation are not available, or when multiple QSAR models must be constructed with the same training and test sets. Compared with random sampling, the presented algorithm produces, across multiple replicates, models with lower variability and higher values for validation metrics. These results were obtained even when the descriptors used in the QSAR/QSPR modeling differed from those used to separate the training and test sets, indicating that the tool can be used to build models for more than one QSAR/QSPR technique. Finally, the tool also generates graphical representations of the distribution that can provide insights into the data.
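The general idea behind a rational split can be illustrated with a minimal sketch. This is not MASSA's implementation or API: the function name `cluster_stratified_split` and the example labels are hypothetical, and the cluster labels are assumed to have been computed beforehand (for example, by hierarchical clustering on principal component scores). The sketch only shows the final step, in which each cluster contributes a fixed fraction of its members to the test set so that both sets cover the same regions of the space.

```python
import random
from collections import defaultdict


def cluster_stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test sets, drawing from every cluster.

    `labels` holds one precomputed cluster label per molecule (assumption:
    the clustering itself was done elsewhere). Labels must be sortable.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)  # local RNG so the split is reproducible

    # Group sample indices by cluster label.
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)

    train, test = [], []
    for label in sorted(clusters):
        members = clusters[label][:]
        rng.shuffle(members)
        if len(members) < 2:
            n_test = 0  # singleton clusters stay in the training set
        else:
            # At least one test sample, but always leave one for training.
            n_test = max(1, round(len(members) * test_fraction))
            n_test = min(n_test, len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels for 20 molecules falling into three clusters.
labels = ["A"] * 10 + ["B"] * 5 + ["C"] * 5
train_idx, test_idx = cluster_stratified_split(labels, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Because the split depends only on the cluster labels and the seed, the same training and test indices can be reused for several models built with different descriptors.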