The Brazilian Compound Library (BraCoLi) is a novel, open-access, manually curated electronic library of compounds developed by Brazilian research groups to support further computer-aided drug design work. Herein, the first version of the database, comprising 1,176 compounds, is described. The chemical diversity and drug-like profiles of BraCoLi were also characterized to analyze its chemical space. A significant proportion of the compounds satisfied Lipinski's and Veber's rules, along with other drug-likeness criteria. A comparison using principal component analysis showed that BraCoLi is similar to other databases (FDA-approved drugs and NuBBEDB) in its structural and physicochemical patterns. Furthermore, a scaffold analysis showed that BraCoLi contains several privileged chemical skeletons with great diversity. Despite the similar distribution in the structural and physicochemical spaces, similarity analysis indicated that the compounds in BraCoLi are generally different from those in the two other databases, an innovative aspect that is desirable for novel drug design purposes.
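As a rough illustration of the drug-likeness filters mentioned above, the sketch below checks a compound against the commonly cited Lipinski thresholds (molecular weight ≤ 500 Da, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and Veber thresholds (at most 10 rotatable bonds, polar surface area ≤ 140 Å²). It assumes the descriptors have already been computed elsewhere. The `Descriptors` class, the function names, and the two example compounds are hypothetical and are not taken from BraCoLi; allowing one Lipinski violation is a common convention, not necessarily the criterion the authors applied.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed physicochemical descriptors for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_donors: int           # hydrogen-bond donors
    h_acceptors: int        # hydrogen-bond acceptors
    rotatable_bonds: int    # number of rotatable bonds
    tpsa: float             # topological polar surface area in square angstroms


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four thresholds the compound exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_veber(d: Descriptors) -> bool:
    """Veber's rules: at most 10 rotatable bonds and polar surface area <= 140."""
    return d.rotatable_bonds <= 10 and d.tpsa <= 140


def is_drug_like(d: Descriptors, max_lipinski_violations: int = 1) -> bool:
    """Combine both filters; the tolerance of one violation is an assumption."""
    return lipinski_violations(d) <= max_lipinski_violations and passes_veber(d)


# Illustrative, made-up descriptor values (not real database entries).
small_compound = Descriptors(320.4, 2.1, 2, 5, 4, 78.0)
large_compound = Descriptors(812.9, 6.3, 7, 14, 15, 210.0)

print(is_drug_like(small_compound))
print(is_drug_like(large_compound))
```

In practice the descriptor values would come from a cheminformatics toolkit; only the thresholding step is shown here.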
QSAR models capable of predicting biological, toxicity, and pharmacokinetic properties are widely used to search for lead bioactive molecules in chemical databases. The preparation of the dataset used to build these models strongly influences their quality, and sampling requires that the original dataset be divided into training (for model training) and test (for statistical evaluation) sets. This sampling can be done randomly or rationally, but rational division is superior. In this paper, we present MASSA, a Python tool that automatically samples datasets by exploring the biological, physicochemical, and structural spaces of molecules using PCA, HCA, and K-modes. The proposed algorithm is particularly useful when the variables used for QSAR are not available, or when multiple QSAR models must be built with the same training and test sets, and it produces models with lower variability and better validation metrics. These results were obtained even when the descriptors used in the QSAR/QSPR models differed from those used to separate the training and test sets, indicating that the tool can be used to build models for more than one QSAR/QSPR technique. Finally, the tool also generates graphical representations that can provide insights into the data.
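One simple way to picture a rational split is stratified sampling over cluster labels: once each molecule has been assigned to a cluster, the same fraction of every cluster is sent to the test set, so both sets cover the same regions of chemical space. The sketch below assumes the cluster labels already exist (for example, from a clustering step such as HCA or K-modes). The function `stratified_split` and its parameters are hypothetical and do not reproduce MASSA's actual interface or algorithm.

```python
import random
from collections import defaultdict


def stratified_split(cluster_labels, test_fraction=0.2, seed=0):
    """Split item indices into train/test so each cluster contributes
    roughly the same fraction of its members to the test set."""
    rng = random.Random(seed)

    # Group item indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)  # random choice within each cluster
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Made-up labels: ten molecules in cluster 0 and five in cluster 1.
labels = [0] * 10 + [1] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=1)
print(len(train_idx), len(test_idx))
```

Because the index lists depend only on the cluster labels and the seed, the same split can be reused across several models, which is the scenario the abstract describes for building multiple QSAR models on identical training and test sets.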