Selecting the most significant basis functions for spline regression is an extremely challenging problem in quantitative structure-activity relationship studies. Spline-based regression models are normally derived either incrementally or using genetic algorithms, and neither route is guaranteed to provide an optimal solution. To address this issue in a systematic way, we describe herein a novel variable selection method, the random replacement method (RRM), which combines the principles of replacement methods (RMs) and genetic algorithms. We applied RRM to the selection of variables in multiple linear regression on two model data sets and showed that the method outperforms other approaches, namely, variable selection and model building using prediction, and ant colony optimization. We extended the application of RRM to the selection of basis functions in spline regression; this approach is named random function approximation (RFA). We compared the performance of RFA with that of multivariate adaptive regression splines and genetic function approximation and demonstrated the improved performance of the proposed method and the quality of the generated models based on the coefficient of determination (R²) and the leave-one-out cross-validated Q² (Q²loo).
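The two quality measures named above can be illustrated with a minimal sketch for an ordinary multiple linear regression model. This is not the authors' implementation: the function name `r2_and_q2_loo` is hypothetical, and the sketch assumes the common definition Q²loo = 1 - PRESS/SS_tot with SS_tot taken around the mean of the full response vector (some authors use the mean of each training fold instead).

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R^2, Q^2_loo) for an ordinary least-squares fit with intercept.

    Assumption: Q^2_loo = 1 - PRESS / SS_tot, where PRESS is the sum of
    squared leave-one-out prediction errors and SS_tot uses the overall mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    ss_tot = np.sum((y - y.mean()) ** 2)

    # Fit on all samples to get the residual sum of squares for R^2.
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)

    # Leave one sample out, refit, and predict the held-out sample.
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ b) ** 2

    return 1.0 - rss / ss_tot, 1.0 - press / ss_tot
```

For a least-squares fit, Q²loo cannot exceed R², so a large gap between the two is a common warning sign of overfitting when comparing descriptor or basis-function subsets.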
Modelling of skin sensitization data for 255 diverse compounds and 450 calculated descriptors was performed to develop global predictive classification models that are applicable to the whole chemical space. With this aim, we employed two automated procedures: (a) D-optimal design to select optimal members of the training and test sets, and (b) the k-Nearest Neighbour (kNN) classification method combined with Genetic Algorithms (GA-kNN classification) to select significant and independent descriptors for building the models. This methodology yielded multiple models, M1-M5, that are stable and robust. The best among them, model M1 (CCR(train) = 84.3%, CCR(test) = 87.2% and CCR(ext) = 80.4%), is based on six neighbours and nine descriptors. The results suggest that (a) M1 is stable and robust and performs better than the models reported in the literature, and (b) the combination of D-optimal design and GA-kNN classification is a very promising approach. Consensus prediction based on models M1-M5 improved the CCR of the training, test and external validation datasets by 3.8%, 4.45% and 3.85%, respectively, over M1. From the analysis of the physical meaning of the selected descriptors, it is inferred that the skin sensitization potential of small organic compounds can be accurately predicted using calculated descriptors that code for the following fundamental properties: (i) lipophilicity, (ii) atomic polarizability, (iii) shape, (iv) electrostatic interactions, and (v) chemical reactivity.
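The idea of a consensus prediction over several classifiers can be sketched as follows. The abstract does not state how the consensus was formed or how CCR was computed, so this sketch makes two labeled assumptions: consensus is a simple majority vote across models, and CCR is the percentage of compounds classified correctly. The function names `ccr` and `consensus` are hypothetical.

```python
from collections import Counter

def ccr(y_true, y_pred):
    """Correct classification rate in percent.

    Assumption: CCR is the plain fraction of correctly classified samples;
    some QSAR studies instead average sensitivity and specificity.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def consensus(predictions):
    """Majority vote over per-model label lists (one list per model).

    Assumption: each model has one vote; with an odd number of models and
    two classes, ties cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Toy labels for three hypothetical models (not data from the study).
y_true = [1, 0, 1, 0]
m1 = [1, 0, 0, 0]
m2 = [1, 1, 1, 0]
m3 = [0, 0, 1, 0]
y_cons = consensus([m1, m2, m3])
```

In this toy case each model makes a different single error, so the vote corrects all of them; a consensus helps most when the individual models' errors are not strongly correlated.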