Recent implementations of QSAR modeling software provide the user with numerous models and a wealth of information. In this work, we provide some guidance on how one should interpret the results of QSAR modeling, how to compare and assess the resulting models, and how to select the best and most consistent ones. Two QSAR datasets are used as case studies for comparing model performance parameters and model selection methods. We demonstrate the capabilities of sum of ranking differences (SRD) in model selection and ranking, and we identify the best performance indicators and models. While exchanging the original training and (external) test sets does not affect the ranking of performance parameters, in certain cases it yields improved models (despite the lower number of molecules in the training set). Performance parameters for external validation are substantially separated from the other merits in the SRD analyses, highlighting their value in data fusion.
Introduction

Model comparison and selection of the best model is an evergreen topic in scientific investigations. The process is full of contradictions: the bias-variance trade-off, local minima, the search for robust models, the principle of parsimony, etc., all inherently involve several competing models. One model is better from one point of view, another from a different one. Even if one fixes the aim (and algorithm) according to a given criterion (R², Q², Mallows' Cp, the Akaike information criterion, the Bayesian information criterion, etc.), applying it to the training, validation and test sets will necessarily yield different models for describing the existing data and for predicting future samples. The situation is further complicated by random effects: it is relatively easy to find conditions under which one model is clearly superior to the others. Many authors, instinctively or deliberately, select datasets, splits, etc. for which their own descriptor selection or model-building algorithm performs better than rival approaches.

Kalivas et al. suggested selecting harmonious models that take the bias-variance trade-off into account: finding the 'best' model is difficult and not unambiguous. A more biased model has less variance and vice versa. However, harmonious models are not necessarily parsimonious [1]. The scope of the methodology has recently been extended with the idea of sum of ranking differences (SRD) for partial least squares and ridge regression models [2].

Principal-component analysis (PCA) has been applied by Geladi [3,4] and Todeschini et al. [5] to find the best and worst regression and classification models, respectively. PCA was carried out on a matrix of regression vectors, and dominant patterns (groupings, outliers) could be detected among the models. The interpretation of the PCA results is straightforward: principal component 1 marks the direction of the best and worst regression models, while principal component 2 reflects the different behaviors of the regression models on different datasets. The models lyin...
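As a concrete illustration of such a PCA-based comparison of models, the following is a minimal sketch in Python (using numpy and scikit-learn). The matrix of model results is purely hypothetical and randomly generated here; the assumption is that rows represent models and columns their performance on several datasets (or the elements of their regression vectors), so that the first two principal components expose the grouping and outlier patterns mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix: rows = regression models, columns = their performance
# on several datasets (or elements of their regression vectors).
rng = np.random.default_rng(1)
model_matrix = rng.random((8, 5))

# Standardize and project onto the first two principal components;
# PC1 separates the best from the worst models, PC2 reflects how their
# behavior differs across datasets (grouping and outliers become visible).
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(model_matrix))
for i, (pc1, pc2) in enumerate(scores):
    print(f"model {i}: PC1={pc1:+.2f}, PC2={pc2:+.2f}")
```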
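The core of the SRD procedure can likewise be illustrated with a short, self-contained sketch. The assumptions here are that the input matrix has cases (e.g., molecules or datasets) in its rows and the models or performance merits to be compared in its columns, and that the row-wise average serves as the reference ranking (a common choice when no gold standard is available); the validation against random rankings and the normalization used in the full SRD methodology are omitted from this sketch.

```python
import numpy as np
from scipy.stats import rankdata

def sum_of_ranking_differences(X, reference=None):
    """Minimal SRD sketch: rank each column (model/merit) over the rows
    (cases), rank a reference column the same way, and sum the absolute
    rank differences per column. Smaller SRD = closer to the reference."""
    X = np.asarray(X, dtype=float)
    if reference is None:
        # a common reference choice is the row-wise average (consensus)
        reference = X.mean(axis=1)
    ref_ranks = rankdata(reference)                    # reference ranks, ties averaged
    col_ranks = np.apply_along_axis(rankdata, 0, X)    # ranks within each column
    return np.abs(col_ranks - ref_ranks[:, None]).sum(axis=0)

# toy example: 6 cases (rows) evaluated by 4 models/merits (columns)
rng = np.random.default_rng(0)
scores = rng.random((6, 4))
srd = sum_of_ranking_differences(scores)
print(dict(enumerate(srd)))   # lower values rank closer to the consensus reference
```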