Abstract:One of the OECD principles for model validation requires defining the Applicability Domain (AD) for the QSAR models. This is important since the reliable predictions are generally limited to query chemicals structurally similar to the training compounds used to build the model. Therefore, characterization of interpolation space is significant in defining the AD and in this study some existing descriptor-based approaches performing this task are discussed and compared by implementing them on existing validated datasets from the literature. Algorithms adopted by different approaches allow defining the interpolation space in several ways, while defined thresholds contribute significantly to the extrapolations. For each dataset and approach implemented for this study, the comparison analysis was carried out by considering the model statistics and relative position of test set with respect to the training space.
BackgroundWith the growing popularity of using QSAR predictions towards regulatory purposes, such predictive models are now required to be strictly validated, an essential feature of which is to have the model’s Applicability Domain (AD) defined clearly. Although in recent years several different approaches have been proposed to address this goal, no optimal approach to define the model’s AD has yet been recognized.ResultsThis study proposes a novel descriptor-based AD method which accounts for the data distribution and exploits k-Nearest Neighbours (kNN) principle to derive a heuristic decision rule. The proposed method is a three-stage procedure to address several key aspects relevant in judging the reliability of QSAR predictions. Inspired from the adaptive kernel method for probability density function estimation, the first stage of the approach defines a pattern of thresholds corresponding to the various training samples and these thresholds are later used to derive the decision rule. Criterion deciding if a given test sample will be retained within the AD is defined in the second stage of the approach. Finally, the last stage tries reflecting upon the reliability in derived results taking model statistics and prediction error into account.ConclusionsThe proposed approach addressed a novel strategy that integrated the kNN principle to define the AD of QSAR models. Relevant features that characterize the proposed AD approach include: a) adaptability to local density of samples, useful when the underlying multivariate distribution is asymmetric, with wide regions of low data density; b) unlike several kernel density estimators (KDE), effectiveness also in high-dimensional spaces; c) low sensitivity to the smoothing parameter k; and d) versatility to implement various distances measures. The results derived on a case study provided a clear understanding of how the approach works and defines the model’s AD for reliable predictions.
Several classical and two recently proposed Applicability Domain (AD) approaches were implemented on a set of three classification models retrieved from a published study to assess the ready biodegradability of chemicals. Each model was associated with an optimal AD approach based on its ability to a) retain maximum test molecules within the model's AD, b) be appropriate for the strategy used towards model development and c) show reasonably converging results with those derived with other AD approaches used. A decision criterion was also set to evaluate the AD of two consensus models that were developed in the original study. An overview of test molecules excluded from the AD of all the five biodegradability models was provided including an attempt to identify the major structural features and molecular descriptors possibly relevant in deciding upon their ready biodegradability. Apart from the test set, an overview of the results derived on the external validation set molecules was provided.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.