Support vector machines (SVMs) are among the preferred machine learning algorithms for virtual compound screening and activity prediction because of their frequently observed high performance levels. However, a well-known conundrum of SVMs (and other supervised learning methods) is the black box character of their predictions, which makes it difficult to understand why models succeed or fail. Herein we introduce an approach to rationalize the performance of SVM models based upon the Tanimoto kernel compared with the linear kernel. Model comparison and interpretation are facilitated by a visualization technique, making it possible to identify descriptor features that determine compound activity predictions. An implementation of the methodology has been made freely available.
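The two kernels compared in the abstract above can be illustrated with a minimal sketch. The abstract gives no formulas, so this assumes the standard Tanimoto coefficient for binary fingerprints, with each fingerprint represented as a set of "on" bit indices. The function names, the toy fingerprints, and the convention for two empty fingerprints are illustrative assumptions, not taken from the paper.

```python
def tanimoto_kernel(a, b):
    """Tanimoto coefficient: |a & b| / (|a| + |b| - |a & b|)."""
    a, b = set(a), set(b)
    common = len(a & b)
    union = len(a) + len(b) - common
    # Assumed convention: two empty fingerprints count as identical.
    return common / union if union else 1.0


def linear_kernel(a, b):
    """Linear kernel on binary vectors: the number of shared 'on' bits."""
    return len(set(a) & set(b))


def gram_matrix(fingerprints, kernel):
    """Pairwise kernel values for a list of fingerprints."""
    return [[kernel(x, y) for y in fingerprints] for x in fingerprints]


# Hypothetical toy fingerprints (sets of 'on' bit indices).
fingerprints = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}]
K_tanimoto = gram_matrix(fingerprints, tanimoto_kernel)
K_linear = gram_matrix(fingerprints, linear_kernel)
```

Unlike the linear kernel, the Tanimoto kernel normalizes the shared-bit count by fingerprint size, so its values lie between 0 and 1. A Gram matrix of this kind could be handed to any SVM implementation that accepts precomputed kernels.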
Supervised machine learning models are widely used in chemoinformatics, especially for the prediction of new active compounds or targets of known actives. Bayesian classification methods are among the most popular machine learning approaches for the prediction of activity from chemical structure. Much work has focused on predicting structure-activity relationships (SARs) on the basis of experimental training data. By contrast, only a few efforts have thus far been made to rationalize the performance of Bayesian or other supervised machine learning models and better understand why they might succeed or fail. In this study, we introduce an intuitive approach for the visualization and graphical interpretation of naïve Bayesian classification models. Parameters derived during supervised learning are visualized and interactively analyzed to gain insights into model performance and identify features that determine predictions. The methodology is introduced in detail and applied to assess Bayesian modeling efforts and predictions on compound data sets of varying structural complexity. Different classification models and features determining their performance are characterized in detail. A prototypic implementation of the approach is provided.
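As a rough illustration of what "parameters derived during supervised learning" might look like for a naïve Bayesian classifier, the sketch below computes per-feature log-odds weights from binary fingerprints. It is a simplified assumption, not the method of the paper: it uses only the feature-presence term, Laplace smoothing with a hypothetical parameter `alpha`, and invented toy data.

```python
import math


def feature_log_odds(actives, inactives, n_features, alpha=1.0):
    """Per-feature weight log(P(f=1 | active) / P(f=1 | inactive)).

    Fingerprints are sets of 'on' feature indices; alpha is an assumed
    Laplace smoothing constant that avoids division by zero.
    """
    weights = []
    for f in range(n_features):
        p_act = (sum(f in fp for fp in actives) + alpha) / (len(actives) + 2 * alpha)
        p_inact = (sum(f in fp for fp in inactives) + alpha) / (len(inactives) + 2 * alpha)
        weights.append(math.log(p_act / p_inact))
    return weights


# Hypothetical toy training data.
actives = [{0, 1}, {0, 2}, {0, 1, 3}]
inactives = [{2, 3}, {3}, {1, 3}]

weights = feature_log_odds(actives, inactives, n_features=4)
# Rank features by the magnitude of their contribution to a prediction.
ranked = sorted(range(4), key=lambda f: abs(weights[f]), reverse=True)
```

Positive weights mark features that favor the active class and negative weights mark features that favor the inactive class. A ranking like `ranked` is the sort of quantity that a visualization could display to show which features determine predictions.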
Support vector machines are a popular machine learning method for many classification tasks in biology and chemistry. In addition, the support vector regression (SVR) variant is widely used for numerical property predictions. In chemoinformatics and pharmaceutical research, SVR has probably become the most popular approach for modeling non-linear structure-activity relationships (SARs) and predicting compound potency values. Herein, we have systematically generated and analyzed SVR prediction models for a variety of compound data sets with different SAR characteristics. Although these SVR models were accurate on the basis of global prediction statistics and not prone to overfitting, they were found to consistently mispredict highly potent compounds. Hence, in regions of local SAR discontinuity, SVR prediction models displayed clear limitations. Compared to observed activity landscapes of compound data sets, landscapes generated on the basis of SVR potency predictions were partly flattened and activity cliff information was lost. Taken together, these findings have implications for practical SVR applications. In particular, prospective SVR-based potency predictions should be considered with caution because artificially low predictions are very likely for highly potent candidate compounds, the most important prediction targets.
The emerging chemical patterns (ECP) approach has been introduced for compound classification. Thus far, only very few ECP applications have been reported. Here, we further investigate the ECP methodology by studying complex classification problems. The analysis involves multi-target data sets with systematically organized subsets of compounds having distinct or overlapping target activities and, in addition, data sets containing classes of specifically active compounds with different mechanisms of action. In systematic classification trials focusing on individual compound subsets or mechanistic classes, ECP calculations utilizing numerical descriptors achieve moderate to high sensitivity, depending on the data set, and consistently high specificity. Accurate ECP predictions are already obtained on the basis of very small learning sets with only three positive training instances, which distinguishes the ECP approach from many other machine learning techniques.
Profiling of compounds against target families has become an important approach in pharmaceutical research for the identification of hits and analysis of selectivity and promiscuity patterns. We report on modeling of profiling experiments involving 429 potential inhibitors and a panel of 24 different kinases using support vector machine (SVM) techniques and naïve Bayesian classification. The experimental matrix contained many different activity profiles. SVM predictions achieved overall high accuracy due to consistently low false-positive and consistently high true-negative rates. However, predictions for promiscuous inhibitors were affected by false-negative rates. Combined target-based SVM classifiers reached or exceeded the performance of SVM profile prediction methods and were superior to Bayesian classification. The classifiers displayed different prediction characteristics including diverse combinations of false-positive and true-negative rates. Predicted and experimentally observed compound activity profiles were compared in detail, revealing activity patterns modeled with different accuracy.
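The rates named in the abstract above follow from a confusion matrix, as the minimal sketch below shows. The labels are invented toy data, and the code assumes both classes occur in `y_true`. Note that the false-positive and true-negative rates are complementary, so a low value of one implies a high value of the other.

```python
def confusion_rates(y_true, y_pred):
    """Compute standard rates from binary labels (1 = active, 0 = inactive).

    Assumes y_true contains at least one active and one inactive compound.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "fpr": fp / (fp + tn),  # false-positive rate
        "tnr": tn / (fp + tn),  # true-negative rate (specificity)
        "fnr": fn / (fn + tp),  # false-negative rate
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical predictions for one kinase: 3 actives, 5 inactives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
rates = confusion_rates(y_true, y_pred)
```

In this toy case the overall accuracy stays fairly high even though two of the three actives are missed, because inactive compounds dominate the data.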