Virtual screening performance of support vector machines (SVM) depends on the diversity of the active and inactive compounds used for training. While diverse inactive compounds can be routinely generated, the number and diversity of known actives are typically low. We evaluated the performance of SVM trained on sparsely distributed actives in six MDDR biological target classes, each composed of a high number of known actives (983-1645) of high, intermediate, or low structural diversity (muscarinic M1 receptor agonists, NMDA receptor antagonists, thrombin inhibitors, HIV protease inhibitors, cephalosporins, and renin inhibitors). SVM trained on regularly sparse data sets of 100 actives showed improved yields at substantially reduced false-hit rates compared to those of published studies and those of a Tanimoto-based similarity searching method using the same data sets and molecular descriptors. SVM trained on very sparse data sets of 40 actives (2.4%-4.1% of the known actives) predicted 17.5-39.5%, 23.0-48.1%, and 70.2-92.4% of the remaining 943-1605 actives in the high, intermediate, and low diversity classes, respectively, 13.8-68.7% of which are outside the training compound families. SVM predicted 99.97% and 97.1% of the 9.997M PubChem and 167K remaining MDDR compounds as inactive, and 2.6%-8.3% of the 19,495-38,483 MDDR compounds similar to the known actives as active. These results suggest that SVM has substantial capability in identifying novel active compounds from sparse active data sets at low false-hit rates.
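As a rough illustration of the similarity-searching baseline mentioned above, the sketch below computes the Tanimoto coefficient and flags library compounds that resemble any known active. It assumes binary fingerprints stored as sets of on-bit indices; the abstract does not specify the descriptor encoding or the search protocol, so the function names and the 0.7 threshold are hypothetical choices, not the study's settings.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(known_actives, library, threshold=0.7):
    """Return names of library compounds whose best Tanimoto score against
    any known active reaches the (hypothetical) threshold."""
    hits = []
    for name, fingerprint in library.items():
        best = max(tanimoto(fingerprint, active) for active in known_actives)
        if best >= threshold:
            hits.append(name)
    return hits
```

A compound is reported as a hit when it is close to at least one active, which is why such a search tends to stay within the structural families already present in the query set.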
Baseline correction is one of the pre-processing steps in the analysis of metabolite signals from chemometric analytical instruments. Fully automated baseline correction techniques, although more convenient to use, tend to be less accurate than semi-automated ones. A fully automated baseline correction algorithm, the automated iterative moving averaging algorithm (AIMA), is presented and compared with three recently introduced semi-automated algorithms, namely adaptive iteratively reweighted penalized least squares (airPLS), asymmetric least squares baseline correction (ALS), and a parametric method, using NMR and Raman spectra and HPLC chromatograms. AIMA's potential for increasing the accuracy of multivariate analysis of SELDI-TOF and LCMS data was also assessed. The results show that AIMA's accuracy is comparable to that of these semi-automated algorithms, and that AIMA has the advantage of ease of use. An AIMA plug-in for an open source metabolomics analysis tool, MZmine, was also developed. The AIMA plug-in is available at http://padel.nus.edu.sg/software/padelaima.
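To make the idea of baseline estimation by repeated moving averaging concrete, here is a minimal generic sketch. It is not the published AIMA algorithm, whose details are not given in the abstract: the window size, the stopping tolerance, and the rule of keeping the pointwise minimum of the current estimate and its smoothed version are all assumptions made for illustration.

```python
def moving_average(values, half_window):
    """Centered moving average; the window is truncated at the edges."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def estimate_baseline(signal, half_window=2, max_iter=100, tol=1e-6):
    """Erode peaks by repeatedly smoothing and keeping the pointwise minimum.

    Assumed illustrative scheme, not the AIMA procedure itself.
    """
    baseline = list(signal)
    for _ in range(max_iter):
        smoothed = moving_average(baseline, half_window)
        updated = [min(b, s) for b, s in zip(baseline, smoothed)]
        change = max(abs(u - b) for u, b in zip(updated, baseline))
        baseline = updated
        if change < tol:  # stop once the estimate no longer moves
            break
    return baseline


def correct_baseline(signal, **kwargs):
    """Subtract the estimated baseline from the signal."""
    baseline = estimate_baseline(signal, **kwargs)
    return [s - b for s, b in zip(signal, baseline)]
```

Because the estimate never rises above the original signal, the corrected signal stays non-negative; a window narrower than the peaks of interest would erode them only slowly, so the window size matters in practice.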
[...] diagnostic abilities. They can be used to predict the biological activity (e.g., IC50) or class (e.g., inhibitors versus noninhibitors) of compounds before the actual biological testing. They can also be used to analyze the structural characteristics that give rise to the properties of interest.
As illustrated in Figure 1.1, developing a QSAR model starts with the collection of data for the property of interest, taking the quality of the data into consideration. Low-quality data must be excluded because they lower the quality of the model. The collected molecules are then represented by features, namely molecular descriptors, which describe important information about the molecules. There are many types of molecular descriptors, but not all will be useful for a particular modeling task. Thus, uninformative or redundant molecular descriptors should be removed before the modeling process. Subsequently, for tuning and validation of the QSAR model, the full data set is divided into a training set and a testing set prior to learning.
During the learning process, various modeling methods such as multiple linear regression, logistic regression, and machine learning methods are used to build models that describe the empirical relationship between the structure and the property of interest. The optimal model is obtained by searching for the optimal modeling parameters and feature subset simultaneously. The finalized model built from the optimal parameters then undergoes validation with the testing set to ensure that it is appropriate and useful.
Figure 1.1 General workflow of developing a QSAR model.
1 Current Modeling Methods Used in QSAR/QSPR
Figure 1.3 A simple structure showing the three layers of an artificial neural network.
Figure 1.4 GRNN is a neural network with four layers.
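Two of the data-preparation steps in the QSAR workflow described above, removing uninformative descriptors and dividing the data into training and testing sets, can be sketched as follows. This is an illustrative outline only: the function names, the zero-variance criterion for "uninformative", and the 25% test fraction are assumptions, not prescriptions from the text.

```python
import random
from statistics import pvariance


def drop_constant_descriptors(rows, names, min_variance=1e-12):
    """Remove descriptor columns whose variance across all molecules is
    (near) zero, since they carry no information for modeling."""
    keep = [j for j in range(len(names))
            if pvariance([row[j] for row in rows]) > min_variance]
    filtered = [[row[j] for j in keep] for row in rows]
    return filtered, [names[j] for j in keep]


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle molecule indices reproducibly and hold out a testing set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, ids):
        return [seq[i] for i in ids]

    return (pick(rows, train_idx), pick(labels, train_idx),
            pick(rows, test_idx), pick(labels, test_idx))
```

The testing set is set aside before any model fitting so that the final validation step uses molecules the model has not seen during parameter and feature selection.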