2015
DOI: 10.1021/acs.jcim.5b00101

Comparing the Influence of Simulated Experimental Errors on 12 Machine Learning Algorithms in Bioactivity Modeling Using 12 Diverse Data Sets

Abstract: To date, no systematic study has assessed the effect of random experimental errors on the predictive power of QSAR models. To address this shortage, we have benchmarked the noise sensitivity of 12 learning algorithms on 12 data sets (15,840 models in total), namely the following: Support Vector Machines (SVM) with radial and polynomial (Poly) kernels, Gaussian Process (GP) with radial and polynomial kernels, Relevant Vector Machines (radial kernel), Random Forest (RF), Gradient Boosting Machines (GBM), Bagged …

Cited by 31 publications (34 citation statements)
References 66 publications
“…Modelling the high confidence dataset led to a similar performance, with an RMSE_test of 0.45 pGI50 units and an R^2_0,test value of 0.84. This indicates that the predictive power of RF does not decrease when we included the data points measured in only one experiment, which is in agreement with a recent benchmarking study on the noise sensitivity of machine learning algorithms in bioactivity modelling (Cortes-Ciriano et al., 2015a).…”
Section: Results (supporting)
confidence: 89%
“…7). It is also important to consider that RF models are generally robust to moderate noise levels when modeling QSAR data sets, and hence, low levels of noise are well tolerated, and, in fact, might even help to generate models robust to noisy input data [62,63]. Together, these results indicate that although the predictions generated by moderately predictive base models might be noisy, they better explain the relevant variance connecting chemical and biological space [13], and that including base models with low predictive power does not add additional predictive signal to improve the modelling of these data sets.…”
Section: Results (mentioning)
confidence: 99%
“…The comparability of multiple independent bioactivity measurements (extended here to the area of analyzing cytotoxicity data) and the influence of data quality on bioactivity modeling have received increasing attention over the last few years. Many of these studies have used data derived from the ChEMBL database, which comprises compound activities against proteins, cell lines and complex systems such as whole organisms.…”
Section: Introduction (mentioning)
confidence: 99%