Consensus strategies have been widely
applied in many different
scientific fields, based on the assumption that the fusion of several
sources of information increases the outcome reliability. Despite
the widespread application of consensus approaches, their advantages
in quantitative structure–activity relationship (QSAR) modeling
have not been thoroughly evaluated, mainly due to the lack of appropriate
large-scale data sets. In this study, we evaluated the advantages
and drawbacks of consensus approaches compared to single classification
QSAR models. To this end, we used a data set of three properties (androgen
receptor binding, agonism, and antagonism) for approximately 4000
molecules with predictions performed by more than 20 QSAR models,
made available in a large-scale collaborative project. The individual
QSAR models were compared with two consensus approaches, majority
voting and the Bayes consensus with discrete probability distributions,
in both protective and nonprotective forms. Consensus strategies proved
to be more accurate and to better cover the analyzed chemical space
than individual QSARs on average, thus motivating their widespread
application for property prediction. Scripts and data to reproduce
the results of this study are available for download.
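
As a minimal illustration of the majority-voting consensus mentioned above, the sketch below combines class labels from several models for one molecule. The function name `majority_vote`, the `min_agreement` threshold, and the use of `None` for a model that gives no prediction are assumptions made for illustration; in particular, treating the "protective" form as one that withholds a prediction when agreement falls below a threshold is an assumed reading, not necessarily the exact rule used in the study.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(
    predictions: Sequence[Optional[str]],
    min_agreement: float = 0.0,
) -> Optional[str]:
    """Combine per-model class labels for a single molecule.

    predictions   : one label per model; None means the model gave no
                    prediction (hypothetical convention, e.g. molecule
                    outside that model's applicability domain).
    min_agreement : fraction of the available votes the winning class
                    must reach. 0.0 mimics a nonprotective vote; a
                    higher value is an assumed "protective" variant.
    Returns the consensus label, or None if no label is assigned.
    """
    votes = [p for p in predictions if p is not None]
    if not votes:
        return None  # no model covers this molecule

    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]

    # A tie between the two most frequent classes yields no prediction.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None

    # Protective variant: require a minimum level of agreement.
    if top_count / len(votes) < min_agreement:
        return None

    return top_label


if __name__ == "__main__":
    models = ["active", "active", "inactive", None]
    print(majority_vote(models))                      # nonprotective
    print(majority_vote(models, min_agreement=0.75))  # assumed protective
```

With a threshold, the consensus trades coverage for reliability: fewer molecules receive a label, but each assigned label is backed by stronger agreement among the models.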
Interest in multitask and deep learning strategies has been increasing in the last few years, particularly as applied to large and complex datasets for quantitative structure-activity relationship (QSAR) analysis. Multitask approaches allow the simultaneous prediction of related molecular properties through information sharing, whereas deep learning strategies increase the potential for capturing nonlinear relationships. In this work, we compare the binary classification capability of multitask deep and shallow neural networks to single-task strategies used as benchmarks (i.e., k-nearest neighbours, N-nearest neighbours, random forest and Naïve Bayes), as well as to multitask supervised self-organizing maps. The comparison was carried out with an extended QSAR dataset containing annotations of molecular binding, agonism and antagonism activity on 11 nuclear receptors, for a total of 14,963 molecules, divided into training and test sets and labelled for their bioactivity on at least one of 30 binary tasks. An additional 304 chemicals were used as an external evaluation set to further validate the models. Although no approach systematically outperformed the others, task-specific differences were found, suggesting the benefit of multitask learning for tasks that are less represented. On average, some of the single-task approaches and the multitask deep learning strategies had similar performances. However, the latter can have advantages, such as simpler management of predictions and of applicability domain assessment for future samples. On the other hand, the parameter tuning required by neural networks is generally time-consuming, suggesting that the modelling strategy should be evaluated case by case.
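
Because each molecule is labelled on at least one, but not necessarily all, of the 30 tasks, a multitask model has to cope with missing labels. One common way to do this, shown here as a sketch and not as the authors' implementation, is to average the binary cross-entropy only over the molecule-task pairs that have a label. The function name `masked_bce` and the use of `None` for a missing label are hypothetical choices for this example.

```python
import math
from typing import Optional, Sequence


def masked_bce(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[Optional[int]]],
    eps: float = 1e-12,
) -> float:
    """Mean binary cross-entropy over labelled (molecule, task) pairs.

    probs  : predicted probability of the positive class, one row per
             molecule and one column per task.
    labels : 1, 0, or None (None = no annotation for that task;
             hypothetical convention for this sketch).
    """
    total = 0.0
    n_labelled = 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            if y is None:
                continue  # missing label: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            n_labelled += 1
    if n_labelled == 0:
        raise ValueError("no labelled entries to average over")
    return total / n_labelled


if __name__ == "__main__":
    # Two molecules, three tasks; several annotations are missing.
    probs = [[0.9, 0.2, 0.5], [0.4, 0.7, 0.1]]
    labels = [[1, None, 0], [None, 1, None]]
    print(round(masked_bce(probs, labels), 4))
```

Under this scheme, a sparsely annotated task still contributes to training through the molecules that do carry its label, while sharing the rest of the model with the better-represented tasks.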