Abstract: The need for careful assembly, training, and validation of quantitative structure–activity/property models (QSAR/QSPR) is more significant than ever as data sets become larger and sophisticated machine learning tools become increasingly ubiquitous and accessible to the scientific community. Regulatory agencies such as the United States Environmental Protection Agency must carefully scrutinize each aspect of a resulting QSAR/QSPR model to determine its potential use in environmental exposure and hazard assessme…
“…Our study corroborates the findings of Lowe et al. 29, emphasizing the complexity and challenges in solubility prediction across diverse chemical spaces. We found that RF models provide a balanced and interpretable framework.…”
Section: Discussion (mentioning)
confidence: 99%
“… 28, and Lowe et al. 29. In comparison, AqSolDB which was published in 2020 has already been used in 2021 by Francoeur et al.…”
Section: Introduction (mentioning)
confidence: 99%
“… 33, and in 2023 by Lowe et al. 29. AqSolDB is one of the largest publicly accessible set with 9,982 entries.…”
Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often report good performance, but their reliability can be deceptive when they are used prospectively. This study investigates the origins of these discrepancies along three directions: a historical perspective, an analysis of the aqueous solubility dataverse, and an assessment of data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performance. We also propose a workflow to curate aqueous solubility data, aimed at producing models that are useful to bench chemists. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of three factors that influence the utility of the models: interlaboratory standard deviation, ionic state of the solute, and data sources. The models obtained here and the quality-assessed datasets are publicly available.
“…The tasks and problems spanned by the proposed methods are also highly diverse and range from classical quantitative structure–activity/property relationship (QSAR/QSPR) problems, such as the prediction of drug-induced liver injury (DILI), Ames mutagenicity, or acetylcholinesterase (AChE) inhibition, to categorization of chemicals, generating transcriptomic profiles and classifying parts of regulatory documents. In their paper, “Transparency in Modeling through Careful Application of OECD’s QSAR/QSPR Principles via a Curated Water Solubility Data Set”, the authors revisited the five Organisation for Economic Co-operation and Development (OECD) principles for QSAR models. The discussion of these principles, which is ongoing also within OECD, is addressed here with particular emphasis on the case of machine learning (ML) models.…”