2021
DOI: 10.1080/1062936x.2021.1883107

Cross-validation strategies in QSPR modelling of chemical reactions

Abstract: In this article, we consider cross-validation of quantitative structure-property relationship models for reactions and show that the conventional k-fold cross-validation (CV) procedure gives an 'optimistically' biased assessment of prediction performance. To address this issue, we suggest two strategies of model cross-validation: 'transformation-out' CV and 'solvent-out' CV. Unlike the conventional k-fold cross-validation approach that does not consider the nature of objects, the proposed procedures provid…
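The abstract is truncated here, so the exact procedures are not reproduced. As a minimal sketch of the general idea, assuming that 'transformation-out' and 'solvent-out' CV mean that all records sharing a transformation (or a solvent) are held out together, a grouped fold splitter might look like the following. The function name `group_out_folds`, the greedy fold balancing, and the toy reaction records are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def group_out_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group label is
    shared between the training and test indices of a fold."""
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    if n_folds > len(by_group):
        raise ValueError("n_folds cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest records.
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_group[label])

    for i in range(n_folds):
        test = sorted(folds[i])
        train = sorted(j for k in range(n_folds) if k != i for j in folds[k])
        yield train, test


# Hypothetical toy records: (transformation label, solvent).
reactions = [
    ("SN2_A", "water"), ("SN2_A", "ethanol"), ("SN2_B", "water"),
    ("E2_A", "ethanol"), ("E2_A", "DMSO"), ("SN2_B", "DMSO"),
]
transformation_labels = [t for t, _ in reactions]
solvent_labels = [s for _, s in reactions]

# Group by transformation: a test reaction is never seen in training
# under another solvent. Group by solvent: test solvents are unseen.
transformation_out = list(group_out_folds(transformation_labels, n_folds=3))
solvent_out = list(group_out_folds(solvent_labels, n_folds=3))
```

Under this assumption the only difference between the two strategies is the grouping key. A conventional k-fold split ignores the key entirely, so the same transformation can appear on both sides of a split, which is consistent with the optimistic bias the abstract describes.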

Citations: cited by 16 publications (13 citation statements)
References: 41 publications (64 reference statements)

Citation statements
“…For this reason, testing “extrapolative” splits has become popular in these yield prediction tasks to gauge the value of different molecular or reaction representations. 158,159 An important caveat of these studies is that data from HTE is qualitatively different from data that is typically published. In particular, a single paper might include only a dozen substrates; combining datasets from multiple papers describing the same reaction type will lead to confounding variables like the precise choice of conditions.…”
Section: Reaction Development Goals (mentioning)
confidence: 99%
“…Rigorously splitting a dataset into training, validation and test sets is a crucial task that can be overlooked easily, and may lead to drastically wrong reported performances. 74,75 In the following, we showcase this pitfall by training a model of the QM9 target internal energy at temperatures T equal 0 K and 298 K. We treat the temperature as an input (in addition to the molecular graph), and train on the single property U (T ). The temperature is appended to the aggregated molecular embedding (after the message-passing neural network, before the feed-forward neural network).…”
Section: Test Set Contamination (mentioning)
confidence: 99%
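The quoted statement does not spell out how the contamination arises. One plausible mechanism, assumed here, is that each molecule contributes one row per temperature, so a row-level random split can place the same molecule in both training and test data. The sketch below measures that overlap on synthetic rows; the helper `shared_molecule_fraction`, the row layout, and the split sizes are all hypothetical.

```python
import random


def shared_molecule_fraction(rows, test_idx):
    """Fraction of test rows whose molecule also occurs in the training rows."""
    test_set = set(test_idx)
    train_molecules = {rows[i][0] for i in range(len(rows)) if i not in test_set}
    leaked = sum(1 for i in test_set if rows[i][0] in train_molecules)
    return leaked / len(test_set)


# Hypothetical rows: (molecule id, temperature in K), two rows per molecule.
rows = [(mol, temp) for mol in range(100) for temp in (0.0, 298.0)]

rng = random.Random(0)

# Row-level random split: the two temperature rows of one molecule
# can land on opposite sides of the split.
random_test = rng.sample(range(len(rows)), k=40)

# Molecule-level split: both rows of a held-out molecule go to the test set.
test_molecules = set(rng.sample(range(100), k=20))
grouped_test = [i for i, (mol, _) in enumerate(rows) if mol in test_molecules]

random_leak = shared_molecule_fraction(rows, random_test)
grouped_leak = shared_molecule_fraction(rows, grouped_test)
```

By construction the molecule-level split has no overlap, while the row-level split almost always has some. If the target at the two temperatures is strongly correlated, that overlap would let a model score well on the test set largely by recalling molecules it has already seen.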
“…A plasma protein binding module was built using the graph CNN approach. ADME and toxicity predictions also guide the required changes in the existing lead to develop a potential candidate drug (Rakhimbekova et al, 2021; Tuntland et al, 2014). Knowledge about toxic substructure and chemical entities, offsite recognition, drug metabolites interaction, and drug‐drug interaction can be used to develop a holistic model for toxicity predictions.…”
Section: Approaches In Drug Discovery (mentioning)
confidence: 99%