2020
DOI: 10.1021/acscombsci.0c00118

Cautionary Guidelines for Machine Learning Studies with Combinatorial Datasets

Abstract: Regression modeling is becoming increasingly prevalent in organic chemistry as a tool for reaction outcome prediction and mechanistic interrogation. Frequently, to acquire the requisite amount of data for such studies, researchers employ combinatorial datasets to maximize the number of data points while limiting the number of discrete chemical entities required. An often-overlooked problem in modeling studies using combinatorial datasets is the tendency to fit on patterns in the datasets (i.e., the presence or…

Cited by 38 publications (37 citation statements)
References 15 publications
“…Such an estimate is therefore an optimistic estimate of how the model would perform on unseen (out-of-sample) molecules. If the goal is to predict the yield for an unseen catalyst or to select the highest-yielding set of reaction conditions for a new substrate, a more use-inspired test of generalization is valuable. Thus, building on a valuable exchange with Chuang and Keiser, we have turned to a different approach to estimate generalization error with leave-one-molecule-out cross-validation, and advocate that the community do so as well because it is more representative of a synthetic chemist's use.…”
Section: Model Selection
confidence: 99%
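The leave-one-molecule-out cross-validation advocated in this excerpt can be sketched with scikit-learn's LeaveOneGroupOut, grouping reactions by a shared molecule (here, the catalyst). This is an illustrative sketch on synthetic data, not the authors' code; all names, sizes, and values are placeholders.

```python
# Leave-one-molecule-out CV: every fold withholds ALL reactions that use
# one catalyst, so the error estimate reflects truly unseen molecules.
# Synthetic placeholder data -- not from the cited study.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 5))                # hypothetical reaction descriptors
y = rng.uniform(0, 100, size=24)            # hypothetical yields
catalyst_ids = np.repeat(np.arange(6), 4)   # 6 catalysts x 4 reactions each

logo = LeaveOneGroupOut()
errors = []
for train_idx, test_idx in logo.split(X, y, groups=catalyst_ids):
    # Reactions using the held-out catalyst never appear in training.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(np.mean(np.abs(pred - y[test_idx])))

print(f"leave-one-molecule-out MAE over {len(errors)} folds: "
      f"{np.mean(errors):.1f}")
```

With six catalysts the procedure yields six folds, one per withheld molecule, in contrast to a random k-fold split where each catalyst would appear on both sides of the partition.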
“…To validate that the model performance using ASO (which uses multiple conformers) is superior to those developed from Steric Indicator Field (SIF) or Molecular Interaction Field (MIF) 57 descriptors, models were trained and cross-validated on the same data set. The primary goal of QSSR models is to make predictions for novel examples; therefore, the reliability of predictions for novel products must be assessed when training models. This assessment is best done by using out-of-sample predictions in test sets.…”
Section: Benchmarking and Validation of Descriptors
confidence: 99%
“…To validate that the model performance using ASO (which uses multiple conformers) is superior to those developed from Steric Indicator Field (SIF) or Molecular Interaction Field (MIF) descriptors (which use a single conformer), models were trained and cross-validated using 384 examples of the same data set. An external test set of 691 reactions, which contained out-of-sample predictions, was used for further validation. The primary goal of QSSR models is to make predictions for novel examples; therefore, the reliability of predictions for novel products must be assessed when training models. This assessment is best done by using out-of-sample predictions in test sets.…”
Section: Fully Chemoinformatic-Guided Workflow
confidence: 99%
“…the presence or absence of molecules) which can lead to large variations in the performance of a model depending on the train-test split of the data. 51 By splitting the data randomly, the reaction components in the test reactions will also be present in different training reactions. This type of in-sample test, where descriptors of molecules in the test reactions are already observed in training, can result in an unreliable representation of model generalisability.…”
Section: Introduction
confidence: 99%
“…52 A more appropriate assessment of model generalisability is to test models with unseen molecules not present in training: an out-of-sample test. 51 A set of reactions containing specific molecules (one or more reaction components) is withheld from model training and used to assess the predictive ability of the trained model. It is important to ensure models are trained on reactions that cover a broad range of chemical space and observed variables.…”
Section: Introduction
confidence: 99%
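One way to implement the out-of-sample test described in this excerpt is a group-aware split that withholds every reaction containing the held-out molecules; scikit-learn's GroupShuffleSplit does this when reactions are grouped by a chosen component. A sketch on assumed, illustrative data (reaction counts and group labels are placeholders):

```python
# Group-aware split: whole substrates are withheld, so no molecule in the
# test reactions was observed during training. Illustrative data only.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

reaction_idx = np.arange(30)                 # 30 combinatorial reactions
substrate_ids = np.repeat(np.arange(6), 5)   # 6 substrates x 5 conditions

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(reaction_idx, groups=substrate_ids))

shared = set(substrate_ids[train_idx]) & set(substrate_ids[test_idx])
print(f"substrates shared between train and test: {len(shared)}")  # 0 by construction
```

The held-out substrates are absent from training by construction, which matches the excerpt's recommendation; the remaining caveat, also raised above, is that the training reactions must still span a broad enough region of chemical space for the withheld molecules to be a fair test.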