2023
DOI: 10.26434/chemrxiv-2023-00vcg-v2
Preprint

Characterizing Uncertainty in Machine Learning for Chemistry

Abstract: Characterizing uncertainty in machine learning models has recently gained interest in the context of machine learning reliability, robustness, safety, and active learning. Here, we separate the total uncertainty into contributions from noise in the data (aleatoric) and shortcomings of the model (epistemic), further dividing epistemic uncertainty into model bias and variance contributions. We systematically address the influence of noise, model bias, and model variance in the context of chemical property prediction…
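The decomposition described in the abstract can be made concrete with a minimal sketch. Assuming an ensemble of M models that each predict a mean and a noise variance per sample (a common setup, not necessarily the paper's exact method), the aleatoric term is the average predicted noise variance and the epistemic variance term is the disagreement among ensemble means; the bias term additionally requires labeled reference data. All names below are illustrative.

```python
import numpy as np

def decompose_uncertainty(means, noise_vars, y_true=None):
    """Split predictive uncertainty for an ensemble of M models on N samples.

    means:      (M, N) per-model mean predictions
    noise_vars: (M, N) per-model predicted noise variances
    y_true:     optional (N,) labels, needed only for the bias term
    """
    # Aleatoric: noise in the data, estimated as the average predicted variance
    aleatoric = noise_vars.mean(axis=0)
    # Epistemic (variance contribution): spread of the ensemble means
    variance = means.var(axis=0)
    out = {"aleatoric": aleatoric, "variance": variance}
    if y_true is not None:
        # Epistemic (bias contribution): squared error of the mean prediction
        out["bias_sq"] = (means.mean(axis=0) - y_true) ** 2
    return out
```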

Cited by 17 publications (23 citation statements)
References 62 publications
“…For candidates with an experimental plan, their dye-like properties were predicted by using a set of Chemprop (14, 15) property prediction models: wavelength of maximum absorption, partition coefficient, and photooxidative degradation rate. Chemprop is lightweight and fast, so an ensemble of models can be automatically retrained from the whole dataset with each batch of experimental data (supplementary materials, SM4) (14-16, 41). The ensemble variance is used as a proxy for model uncertainty to further inform molecule selection.…”
Section: Front-end Predictions (mentioning)
confidence: 99%
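The ensemble-variance proxy described in this statement can be illustrated with a short sketch. The arrays and batch size below are stand-ins (this is not Chemprop's actual API): variance across retrained ensemble members scores each candidate, and the most uncertain candidates are flagged for the next experimental batch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for predictions from an ensemble of 5 retrained models
# on 100 candidate molecules (e.g., absorption maxima in nm).
ensemble_preds = rng.normal(loc=450.0, scale=5.0, size=(5, 100))

# Ensemble variance across members serves as the uncertainty proxy.
uncertainty = ensemble_preds.var(axis=0)

# Select the most uncertain candidates to inform the next experimental
# batch (an active-learning heuristic; the batch size is illustrative).
next_batch = np.argsort(uncertainty)[-10:]
```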
“…This work focuses on organic small molecules that must manifest multiple distinct properties simultaneously, constraining molecular design and increasing structural complexity. Emerging predictive tools can quickly generate new candidate molecules (8-13), predict the performance of candidates (11, 14-16), and propose practical reaction pathways (17-21); meanwhile, chemical automation can now dependably conduct experiments with minimal human intervention after an initial setup phase. Integrating generative algorithms, computer-aided synthesis planning (CASP), iteratively updated large datasets, and automated chemical synthesis, purification, and characterization for each step of the design-make-test-analyze (DMTA) cycle into a single workflow could improve experiment efficiency and ultimately enable autonomous chemical discovery.…”
(mentioning)
confidence: 99%
“…When creating training, validation, and testing sets, we use five folds, each with an 85:5:10 split. Our results report the mean and standard deviation across folds to give some sense of model uncertainty; future work could more rigorously examine uncertainty estimation [159-168], though it is sometimes unclear which method is best [169]. We use both random splitting and scaffold splitting on the reactant SMILES, the latter partitioning the data by Bemis-Murcko scaffold [70] as calculated by RDKit [170].…”
Section: Dataset (mentioning)
confidence: 99%
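As a hedged illustration of the scaffold-splitting step described in this statement (not the citing authors' exact script), the sketch below groups SMILES by Bemis-Murcko scaffold with RDKit and assigns whole scaffold groups to an approximate 85:5:10 split. The function and variable names are illustrative; keeping each scaffold group intact is what distinguishes this from random splitting.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.85, frac_val=0.05):
    """Assign whole Bemis-Murcko scaffold groups to train/val/test indices."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        # Canonical scaffold SMILES; molecules sharing a scaffold stay together
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(idx)

    # Place larger scaffold groups first so the greedy fill is more balanced
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```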
“…62 Other scripts necessary to train the models analyzed in this work and recreate the results are provided through GitHub (https://github.com/cjmcgill/characterizing_uncertainty_scripts). 63…”
Section: Introduction (mentioning)
confidence: 99%