2019
DOI: 10.26434/chemrxiv.9778670.v2
Preprint

Identifying Domains of Applicability of Machine Learning Models for Materials Science

Abstract: We present an extension to the usual machine learning process that allows for the identification of the domain of applicability of a fitted model, i.e., the region in its domain where it performs most accurately. This approach is applied to several vastly different but commonly used materials representations (namely the n-gram approach, SOAP, and the many-body tensor representation), which are practically indistinguishable based on performance using a single error statistic. Moreover, these models appear unsati…

citations
Cited by 7 publications
(7 citation statements)
references
References 32 publications
“…GPR is a kernel method in which structure-property relationships are inferred from similarities between molecules. For measuring such similarities, in GPR as in any ML approach in chemistry, it is important to choose a unique and information-conserving way of translating molecular structures into ML-accessible format (molecular representations / descriptors) 34,49,55,90–97.…”
Section: Introduction (mentioning)
confidence: 99%
“…Given that there are no corresponding features in the merged pharmacophore, it is acceptable to disregard potentially missing features during prediction since no information about their contribution to activity in that location is known. Therefore, including these features at inference would only increase noise and weaken the model's confidence in the prediction (23).…”
Section: Quantitative Pharmacophore Algorithm (mentioning)
confidence: 99%
“…It is difficult to generate data sets that properly reflect the search space, and as such data sets may be biased and result in an overly optimistic assessment of model efficacy. 32,33 A recent analysis 1 of several supervised learning approaches to materials stability 2–7 showed that, despite being able to learn DFT formation energies with reasonable accuracy, the methods struggled to reproduce DFT decomposition energies. The difference was attributed to the observation that DFT-computed energies benefit from a systematic cancellation of errors not present in ML, that helps DFT better distinguish relative formation energies.…”
Section: Introduction (mentioning)
confidence: 99%