2021
DOI: 10.1021/acs.jcim.1c00503

Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition

Abstract: Most machine learning applications in quantum-chemistry (QC) data sets rely on a single statistical error parameter such as the mean square error (MSE) to evaluate their performance. However, this approach has limitations or can even yield incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body func…
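The split of the MSE into bias and variance mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's protocol: the bootstrap resampling, the toy sine data, and the helper names `bias_variance_decomposition` and `linear_fit_predict` are all assumptions made for illustration. The idea is to retrain the same model on many resampled training sets and then separate the average squared error on a fixed test set into the squared deviation of the mean prediction (bias) and the spread of predictions around that mean (variance).

```python
import numpy as np


def bias_variance_decomposition(fit_predict, X_train, y_train, X_test, y_test,
                                n_rounds=200, seed=0):
    """Estimate expected MSE, squared bias and variance by bootstrap retraining.

    fit_predict(X_tr, y_tr, X_te) is a hypothetical callable that trains a
    model on (X_tr, y_tr) and returns predictions for X_te.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((n_rounds, len(X_test)))
    for r in range(n_rounds):
        # Resample the training set with replacement (assumed protocol).
        idx = rng.integers(0, n, size=n)
        preds[r] = fit_predict(X_train[idx], y_train[idx], X_test)

    mean_pred = preds.mean(axis=0)
    # Average squared error over all rounds and all test points.
    mse = np.mean((preds - y_test) ** 2)
    # Squared bias: error of the averaged prediction. Any noise in y_test
    # is folded into this term because the true function is not used.
    bias_sq = np.mean((mean_pred - y_test) ** 2)
    # Variance: spread of the predictions around their own mean.
    variance = np.mean(preds.var(axis=0))
    return mse, bias_sq, variance


def linear_fit_predict(X_tr, y_tr, X_te):
    """Ordinary least-squares line with intercept (toy model)."""
    A = np.column_stack([X_tr, np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([X_te, np.ones(len(X_te))]) @ coef


# Toy data (assumption): a noisy sine curve, which a straight line underfits.
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(300, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.1, size=300)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

mse, bias_sq, variance = bias_variance_decomposition(
    linear_fit_predict, X_train, y_train, X_test, y_test
)
print(f"MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

With this definition the identity MSE = bias² + variance holds exactly, so two models with the same MSE can still be told apart by which term dominates.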

Cited by 7 publications (11 citation statements)
References 45 publications

“…In our computational experiments, we explore the QM9 and MUTAG data sets. QM9 is a popular data set used as a benchmark in several ML algorithms for quantum chemistry applications. It comprises 134,000 organic molecules composed of carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and fluorine (F) with molecules containing at maximum nine heavy atoms. It also reports 15 molecular properties (e.g., geometric, energetic, electronic, and thermodynamic properties) computed with the density functional theory (DFT) B3LYP/6-31G(2df,p) framework.…”
Section: Experiments Configuration (mentioning)
confidence: 99%
“…One proposed explanation for this undermining can be related to the unavailability of docking poses in the training set [51]. The second important problem is considering better assessment of ML-based scoring functions through uncertainty quantification [52][53] and domains of applicability [54] because all newly proposed scoring functions are converging to the same performance, which makes them indistinguishable. So analyzing model test error, its uncertainty across test sets, we can spot sub-domains in which different models perform better than the others and practice this assessment for further model comparison.…”
Section: Discussion (mentioning)
confidence: 99%
“…Aside from the MAE, the following evaluation metrics were used to quantitatively evaluate the performance of the models: R-square, mean squared error (MSE) (Cesar de Azevedo et al, 2021), mean relative error (MRE) (%) (Zhu et al, 2021a), and ideal rate (IR) (%) (Guo et al, 2021). These indices were calculated as follows: where , , and are the predicted, measured values, and the mean values, respectively.…”
Section: Methods (mentioning)
confidence: 99%