2022
DOI: 10.26434/chemrxiv-2022-m8l33-v2
Preprint

Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets

Abstract: When preparing training, validation and test sets for machine learning on molecular datasets, it is desirable to combine two requirements: 1) robustness, i.e. making a test set that is chemically dissimilar from the training set; 2) data balance, i.e. ensuring that the proportion of data points and the distribution of data labels (categorical) / data values (continuous) are as homogeneous as possible among the sets, for each individual property to model, while partitioning the overall set of compounds as requi…

Citations: cited by 5 publications (15 citation statements)
References: 0 publications
“…However, on average the median R² is 0.4 lower and the median RMSE is 0.2 higher for the DGBC split than the random split showing that the models do not perform as well on this data split. These results are in line with expectations and previously published results [21,38] and show the importance to assess model performance with a realistic split.…”
Section: Importance Of Data Splitting (supporting)
confidence: 92%
“…Dissimilarity-driven balanced cluster (DGBC) split was made by using a method developed by Tricarico et al. [21] First the compounds in the dataset were clustered using sphere exclusion clustering on ECFP6 fingerprints with a Tanimoto distance of 0.736 between cluster centroids. [32] Fingerprint generation and sphere exclusion clustering were done using RDKit (version 2020.09.05).…”
Section: Data Set Creation (mentioning)
confidence: 99%
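
The clustering step described in the statement above can be illustrated with a minimal sketch. It is not the RDKit implementation used in the cited work, and it is not the full DGBC procedure. It is a plain-Python, order-dependent greedy variant of sphere exclusion, written under stated assumptions: each fingerprint is represented as a set of on-bit indices (a hypothetical stand-in for ECFP6 bit vectors), and the function names `tanimoto_distance` and `sphere_exclusion` are made up for illustration. Only the 0.736 threshold comes from the quoted text.

```python
from typing import Dict, List, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Tanimoto distance, 1 - |a & b| / |a | b|, between two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints are treated as identical (assumption).
        return 0.0
    return 1.0 - len(a & b) / union


def sphere_exclusion(
    fps: List[Set[int]], threshold: float = 0.736
) -> Dict[int, List[int]]:
    """Greedy sphere exclusion over fingerprints given as sets of on-bits.

    Pass 1: scan molecules in input order; a molecule becomes a centroid
    if it lies at least `threshold` away from every centroid picked so far.
    Pass 2: assign every molecule to its nearest centroid.
    Returns a mapping {centroid index: [member indices]}.
    """
    centroids: List[int] = []
    for i, fp in enumerate(fps):
        if all(tanimoto_distance(fp, fps[c]) >= threshold for c in centroids):
            centroids.append(i)

    clusters: Dict[int, List[int]] = {c: [] for c in centroids}
    for i, fp in enumerate(fps):
        nearest = min(centroids, key=lambda c: tanimoto_distance(fp, fps[c]))
        clusters[nearest].append(i)
    return clusters


# Toy fingerprints (hypothetical bit indices, not real ECFP6 output).
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 3, 4, 5}]
toy_clusters = sphere_exclusion(toy_fps)
print(toy_clusters)
```

In this toy input, molecules 0 and 2 share no bits, so both become centroids, and the remaining molecules join whichever centroid they overlap with. Because the result depends on input order, a production workflow would normally rely on an established implementation such as the one in RDKit rather than this sketch.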