2019
DOI: 10.26434/chemrxiv.8796947
Preprint

Thinking Globally, Acting Locally: On the Issue of Training Set Imbalance and the Case for Local Machine Learning Models in Chemistry

Abstract: The appropriate sampling of training data out of a potentially imbalanced data set is of critical importance for the development of robust and accurate machine learning models. A challenge that underpins this task is the partitioning of the data into groups of similar instances, and the analysis of the group populations. In molecular data sets, different groups of molecules may be hard to identify. However, if the distribution of a given data set is ignored then some o…
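The partitioning step the abstract describes can be illustrated with a minimal sketch: cluster a feature matrix standing in for molecular descriptors, then count the group populations to reveal an underrepresented group. The feature vectors, the tiny k-means implementation, and the cluster count are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def kmeans(X, k, init_idx, iters=20):
    """Minimal k-means; init_idx fixes the initial centers for reproducibility."""
    centers = X[init_idx].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
# Toy stand-in for molecular descriptors: two well-sampled groups, one sparse.
X = np.vstack([
    rng.normal(0.0, 0.3, (80, 4)),
    rng.normal(4.0, 0.3, (80, 4)),
    rng.normal(8.0, 0.3, (8, 4)),   # underrepresented group of molecules
])

labels = kmeans(X, k=3, init_idx=[0, 80, 160])
sizes = sorted(np.bincount(labels, minlength=3).tolist())
print(sizes)  # the smallest group is easy to miss with a naive random split
```

Inspecting the group sizes before splitting is what flags the imbalance; a uniform random train/test split would underrepresent the small group even further.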

Cited by 8 publications (8 citation statements); references 3 publications.
“…Chemical research is no longer an exception in this development [1], and numerous areas have been identified, in which ML is now employed to great effect (see, e.g., Refs. [2][3][4][5][6][7]). While ML applications have resulted in a number of exciting and valuable studies that have advanced chemical domain knowledge, it is worth noting that there is still a considerable lack of quality control, guidance, uniformity, and established protocols for the successful conduct of such studies.…”
Section: Assessing Machine Learning Models (citation type: mentioning)
confidence: 99%
“…For instance, we analyze the MAE of the predictions at the tail of the RI histogram, i.e., the desired remoter areas in the RI distribution. If the prediction error in those regions of the molecular candidates is worse than the average, we fine-tune predictive models so that they are able to capture the essence of those desired underrepresented class of molecules [50]. To perform fine-tuning (FT), we carefully retrain the best model on the data points that are close to the tail of the RI distribution, i.e., those that most probably deviate from their predicted values more than the overall MAE.…”
Section: E. Extrapolation to 15 Million Molecules (citation type: mentioning)
confidence: 99%
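The fine-tuning selection described in the statement above can be sketched as follows: pick training points in the tail of the target distribution whose current prediction error exceeds the overall MAE, then continue training on that subset. The skewed toy targets, the noisy toy predictions, and the 90th-percentile tail cutoff are assumptions for illustration, not the cited study's actual RI data or model.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=1.0, sigma=0.5, size=1000)   # skewed target, e.g. an RI histogram
y_pred = y + rng.normal(0.0, 0.2 * y, size=1000)    # toy predictions from a "best model"

errors = np.abs(y - y_pred)
overall_mae = errors.mean()
tail_cut = np.quantile(y, 0.9)                      # assumed tail threshold

# Fine-tuning subset: in the tail AND predicted worse than the average error.
ft_mask = (y >= tail_cut) & (errors > overall_mae)
print(int(ft_mask.sum()), "molecules selected for fine-tuning")
# model.fit(X[ft_mask], y[ft_mask])  # continue training from the best weights
```

The two-condition mask mirrors the quoted procedure: tail membership identifies the underrepresented class of molecules, and the error filter restricts retraining to the points that actually deviate more than the overall MAE.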
“…[4,5] Existing efforts implicitly assume ideal conditions during both training and testing, yet these conditions are seldom met, since a large amount of relevant historical data is rarely available and the test data is typically and systematically different from the training data, either through noise [6] or other changes in the distribution [7]. This is an inherent challenge in applying DL to scientific domains whose aim is to find something "new" and "different" that outperforms existing materials. It is well known that DL models are highly susceptible to such distributional shifts, which often leads to unintended and potentially harmful behavior (i.e., overconfidence in predictions), especially when trained with an insufficient amount of data.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The majority of efforts in Material Informatics are currently devoted to training deep neural networks (DNNs) that can achieve high accuracy on holdout data sets from the training distribution. Existing efforts implicitly assume ideal conditions during both training and testing by assuming (a) access to a sufficiently large labeled training data set and (b) test data from the “same distribution” as the training set. Unfortunately, these conditions are seldom met in Material Discovery applications, since a large amount of relevant historical data is rarely available and the test data is typically and systematically different from the training data, either through noise or other changes in the distribution. This is an inherent challenge in applying DL to scientific domains whose aim is to find something “new” and “different” that outperforms existing materials.…”
Section: Introduction (citation type: mentioning)
confidence: 99%