2021
DOI: 10.48550/arxiv.2108.02497
Preprint

How to avoid machine learning pitfalls: a guide for academic researchers

Abstract: This document gives a concise outline of some of the common mistakes that occur when using machine learning techniques, and what can be done to avoid them. It is intended primarily as a guide for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate mode…

Cited by 26 publications (48 citation statements)
References 19 publications

“…Notably, this work did generate commentary about the use of different feature sets, whether chemical-based (as in Dreher and Doyle’s approach) or random-valued (as discussed by Chuang and Keiser). While the chemical insights and out-of-sample prediction accuracy of the initial random forest model validate the use of chemical-based descriptors in this case, incorporating control procedures and best practices into ML data analysis is crucial for those looking to use this powerful technique across the chemical sciences …”
Section: Datasets From High-throughput Experimentation
Citation type: mentioning
confidence: 99%
“…Some of the key points from an excellent review of the important factors that must be adhered to for proper ML studies 17 are highlighted here and in Table 1. The ML practitioner is encouraged to read the full review.…”
Section: Common ML Pitfalls: Dos and Don'ts
Citation type: mentioning
confidence: 99%
“…The ML practitioner is encouraged to read the full review. 17 Additional points generally relevant to ML are included as well.…”
Section: Common ML Pitfalls: Dos and Don'ts
Citation type: mentioning
confidence: 99%
“…Accuracy is the ratio of correct prediction to the overall cases. This metric is most suitable when the classes are balanced (Lones 2021). For instance, note that in the first unbalanced example in Table 5, The researchers should choose the metrics based on their research goal and the problem (Yao and Shepperd 2020).…”
Section: S7 Report Metrics
Citation type: mentioning
confidence: 99%
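
The point in the last statement, that accuracy is most suitable when classes are balanced, can be illustrated with a minimal sketch. The data below are made up for illustration (95 negatives, 5 positives, and a trivial model that always predicts the majority class), and the helper names `accuracy` and `balanced_accuracy` are hypothetical and not taken from any of the cited papers. Balanced accuracy, meaning the mean of per-class recall, is used here as one possible alternative metric; the cited works do not prescribe it.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the overall number of cases."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative (made-up) unbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Under these assumptions, plain accuracy looks strong even though the model never identifies a positive case, while the per-class view exposes the failure. Which metric is appropriate still depends on the research goal and the problem.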