2021
DOI: 10.1038/s41598-021-87157-3

Inflated prediction accuracy of neuropsychiatric biomarkers caused by data leakage in feature selection

Abstract: In recent years, machine learning techniques have been frequently applied to uncovering neuropsychiatric biomarkers with the aim of accurately diagnosing neuropsychiatric diseases and predicting treatment prognosis. However, many studies did not perform cross validation (CV) when using machine learning techniques, or others performed CV in an incorrect manner, leading to significantly biased results due to overfitting problem. The aim of this study is to investigate the impact of CV on the prediction performan…
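
The bias the abstract describes can be reproduced on pure noise. The following is a minimal sketch, not the authors' pipeline: it assumes scikit-learn, synthetic Gaussian features with labels unrelated to them, a univariate F-test selector (`SelectKBest`), and logistic regression. All sizes and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): many noise features, few samples,
# and labels that carry no information about X.
rng = np.random.default_rng(0)
n_samples, n_features, k = 60, 5000, 20
X = rng.standard_normal((n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are selected on ALL samples before CV,
# so the selector has already seen the labels of every test fold.
X_leaky = SelectKBest(f_classif, k=k).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# Proper: the selector sits inside the pipeline, so it is refit
# on the training portion of each fold only.
pipe = make_pipeline(
    SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000)
)
proper_acc = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"selection outside CV: {leaky_acc:.2f}")
print(f"selection inside CV:  {proper_acc:.2f}")
```

Because the labels are unrelated to the features, an honest estimate should sit near chance (0.5). The leaky estimate is typically far higher, which is the inflation the title refers to.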

Cited by 26 publications (15 citation statements)
References 15 publications
“…We also used 5-fold cross-validation to prevent overfitting; AUC= 1.0 corresponds to perfect discrimination, and AUC= 0.5 corresponds to random discrimination. Here, the principal component analysis was conducted within cross validation (Shim et al, 2021) to avoid the inaccurate estimation of performance of discrimination. AUC values were averaged among 20 trials to choose tested and evaluated data set in 5-fold cross-validation and their standard deviations (SD) were also derived.…”
Section: Discussion (mentioning)
confidence: 99%
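
The procedure in this statement (PCA fitted inside each fold, 5-fold cross-validation repeated over 20 random partitions, AUC reported as mean and SD) might look like the sketch below. The citing paper's data, classifier, and number of components are not given here, so the synthetic dataset, the logistic regression, and `n_components=5` are placeholders, and computing the SD across the 20 trial means is one possible reading of the statement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): 200 samples, 50 features.
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# Scaling and PCA are pipeline steps, so both are refit on the
# training folds only and never see the held-out fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

# 5-fold CV repeated 20 times with different random partitions.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
fold_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

# Scores arrive repeat by repeat: average the 5 folds of each trial,
# then summarise across the 20 trials.
trial_auc = fold_auc.reshape(20, 5).mean(axis=1)
mean_auc, sd_auc = trial_auc.mean(), trial_auc.std(ddof=1)
print(f"AUC = {mean_auc:.3f} (SD {sd_auc:.3f})")
```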
“…We compare the brain-PAD scores of the controls in this hold-out validation set to the MDD patients in the test set. As the validation set is not involved in the development of the brain age prediction model, the risk of overfitting is effectively prevented [64]. The application of four different machine learning algorithms allows us to further validate the consistency of the patterns observed.…”
Section: Discussion (mentioning)
confidence: 99%
“…The organized total data comprised 1695 datasets which were used in pipeline for the 10-fold cross-validation of the four ML models considered in this study, i.e., the logistic regression model, support vector machine, random forest model, and multilayer perceptron, to generate 30-day hospital readmission predictions by identifying the nonlinear classifying characteristic relationships among different activity-based PA parameters ( Table 1 ) and actual hospital readmissions. We adopted 10-fold cross-validation (sometimes as blocked cross-validation for timeseries splits) to prevent any data leakage [ 32 , 33 , 34 ]. The ML models were trained by a combination of supervised and reinforced learning methods, and their performance was 10-fold cross-validated using the total dataset to acquire a final trained ML model with an averaged training score.…”
Section: Methods (mentioning)
confidence: 99%
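
A 10-fold comparison of the four model families named in this statement could be sketched as follows. This is not the citing study's code: the synthetic data, the hyperparameters, the standardisation step, and the AUC metric are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data (assumption); the study's records are not available here.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# Illustrative settings for the four model families.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, clf in models.items():
    # Preprocessing lives inside the pipeline, so the scaler is fit
    # on each training split only and cannot leak test-fold statistics.
    pipe = make_pipeline(StandardScaler(), clf)
    results[name] = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

for name, auc in results.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

For time-ordered records, an order-preserving splitter such as scikit-learn's `TimeSeriesSplit` could be passed as `cv` instead, so that each model is never evaluated on data that precedes its training data.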