2021
DOI: 10.48550/arxiv.2104.00673
Preprint

Cross-validation: what does it estimate and how well does it do it?

Abstract: Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occ…
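The abstract's central claim can be illustrated with a small simulation. The sketch below is not from the paper; the data-generating setup, the constants, and the helper names (draw, ols_fit, cv_estimate, err_xy) are illustrative assumptions. It compares the K-fold CV estimate against Err_XY (the prediction error of the particular OLS fit) and Err (the average error over independently drawn training sets): the CV estimates average out close to Err, while their per-dataset correlation with Err_XY is weak in this setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 1.0          # training size, dimension, noise level (assumed)
beta = rng.normal(size=p)

def draw(m):
    """Draw a dataset of size m from the assumed linear model."""
    X = rng.normal(size=(m, p))
    y = X @ beta + sigma * rng.normal(size=m)
    return X, y

def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def err_xy(bhat, n_test=20_000):
    """Err_XY: prediction error of this particular fit, estimated on fresh data."""
    Xt, yt = draw(n_test)
    return np.mean((yt - Xt @ bhat) ** 2)

def cv_estimate(X, y, K=10):
    """Standard K-fold cross-validation estimate of squared prediction error."""
    folds = np.array_split(rng.permutation(len(y)), K)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        bhat = ols_fit(X[train], y[train])
        errs.append(np.mean((y[fold] - X[fold] @ bhat) ** 2))
    return np.mean(errs)

cv_vals, errxy_vals = [], []
for _ in range(200):                      # 200 independent training sets
    X, y = draw(n)
    cv_vals.append(cv_estimate(X, y))
    errxy_vals.append(err_xy(ols_fit(X, y)))

print("mean CV estimate:", np.mean(cv_vals))            # tracks Err ...
print("Err (avg over training sets):", np.mean(errxy_vals))
print("corr(CV, Err_XY):", np.corrcoef(cv_vals, errxy_vals)[0, 1])  # ... but weakly tied to Err_XY
```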

Cited by 27 publications (35 citation statements); references 44 publications.
“…For each emulated trial, we randomly partitioned the data into mutually exclusive training, validation and testing subsets as standard practice. All PS calculation models were trained on the same training set, and the best-estimated model was selected following three strategies: (a) goodness-of-fit performance on the (out-of-sample) validation set, quantified by the area under the receiver operating characteristic curve (AUC) score 8,[14][15][16] ; (b) goodness-of-balance performance on the validation set, quantified by the maximum value of SMD scores over all baseline covariates after IPTW 7 ; and (c) our proposed strategy, which leverages goodness-of-balance on the combined training and validation set, and goodness-of-fit on the validation set (Method Algorithm 1). We evaluated the performance of selected models from two aspects: (i) the goodness-of-balance, which measures how similar the baseline covariates of different exposure groups are after IPTW on the whole data, and (ii) the goodness-of-fit, which measures how well the learned PS model predicts on the unseen test data (see Method Algorithm 2).…”
Section: Results
Citation type: mentioning
confidence: 99%
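The two quoted selection criteria, AUC on the validation set and the maximum standardized mean difference (SMD) after IPTW, can be sketched as follows. This is an illustrative sketch, not the cited study's code; the simulated covariates, the candidate logistic models, and the helper max_smd_after_iptw are assumptions, and the combined-set strategy (c) is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 2000, 6
X = rng.normal(size=(n, p))                                   # assumed baseline covariates
prob_treat = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))     # assumed true propensity
treat = rng.binomial(1, prob_treat)

# Mutually exclusive training / validation split (a held-out test split would follow the same pattern).
X_tr, X_va, t_tr, t_va = train_test_split(X, treat, test_size=0.3, random_state=0)

def max_smd_after_iptw(X, treat, ps):
    """Maximum standardized mean difference over covariates after IPTW (hypothetical helper)."""
    w = np.where(treat == 1, 1.0 / ps, 1.0 / (1.0 - ps))       # inverse-probability weights
    smds = []
    for j in range(X.shape[1]):
        x1, w1 = X[treat == 1, j], w[treat == 1]
        x0, w0 = X[treat == 0, j], w[treat == 0]
        m1, m0 = np.average(x1, weights=w1), np.average(x0, weights=w0)
        v1 = np.average((x1 - m1) ** 2, weights=w1)
        v0 = np.average((x0 - m0) ** 2, weights=w0)
        smds.append(abs(m1 - m0) / np.sqrt((v1 + v0) / 2))
    return max(smds)

# Candidate propensity-score models, all trained on the same training set.
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, t_tr)
    ps_va = model.predict_proba(X_va)[:, 1]
    auc = roc_auc_score(t_va, ps_va)               # strategy (a): goodness-of-fit on validation set
    smd = max_smd_after_iptw(X_va, t_va, ps_va)    # strategy (b): goodness-of-balance on validation set
    print(f"C={C}: validation AUC={auc:.3f}, max SMD after IPTW={smd:.3f}")
```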
“…The most informative prediction approach is probabilistic, aiming at specifying the full conditional distribution F_{Y|X}. Often, one is content with point predictions, 4 modelling only a certain property or summary measure of the conditional distribution. Strictly speaking, such a summary measure is a statistical functional, mapping a distribution to a real number, such as the mean or a quantile.…”
Section: Theory
Citation type: mentioning
confidence: 99%
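As a small illustration of summary measures as statistical functionals (not taken from the cited paper; the conditional distribution below is a made-up Gaussian example), the mean and quantiles of F_{Y|X=x} are each maps from a distribution to a single number:

```python
import numpy as np

rng = np.random.default_rng(2)

# Draws approximating the conditional distribution F_{Y|X=x} at a fixed x
# (an assumed Gaussian example, not from the cited paper).
x = 1.5
y_given_x = 2.0 * x + rng.normal(scale=1.0, size=100_000)

# Each point prediction below is a statistical functional: a map from F_{Y|X=x} to one number.
mean_pred = np.mean(y_given_x)              # optimal under squared-error loss
median_pred = np.quantile(y_given_x, 0.5)   # optimal under absolute-error loss
q90_pred = np.quantile(y_given_x, 0.9)      # optimal under the 0.9 pinball (quantile) loss

print(mean_pred, median_pred, q90_pred)
```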
“…In resonance with conditionality for parameter inference problems, new findings have recently been discovered regarding the impact of conditioning for prediction problems under a regression setup; see, e.g., Rosset and Tibshirani [2020] and Bates et al. [2021], where new tools are also developed for better predictive inference under conditioning. In this section, we study the evaluation of coverage probabilities under different levels of conditioning, and derive relationships between them.…”
Section: The Conditionality Problem
Citation type: mentioning
confidence: 99%
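The different levels of conditioning can be made concrete with a short simulation under assumptions of our own (a known-variance Gaussian model and an exact prediction interval, not the cited paper's construction): marginal coverage averages over both the training data and the new observation, while training-conditional coverage fixes the training data and averages only over new observations.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu, n, alpha = 0.0, 10, 0.10
z = norm.ppf(1 - alpha / 2)
halfwidth = z * np.sqrt(1 + 1 / n)      # exact 90% interval for a new Y ~ N(mu, 1)

cond_cov = []
for _ in range(5000):                                   # 5000 simulated training sets
    ybar = rng.normal(mu, 1 / np.sqrt(n))               # the "fitted model" for this training set
    lo, hi = ybar - halfwidth, ybar + halfwidth
    # Coverage conditional on this training set: average only over new observations.
    cond_cov.append(norm.cdf(hi - mu) - norm.cdf(lo - mu))

cond_cov = np.array(cond_cov)
print("marginal coverage (average over training sets):", cond_cov.mean())   # ~0.90 by construction
print("spread of training-conditional coverage:", cond_cov.std())           # nonzero: conditioning matters
```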
“…As setups (A.1) and (A.2) are coupled with P_1 and P_2 respectively, we see that treating the data as random has validity under the big-data regime. On the other hand, when n is finite, P_3 can be very different from the other two probability measures; this phenomenon was recently put forth in Bates et al. [2021]. Contrary to the big-data regime, the finite-n case is a limited-data regime where the data should be regarded as fixed observations, i.e., setup (A.3); this suggests the use of P_3 as being more appropriate for coverage evaluation with limited data.…”
Section: The Data {(x_i, y_i)}_{i=1}^n
Citation type: mentioning
confidence: 99%
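The contrast between the big-data and limited-data regimes can be seen by repeating the same calculation for increasing n (again an illustrative sketch under the same assumed Gaussian model, not the measures P_1, P_2, P_3 of the cited paper): for small n the training-conditional coverage scatters widely around the nominal level, so the fixed-data view matters, while for large n the conditional and marginal assessments nearly coincide.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu, alpha = 0.0, 0.10
z = norm.ppf(1 - alpha / 2)

for n in (5, 50, 500, 5000):
    halfwidth = z * np.sqrt(1 + 1 / n)
    ybar = rng.normal(mu, 1 / np.sqrt(n), size=5000)    # one fit per simulated training set
    cond = norm.cdf(ybar + halfwidth - mu) - norm.cdf(ybar - halfwidth - mu)
    # The mean stays at the nominal level; the spread shrinks as n grows,
    # i.e. the fixed-data and random-data views agree only in the big-data regime.
    print(f"n={n}: conditional coverage mean={cond.mean():.3f}, sd={cond.std():.4f}")
```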