Strong calibration is desirable for individualized decision support but unrealistic, and pursuing it is counterproductive because it stimulates the development of overly complex models. Model development and external validation should therefore focus on moderate calibration.
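To make the notion of moderate calibration concrete, the sketch below checks it via logistic recalibration (Cox's framework): regress the outcome on the logit of the predicted risks, where an intercept near 0 and a slope near 1 indicate that predictions are neither systematically biased nor too extreme. This is a minimal illustration on synthetic data; the variable names are hypothetical, not from the paper.

```python
# Minimal sketch of checking moderate calibration via logistic recalibration.
# All data here are synthetic; names are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
true_lp = rng.normal(0, 1.5, n)                   # true linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-true_lp)))   # binary outcome

# Hypothetical predictions that are slightly too extreme (overfitted model)
pred_risk = 1 / (1 + np.exp(-1.4 * true_lp))
logit_pred = np.log(pred_risk / (1 - pred_risk))

fit = sm.Logit(y, sm.add_constant(logit_pred)).fit(disp=0)
intercept, slope = fit.params
print(f"calibration intercept = {intercept:.2f}, slope = {slope:.2f}")
# A slope below 1 flags predictions that are too extreme.
```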
When developing risk prediction models on datasets with limited sample size, shrinkage methods are recommended. Earlier studies showed that shrinkage results in better predictive performance on average. This simulation study aimed to investigate the variability of the effect of regression shrinkage on predictive performance for a binary outcome. We compared standard maximum likelihood with the following shrinkage methods: uniform shrinkage (likelihood-based and bootstrap-based), penalized maximum likelihood (ridge), LASSO logistic regression, adaptive LASSO, and Firth's correction. In the simulation study, we varied the number of predictors and their strength, the correlation between predictors, the event rate of the outcome, and the number of events per variable. We focused on the calibration slope as the performance measure: the slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). The results can be summarized into three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood; in contrast to the other shrinkage approaches, Firth's correction had a small shrinkage effect but showed low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage needed to remove overfitting was typically negative, with Firth's correction as the exception. We conclude that, despite improved performance on average, shrinkage often worked poorly in individual datasets, in particular when it was needed most. The results imply that shrinkage methods do not solve the problems associated with small sample size or a low number of events per variable.
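The following sketch illustrates the kind of comparison described above, without reproducing the study's exact simulation settings: maximum likelihood and ridge logistic regression are fitted on repeated small training samples, and the calibration slope of each fit is estimated on one large validation sample. The data are synthetic and the configuration (5 equal-strength predictors, n = 100) is an assumption for illustration only.

```python
# Sketch: between-sample variability of calibration slopes for maximum
# likelihood vs ridge logistic regression. Synthetic data, illustrative settings.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
p, beta = 5, np.full(5, 0.5)

def draw(n):
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
    return X, y

X_val, y_val = draw(50_000)  # large validation sample

def calibration_slope(model):
    lp = model.decision_function(X_val)  # linear predictor on validation data
    fit = sm.Logit(y_val, sm.add_constant(lp)).fit(disp=0)
    return fit.params[1]

slopes = {"ml": [], "ridge": []}
for _ in range(20):  # repeated small training samples
    X, y = draw(100)
    ml = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)  # sklearn >= 1.2
    ridge = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)      # L2 shrinkage
    slopes["ml"].append(calibration_slope(ml))
    slopes["ridge"].append(calibration_slope(ridge))

for name, s in slopes.items():
    print(f"{name}: mean slope = {np.mean(s):.2f}, SD = {np.std(s):.2f}")
```

Comparing the mean slope against its standard deviation across training samples is what reveals the paper's key point: average improvement can coexist with high between-sample variability.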
Background: The International Ovarian Tumour Analysis (IOTA) group has developed the ADNEX (Assessment of Different NEoplasias in the adneXa) model to predict the risk that an ovarian mass is benign, borderline, stage I, stages II–IV or metastatic. We aimed to externally validate the ADNEX model in the hands of examiners with varied training and experience. Methods: This was a multicentre cross-sectional cohort study of diagnostic accuracy. Patients were recruited from three cancer centres in Europe. Patients who underwent transvaginal ultrasonography and had a histological diagnosis of surgically removed tissue were included. The diagnostic performance of the ADNEX model with and without CA125 as a predictor was calculated. Results: Data from 610 women were analysed. The overall prevalence of malignancy was 30%. The area under the receiver operating characteristic curve (AUC) for discriminating between benign and malignant masses was 0.937 (95% CI: 0.915–0.954) when CA125 was included and 0.925 (95% CI: 0.902–0.943) when it was excluded. The calibration plots suggest good correspondence between the predicted risk of malignancy and the observed proportion of malignancies. The model also discriminated well between the different subtypes. Conclusions: The ADNEX model retains its performance on external validation in the hands of ultrasound examiners with varied training and experience.
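As an illustration of one step in such an external validation, the sketch below computes the AUC for benign vs malignant classification with a bootstrap percentile confidence interval. The data are synthetic stand-ins generated to mimic the reported sample size and prevalence; this is not the IOTA dataset or the published analysis code.

```python
# Sketch: AUC with a bootstrap 95% CI for an external validation.
# Synthetic data mimicking n = 610 and 30% prevalence; not the IOTA data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 610
y = rng.binomial(1, 0.30, n)                        # outcome: malignancy
risk = np.clip(rng.beta(2, 5, n) + 0.3 * y, 0, 1)   # synthetic predicted risks

auc = roc_auc_score(y, risk)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    if y[idx].min() == y[idx].max():                 # skip one-class resamples
        continue
    boot.append(roc_auc_score(y[idx], risk[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```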
Objectives: Receiver operating characteristic (ROC) curves show how well a risk prediction model discriminates between patients with and without a condition. We aimed to investigate how ROC curves are presented in the literature and to discuss and illustrate their potential limitations. Study Design and Setting: We conducted a pragmatic literature review of contemporary publications that externally validated clinical prediction models. We illustrated the limitations of ROC curves using a testicular cancer case study and simulated data. Results: Of 86 identified prediction modeling studies, 52 (60%) presented ROC curves without thresholds and one (1%) presented an ROC curve with only a few thresholds. We illustrate that ROC curves in their standard form withhold threshold information, can take an unstable shape even for the same area under the curve (AUC), and are problematic for comparing model performance at a given threshold. We compare ROC curves with classification plots, which show sensitivity and specificity conditional on risk thresholds. Conclusion: ROC curves do not offer more information than the AUC for indicating discriminative ability. To assess a model's performance for decision-making, results should be provided conditional on risk thresholds. If discriminative ability must be visualized, classification plots are therefore an attractive alternative.
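A classification plot of the kind advocated above is straightforward to produce: plot sensitivity and specificity against the risk threshold itself, so that performance can be read off at any clinically relevant cutoff. The sketch below uses synthetic data and is an illustration of the general idea, not the paper's own figure code.

```python
# Sketch: classification plot showing sensitivity and specificity as a
# function of the risk threshold. Synthetic data for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 1000
y = rng.binomial(1, 0.3, n)                          # binary outcome
risk = np.clip(rng.beta(2, 6, n) + 0.25 * y, 0, 1)   # synthetic predicted risks

thresholds = np.linspace(0.01, 0.99, 99)
sens = [(risk[y == 1] >= t).mean() for t in thresholds]
spec = [(risk[y == 0] < t).mean() for t in thresholds]

plt.plot(thresholds, sens, label="sensitivity")
plt.plot(thresholds, spec, label="specificity")
plt.xlabel("risk threshold")
plt.ylabel("proportion")
plt.legend()
plt.title("Classification plot (synthetic example)")
plt.show()
```

Unlike a standard ROC curve, the threshold is explicit on the x-axis, which is what makes the plot usable for threshold-conditional decisions.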