Since 2011, there has been much discussion and concern about a "replication crisis" in psychology. An inability to reproduce findings in new samples can undermine even basic tenets of psychology. Much attention has been paid to the following practices, which Bishop (2019) described as "the four horsemen of the reproducibility apocalypse": publication bias, low statistical power, p-hacking (Simmons et al., 2011), and HARKing (i.e., hypothesizing after the results are known; Kerr, 1998). Another practice that has received less attention is overfitting of regression models. Babyak (2004) described overfitting as "capitalizing on the idiosyncratic characteristics of the sample at hand" and argued that it results in findings that "don't really exist in the population and hence will not replicate." The following common data-analytic practices increase the likelihood of model overfitting: having too few observations (or events) per explanatory variable (OPV/EPV); automated algorithmic selection of variables; univariable pretesting of candidate predictor variables; categorization of quantitative variables; and sequential testing of multiple confounders. We reviewed 170 recent articles from three major psychology journals and found that 96 of them included at least one of the types of regression models Babyak (2004) discussed. We reviewed those 96 articles more fully and found that they reported 286 regression models. Regarding OPV/EPV, Babyak recommended 10-15 as the minimum number needed to reduce the likelihood of overfitting. When we used the 10 OPV/EPV cutoff, 97 of the 286 models (33.9%) used at least one practice that leads to overfitting; when we used 15 OPV/EPV as the cutoff, that number rose to 109 models (38.1%). The most frequently occurring practice that yields overfitted models was univariable pretesting of candidate predictor variables: it was found in 61 of the 286 models (21.3%).
These findings suggest that overfitting of regression models remains a problem in psychology research, and that we must increase our efforts to educate researchers and students about this important issue.
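The low-OPV problem described above can be illustrated with a small simulation. The sketch below (a hypothetical example, not from the reviewed articles) fits an ordinary least squares regression with 15 predictors to only 30 observations, i.e., 2 observations per variable, well below Babyak's recommended minimum of 10-15. Because the outcome is pure noise, any apparent fit reflects capitalization on chance: the in-sample R² is substantial, while the model predicts a fresh sample poorly.

```python
import numpy as np

rng = np.random.default_rng(0)


def r_squared(y, y_hat):
    """Proportion of variance in y accounted for by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


n_train, n_test, p = 30, 1000, 15  # only 2 observations per explanatory variable

# The outcome is unrelated to every predictor, so the true population R^2 is 0.
X_train = rng.normal(size=(n_train, p))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, p))
y_test = rng.normal(size=n_test)

# Fit OLS (intercept + 15 slopes) by least squares.
design_train = np.column_stack([np.ones(n_train), X_train])
beta, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)

r2_train = r_squared(y_train, design_train @ beta)
r2_test = r_squared(y_test, np.column_stack([np.ones(n_test), X_test]) @ beta)

print(f"in-sample R^2:     {r2_train:.2f}")  # inflated despite a null true model
print(f"out-of-sample R^2: {r2_test:.2f}")   # near zero or negative: no replication
```

With so few observations per variable, the expected in-sample R² under the null is roughly p/(n-1) ≈ 0.5, which is why seemingly strong results from overfitted models fail to replicate in new samples.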