Background
Large-scale international studies offer researchers a rich source of data to examine the relationship among variables. Machine learning embodies a range of flexible statistical procedures to identify key indicators of a response variable among a collection of hundreds or even thousands of potential predictor variables. Among these, penalized regression approaches, including least absolute selection and shrinkage operator (LASSO) and elastic net (Enet), have been advanced as useful tools capable of handling large number of predictors for variable selection for model generation. While the utility of penalized regression within educational research is emerging, less application of these machine learning methods, including random forest, to predictor variable selection in large-scale international data appears in the literature. In response, this study compared LASSO, Enet, and random forest for predictor variable selection, including the traditional forward stepwise (FS) regression approach, for students’ test anxiety or, more specifically, schoolwork-related anxiety based on PISA 2015 data.
Methods
Prediction of the three machine learning methods were compared for variable selection of 188 indicators of schoolwork-related anxiety. Data were based on US students (N = 5593) who participated in PISA 2015. With the exception of FS, LASSO, Enet, and random forest were iterated 100 times to consider the bias resulting from data-splitting to determine the selection or non-selection of each predictor. This resulted in the reporting of number of selected variables into the following five count categories: 1 or more, 25 or more, 50 or more, 75 or more, and all 100 iterations.
Results
LASSO and Enet both outperformed random forest but did not differ from one another in terms of prediction performance in 100 iterations of modeling. Correspondingly, LASSO was compared to FS in which, of the 188 predictors, 27 were identified as key indicators of schoolwork-related anxiety across 100 iterations, and 26 variables were also statistically significant with FS regression. Aligned with previous research, key indicators included personal, situational, and mathematics and reading achievement. Further, LASSO identified 28 variables (14.89%) statistically unrelated to schoolwork-related anxiety, which included indicators aligned to students’ academic- and non-academic behaviors.
Conclusions
LASSO and Enet outperformed random forest and yielded comparable results in which determinants of schoolwork-related anxiety included personal and environmental factors, including achievement goals, sense of belonging, and confidence to explain scientific phenomenon. LASSO and FS also identified similar predictor variables related, as well as unrelated, to schoolwork-related anxiety. Aligned with previous research, females reported higher schoolwork-related anxiety than males. Mathematics achievement was negatively related to anxiety, whereas reading performance was positively associated with anxiety. This study also bears significance as one of the first penalized regression studies to incorporate sampling weights and reflect the complex sampling schemes of large-scale educational assessment data.