Metrics including Cohen's kappa, precision, recall, and F1 are common measures of performance for models of discrete student states, such as a student's affect or behaviour. This study examined discrete model metrics for previously published examples of student models to identify situations where metrics provided differing perspectives on model performance. Simulated models were also used to systematically show the effects of imbalanced class distributions in both data and predictions, in terms of the values of metrics and the chance levels (values obtained by making random predictions) for those metrics. A random chance level for F1 was also established and evaluated. Results for the example student models showed that overprediction of the class of interest (the positive class) was relatively common. Chance-level F1 was inflated by overprediction; conversely, the maximum possible values of F1 and kappa were reduced by overprediction of the positive class. Additionally, normalization methods for F1 relative to chance are discussed and compared to kappa, demonstrating an equivalence between kappa and normalized F1. Finally, implications of the results for the choice of metrics are discussed in the context of common student modelling goals, such as avoiding false negatives for student states that are negatively related to learning.
Notes for Practice
• Previous research has shown that the choice of metric plays a key role in the training and evaluation of student models, focusing primarily on metrics intended for models that produce probabilistic predictions of student outcome variables.
• Imbalances in labelled data are quite common in student modelling tasks and have been shown to impact the metrics used for machine-learned student models.
• This paper explores the impact that predicted class proportions and data class proportions have on discrete model metrics, including Cohen's kappa, precision, recall, and F1, and formulates a random-chance F1 measurement that is adjusted for these imbalances.
• Results on real-world student models and simulated models show that best practices include reporting multiple metrics for discrete student models and comparing F1 scores to the appropriate chance level to avoid over- or under-estimating model performance (illustrated in the sketch below).
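As a rough illustration of these ideas, the following Python sketch simulates a random, label-independent classifier on an imbalanced data set and compares its kappa, precision, recall, and F1 against an approximate chance-level F1 computed as the harmonic mean of the base rate and the predicted-positive rate. The `base_rate` and `predicted_rate` values, and this particular chance-F1 simplification, are assumptions chosen for illustration rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, precision_score, recall_score

rng = np.random.default_rng(0)

base_rate = 0.10       # proportion of positive labels in the data (assumed value)
predicted_rate = 0.30  # proportion of positive predictions, i.e. overprediction (assumed value)
n = 100_000

# Labels and predictions are drawn independently, so the "model" is pure chance.
y_true = (rng.random(n) < base_rate).astype(int)
y_pred = (rng.random(n) < predicted_rate).astype(int)

print("kappa     :", cohen_kappa_score(y_true, y_pred))
print("precision :", precision_score(y_true, y_pred))
print("recall    :", recall_score(y_true, y_pred))
print("F1        :", f1_score(y_true, y_pred))

# With label-independent predictions, expected precision is roughly base_rate and
# expected recall is roughly predicted_rate, so one simple chance-level F1 is their
# harmonic mean (an assumed simplification, not necessarily the paper's formula).
chance_f1 = 2 * base_rate * predicted_rate / (base_rate + predicted_rate)
print("chance F1 :", chance_f1)
```

Under these assumptions, kappa stays near zero for random predictions, while the chance-level F1 rises as the positive class is increasingly overpredicted, which is why an F1 score is best interpreted relative to the appropriate chance level.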