Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value

Kumar, Vive; Boulanger, David

doi:10.3389/feduc.2020.572367

Cited by 66 publications

(32 citation statements)

References 28 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…They highlighted the need for using XAI in the educational field. In the context of automated essay scoring, the authors in [34] have studied the impact and trustworthiness of neural networks by means of the SHAP explanation framework [35]). Similar attempts have been made in the domains of computation thinking [36] and knowledge tracing [29].…”

Section: Literature Reviewmentioning

confidence: 99%

Predicting Student Academic Performance by Means of Associative Classification

et al. 2021

View full text Add to dashboard Cite

The Learning Analytics community has recently paid particular attention to early predict learners’ performance. An established approach entails training classification models from past learner-related data in order to predict the exam success rate of a student well before the end of the course. Early predictions allow teachers to put in place targeted actions, e.g., supporting at-risk students to avoid exam failures or course dropouts. Although several machine learning and data mining solutions have been proposed to learn accurate predictors from past data, the interpretability and explainability of the best performing models is often limited. Therefore, in most cases, the reasons behind classifiers’ decisions remain unclear. This paper proposes an Explainable Learning Analytics solution to analyze learner-generated data acquired by our technical university, which relies on a blended learning model. It adopts classification techniques to early predict the success rate of about 5000 students who were enrolled in the first year courses of our university. It proposes to apply associative classifiers at different time points and to explore the characteristics of the models that led to assign pass or fail success rates. Thanks to their inherent interpretability, associative models can be manually explored by domain experts with the twofold aim at validating classifier outcomes through local rule-based explanations and identifying at-risk/successful student profiles by interpreting the global rule-based model. The results of an in-depth empirical evaluation demonstrate that associative models (i) perform as good as the best performing classification models, and (ii) give relevant insights into the per-student success rate assignments.

show abstract

Section: Literature Reviewmentioning

confidence: 99%

Predicting Student Academic Performance by Means of Associative Classification

et al. 2021

View full text Add to dashboard Cite

show abstract

“…Implementing a function to check the quality of predicted scores is another future research direction because we are sometimes interested in knowing the reliability or confidence level of scores predicted by AES. Furthermore, taking advantage of the unique property of the proposed method, namely, its high interpretability in terms of rater biases, another future direction is to analyze how rater biases affect the behavior of AES models, for example, by using an explanation model, as in [83].…”

Section: Discussionmentioning

confidence: 99%

Learning Automated Essay Scoring Models Using Item-Response-Theory-Based Scores to Decrease Effects of Rater Biases

Uto

Okano

2021

IEEE Trans. Learning Technol.

View full text Add to dashboard Cite

In automated essay scoring (AES), scores are automatically assigned to essays as an alternative to grading by humans. Traditional AES typically relies on handcrafted features, whereas recent studies have proposed AES models based on deep neural networks to obviate the need for feature engineering. Those AES models generally require training on a large dataset of graded essays. However, assigned grades in such a training dataset are known to be biased owing to effects of rater characteristics when grading is conducted by assigning a few raters in a rater set to each essay. Performance of AES models drops when such biased data are used for model training. Researchers in the fields of educational and psychological measurement have recently proposed item response theory (IRT) models that can estimate essay scores while considering effects of rater biases. This study therefore proposes a new method that trains AES models using IRT-based scores for dealing with rater bias within training data.

show abstract

“…Explainability techniques have also been assessed in the context of autograders (Kumar & Boulanger 2020). Explanations can increase understanding of automatic grading decisions and provide justification of those decisions.…”

Section: Autogradingmentioning

confidence: 99%

Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers

Schneider

Richner

Riser³

2022

Int J Artif Intell Educ

View full text Add to dashboard Cite

Autograding short textual answers has become much more feasible due to the rise of NLP and the increased availability of question-answer pairs brought about by a shift to online education. Autograding performance is still inferior to human grading. The statistical and black-box nature of state-of-the-art machine learning models makes them untrustworthy, raising ethical concerns and limiting their practical utility. Furthermore, the evaluation of autograding is typically confined to small, monolingual datasets for a specific question type. This study uses a large dataset consisting of about 10 million question-answer pairs from multiple languages covering diverse fields such as math and language, and strong variation in question and answer syntax. We demonstrate the effectiveness of fine-tuning transformer models for autograding for such complex datasets. Our best hyperparameter-tuned model yields an accuracy of about 86.5%, comparable to the state-of-the-art models that are less general and more tuned to a specific type of question, subject, and language. More importantly, we address trust and ethical concerns. By involving humans in the autograding process, we show how to improve the accuracy of automatically graded answers, achieving accuracy equivalent to that of teaching assistants. We also show how teachers can effectively control the type of errors made by the system and how they can validate efficiently that the autograder’s performance on individual exams is close to the expected performance.

show abstract

Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value

Cited by 66 publications

References 28 publications

Predicting Student Academic Performance by Means of Associative Classification

Predicting Student Academic Performance by Means of Associative Classification

Learning Automated Essay Scoring Models Using Item-Response-Theory-Based Scores to Decrease Effects of Rater Biases

Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers

Contact Info

Product

Resources

About