Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications 2015
DOI: 10.3115/v1/w15-0625

Evaluating the performance of Automated Text Scoring systems

Abstract: Various measures have been used to evaluate the effectiveness of automated text scoring (ATS) systems with respect to a human gold standard. However, there is no systematic study comparing the efficacy of these metrics under different experimental conditions. In this paper we first argue that measures of agreement are more appropriate than measures of association (i.e., correlation) for measuring the effectiveness of ATS systems. We then present a thorough review and analysis of frequently used measures of agr…

Cited by 34 publications (26 citation statements)
References 29 publications
“…All models are trained on our training set (see Section 4), except the one prefixed 'word2vec pre-trained' which uses pre-trained embeddings on the Google News Corpus. We report the Spearman's rank correlation coefficient ρ, Pearson's product-moment correlation coefficient r, and the root mean square error (RMSE) between the predicted scores and the gold standard on our test set, which are considered more appropriate metrics for evaluating essay scoring systems (Yannakoudakis and Cummins, 2015). However, we also report Cohen's κ with quadratic weights, which was the evaluation metric used in the Kaggle competition.…”
Section: Results (mentioning)
confidence: 99%
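
As an illustration of the metrics named in the statement above, the following is a minimal sketch in plain Python of RMSE, Pearson's r, and Cohen's κ with quadratic weights. It is not code from either paper; the function names (`rmse`, `pearson_r`, `quadratic_weighted_kappa`), the integer score range, and the toy scores are assumptions made for the example.

```python
import math
from collections import Counter


def rmse(gold, pred):
    """Root mean square error between gold and predicted scores."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))


def pearson_r(gold, pred):
    """Pearson's product-moment correlation (a measure of association)."""
    n = len(gold)
    mean_g, mean_p = sum(gold) / n, sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_g * var_p)


def quadratic_weighted_kappa(gold, pred, min_score, max_score):
    """Cohen's kappa with quadratic weights (a measure of agreement).

    Assumes integer scores in [min_score, max_score] with at least two
    distinct score points on the scale.
    """
    n = len(gold)
    k = max_score - min_score + 1
    observed = Counter(zip(gold, pred))  # joint counts of (gold, pred) pairs
    gold_hist, pred_hist = Counter(gold), Counter(pred)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected disagreement under independent marginals.
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / (n * n)
    return 1.0 - weighted_observed / weighted_expected


# Toy data (hypothetical): every prediction is one point above the gold score.
gold = [1, 2, 3, 4]
pred = [2, 3, 4, 5]
print(pearson_r(gold, pred))
print(rmse(gold, pred))
print(quadratic_weighted_kappa(gold, pred, min_score=1, max_score=5))
```

On this toy data Pearson's r is 1.0, while RMSE is 1.0 and the quadratically weighted κ is 5/7 (about 0.71). A measure of association can therefore hide a systematic offset that error and agreement measures expose.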
“…In this paper, we predict real-valued scores on a continuous scale and evaluate the accuracy of the predicted scores by using mean squared error (MSE) as our default metric. Although some previous studies have used quadratically-weighted kappa (QWK) as another possible metric for evaluating content-scoring models, more recent work has shown that QWK may possess properties that render it less than suitable for automated scoring evaluation (Yannakoudakis and Cummins, 2015).…”
Section: Methods (mentioning)
confidence: 99%
“…Being able to detect topical relevance can help prevent such weaknesses, provide useful feedback to the students, and is also a step towards evaluating more creative aspects of learner writing. While there is existing work on detecting answer relevance given a textual prompt (Persing and Ng, 2014; Cummins et al, 2015; Rei and Cummins, 2016), only limited previous research has been done to extend this to visual prompts. Some recent work has investigated answer relevance to visual prompts as part of automated scoring systems (Somasundaran et al, 2015; King and Dickinson, 2016), but they reduced the problem to a textual similarity task by relying on hand-written reference descriptions for each image without directly incorporating visual information.…”
Section: Relevance Detection Model (mentioning)
confidence: 99%
“…While there is previous work on assessing the relevance of answers given a textual prompt (Persing and Ng, 2014; Cummins et al, 2015; Rei and Cummins, 2016), very little research has been done to incorporate visual writing prompts. In this setting, students are asked to write a short description about an image in order to assess their language skills, and we would like to automatically evaluate the semantic relevance of their answers.…”
Section: Introduction (mentioning)
confidence: 99%