Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop 2020
DOI: 10.18653/v1/2020.acl-srw.32

Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation

Abstract: Many recent Short Answer Scoring (SAS) systems have employed Quadratic Weighted Kappa (QWK) as their evaluation measure. However, we hypothesize that QWK is unsatisfactory for evaluating SAS systems when we consider measuring their effectiveness in actual usage. We introduce a new task formulation of SAS that matches the actual usage. In our formulation, the SAS systems should extract as many scoring predictions as possible that are not critical scoring errors (CSEs). We conduct the experiments i…
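
For readers unfamiliar with the measure named in the abstract, the following is a minimal sketch of the standard QWK definition, not code from the paper. The function name `quadratic_weighted_kappa`, the integer score range arguments, and the convention of returning 1.0 when both raters give one identical constant score are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, system, min_score, max_score):
    """Quadratic Weighted Kappa between two lists of integer scores.

    Hypothetical helper for illustration; assumes every score lies in
    [min_score, max_score] and that max_score > min_score.
    """
    if len(human) != len(system) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("need at least two score categories")

    n = len(human)
    num_categories = max_score - min_score + 1
    observed = Counter(zip(human, system))  # joint counts O[i][j]
    hist_human = Counter(human)             # marginal counts of human scores
    hist_system = Counter(system)           # marginal counts of system scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic disagreement weight, 0 on the diagonal.
            weight = (i - j) ** 2 / (num_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_human[i] * hist_system[j] / n

    if weighted_expected == 0.0:
        # Both raters gave the same single score everywhere (assumed convention).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Because QWK summarizes agreement over a whole set of predictions, a high value does not by itself rule out a few large individual errors, which is the kind of concern the abstract raises.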

Cited by 7 publications (7 citation statements)
References 15 publications
“…For automatic scoring coverage, the posterior tends to be slightly dominant in ASAP-SAS, whereas the trust score is slightly dominant in the RIKEN dataset. The slightly higher performance of the trust score for the RIKEN dataset is consistent with Funayama et al [5], which was the only study, to the best of our knowledge, to utilize confidence in the SAS field.…”
Section: Results (supporting)
confidence: 88%
“…In this study, we extend the work of Funayama et al [5], and propose a new framework for minimizing human scoring costs while controlling the overall scoring quality of the combination of human scoring and automated scoring. We also conducted cross-lingual experiments using a Japanese SAS dataset, as well as the ASAP dataset commonly used in the SAS field.…”
Section: Previous Research (mentioning)
confidence: 90%
“…SAG is the task of estimating scores of short-text answers written as an answer to a given prompt, on the basis of whether the answer satisfies the rubrics prepared by a human in advance (Mohler et al 2011; Funayama et al 2020; Mizumoto et al 2019). The SAG systems play a central role in providing stable and sustainable scoring in repeated and large-scale examinations and (online) self-study learning support systems (Attali and Burstein 2006; Mizumoto et al 2019; Shermis et al 2010; Leacock and Chodorow 2003; Burrows et al 2015).…”
Section: Fig (mentioning)
confidence: 99%