2022
DOI: 10.1609/aaai.v36i11.21563

Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees

Abstract: Automated Scoring (AS), the natural language processing task of scoring essays and speeches in an educational testing setting, is growing in popularity and being deployed across contexts from government examinations to companies providing language proficiency services. However, existing systems either forgo human raters entirely, thus harming the reliability of the test, or score every response by both human and machine thereby increasing costs. We target the spectrum of possible solutions in between, making u…

Cited by 4 publications (5 citation statements)
References 26 publications
“…Often, scientific reasoning subscores are heavily weighted towards the negative class (i.e., a large majority of the students do not demonstrate scientific reasoning). Cohen's QWK was chosen because it is widely used in the automated essay scoring (AES) literature (Singh et al 2023; Singla et al 2022). Unlike traditional Cohen's k (Cohen 1960), Cohen's QWK accounts for the degree of disagreement, making it well-suited for ordinal data.…”
Section: Results
Citation type: mentioning
confidence: 99%
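
The property described in the statement above, that quadratic weighted kappa (QWK) penalizes a disagreement more the further apart the two scores are, can be illustrated with a minimal sketch. This is not code from any of the cited papers; the function name `quadratic_weighted_kappa`, the `num_classes` argument, and the assumption that scores are integers from 0 to `num_classes - 1` are all illustrative choices.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Cohen's kappa with quadratic weights.

    Assumes (for illustration) integer labels in 0..num_classes-1.
    """
    if num_classes < 2:
        raise ValueError("need at least two score classes")
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("rating lists must be non-empty and equal length")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist = (num_classes - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Quadratic weight: 0 on the diagonal, 1 at maximum distance.
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Degenerate case: both raters used one identical label throughout.
        # Treating this as perfect agreement is an assumption of this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    human = [0, 1, 2, 2]
    machine = [0, 1, 1, 2]
    print(quadratic_weighted_kappa(human, machine, num_classes=3))
```

In the usage example, the single disagreement is only one score point apart, so it receives a small weight; a disagreement spanning the full score range would lower the statistic much more.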
“…Advances in natural language processing (NLP) have produced improved automated assessment scoring approaches to support teaching and learning (e.g., Adair et al 2023; Wilson et al 2021). Proposed methodologies include data augmentation, next sentence prediction (Wu et al 2023), prototypical neural networks (Zeng et al 2023), cross-prompt fine-tuning (Funayama et al 2023), human-in-the-loop scoring via sampling responses (Singla et al 2022), and reinforcement learning (Liu et al 2022). While these methods have enjoyed varying degrees of success, a majority of these applications have targeted more structured mathematics and computer science tasks (i.e., tasks that can be solved formulaically), but their grading is different from scoring free-form short-answer responses by middle school students in science domains.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Similarly, it has been a common complaint that AES systems focus unjustifiably on obscure and difficult vocabulary (Perelman et al, 2014a). While earlier, each score generated by the AI systems was verified by an expert human rater, it is concerning to see that now many of them are scoring independently without any intervention by human experts (O'Donnell, 2020; Singla et al, 2022a). The concerns are further alleviated by the fact that the scores awarded by such systems are used in life-changing decisions ranging from college and job applications to visa approvals (ETS, 2020b; Educational Testing Association, 2019; USBE, 2020; Institute, 2020).…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…These models, apart from defending AES systems against samples causing oversensitivity and overstability, can also inform effective human intervention strategy. For instance, AES deployments either completely rely on double scoring essay samples (human and machine) or solely on machine ratings alone (ETS, 2020a; Singla et al, 2022a). With the developed model, AES deployments can choose to have an effective middle ground by selecting samples for human testing and intervention more effectively.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…AES utilizes Natural Language Processing (NLP) and Machine Learning (ML) techniques to evaluate these essays in a more efficient and scalable manner. Commonly used in standardized tests like the GRE and the TOEFL (Attali and Burstein 2006), many organizations and education councils have turned to use AES to reduce workload (Singla et al 2022).…”
Section: Introduction
Citation type: mentioning
confidence: 99%