2014
DOI: 10.1002/ets2.12005

Monitoring of Scoring Using the e‐rater® Automated Scoring System and Human Raters on a Writing Test

Abstract: This article proposes and investigates several methodologies for monitoring the quality of constructed‐response (CR) scoring, both human and automated. There is an increased interest in the operational scoring of essays using both automated scoring and human raters. There is also evidence of rater effects—scoring severity and score inconsistency by human raters. Recently, automated scoring of CRs was successfully implemented with human scoring for operational programs (TOEFL® and GRE® tests); however, there is…

Cited by 13 publications (19 citation statements), published 2014–2024. References 21 publications.
“…It follows that a fair automated scoring system should not introduce additional construct-irrelevant group-related variance or disadvantage any group of test-takers in comparison to human scores (Penfield, 2016). Several standard measures have been used to evaluate the fairness of the automated scoring systems across different groups, for example speakers of different languages or test-takers with disabilities (Burstein and Chodorow, 1999; Bridgeman et al., 2012; Wang and von Davier, 2014; Wang et al., 2016; Loukina and Buzick, 2017). The two most common analyses are standardized mean score differences and overall model performance for different groups with human scores (predictive ability) (Ramineni and Williamson, 2013; Williamson et al., 2012).…”
Section: Fairness Metrics for Automated Scoring (mentioning)
confidence: 99%
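As an illustration of the first of the two analyses named in that statement, the sketch below computes a standardized mean score difference between machine and human scores within each subgroup. It is a minimal example under stated assumptions: the function names, the grouping labels, the 0.10 flagging threshold, and the choice to scale by the human-score standard deviation are illustrative, not the specific procedure used in the article or the cited studies.

```python
import numpy as np

def standardized_mean_difference(machine_scores, human_scores):
    """Standardized mean difference (SMD) between machine and human scores,
    scaled here by the standard deviation of the human scores (conventions vary)."""
    machine_scores = np.asarray(machine_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    return (machine_scores.mean() - human_scores.mean()) / human_scores.std(ddof=1)

def smd_by_group(machine_scores, human_scores, groups):
    """Compute the SMD separately for each subgroup (e.g., native language)."""
    machine_scores = np.asarray(machine_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    groups = np.asarray(groups)
    return {
        g: standardized_mean_difference(machine_scores[groups == g],
                                        human_scores[groups == g])
        for g in np.unique(groups)
    }

# Hypothetical scores; flag groups whose |SMD| exceeds an illustrative 0.10 cut-off.
machine = [3.2, 4.1, 2.8, 3.9, 4.4, 3.0]
human   = [3.0, 4.0, 3.0, 4.0, 4.0, 3.5]
group   = ["L1-A", "L1-A", "L1-B", "L1-B", "L1-A", "L1-B"]
for g, smd in smd_by_group(machine, human, group).items():
    print(g, round(smd, 3), "FLAG" if abs(smd) > 0.10 else "ok")
```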
“…For English texts, well-developed tools for automatic text analysis already exist, for example Coh-Metrix, which is freely available on the internet (McNamara et al., 2014), and commercially developed assessment tools such as E-rater (Monaghan & Bridgeman, 2005; Wang & Davier, 2014). For Swedish texts, automated assessment has so far been applied in only a few cases, and then with a limited number of linguistic measures for the automatic analysis (Östling et al., 2013; Kann, 2013).…”
Section: Automated Assessment of Student Texts (unclassified)
“…Text qualities of a more qualitative nature, such as the originality of a text's ideas or its communicative voice, cannot be captured by a computer. It is also in combination with a human rater that the E-rater tool is used at ETS in the USA (Monaghan & Bridgeman, 2005; Wang & Davier, 2014).…”
Section: Automated Assessment of Student Texts (unclassified)
“…Automated scoring is certainly not immune to such biases and, in fact, several studies have documented differing performance of automated scoring models for test-takers with different native languages or with disabilities (Burstein and Chodorow, 1999; Bridgeman et al., 2012; Wang and von Davier, 2014; Wang et al., 2016; Loukina and Buzick, in print).…”
Section: Detecting Biases in Automated Scoring (mentioning)
confidence: 99%
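A minimal sketch of how such differing performance across subgroups might be checked in practice: compute standard human–machine agreement statistics separately for each group and compare them. The group labels, the 1–6 score scale, and the choice of Pearson r plus quadratically weighted kappa are illustrative assumptions, not the specific analyses of the studies cited above.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_by_group(machine_scores, human_scores, groups):
    """Per-subgroup human-machine agreement: Pearson r and quadratically
    weighted kappa, two statistics commonly reported in automated-scoring evaluations."""
    machine_scores = np.asarray(machine_scores)
    human_scores = np.asarray(human_scores)
    groups = np.asarray(groups)
    results = {}
    for g in np.unique(groups):
        m, h = machine_scores[groups == g], human_scores[groups == g]
        results[g] = {
            "n": len(m),
            "pearson_r": float(np.corrcoef(m, h)[0, 1]),
            "qw_kappa": cohen_kappa_score(m, h, weights="quadratic"),
        }
    return results

# Hypothetical integer essay scores on a 1-6 scale for two illustrative groups.
machine = [4, 3, 5, 4, 2, 3, 5, 4]
human   = [4, 3, 4, 4, 3, 3, 5, 5]
group   = ["ESL", "ESL", "ESL", "ESL", "non-ESL", "non-ESL", "non-ESL", "non-ESL"]
for g, stats in agreement_by_group(machine, human, group).items():
    print(g, stats)
```

Large gaps in these per-group statistics, like large standardized mean differences, would be a signal to investigate the scoring model further rather than a definitive finding of bias.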