2008
DOI: 10.1002/j.2333-8504.2008.tb02118.x
Sample‐size Requirements for Automated Essay Scoring

Abstract: Sample‐size requirements were considered for automated essay scoring in cases in which the automated essay score estimates the score provided by a human rater. Analysis considered both cases in which an essay prompt is examined in isolation and those in which a family of essay prompts is studied. In typical cases in which content analysis is not employed and in which the only object is to score individual essays to provide feedback to the examinee, it appears that several hundred essays are sufficient. For app…

Cited by 5 publications (5 citation statements). References 11 publications.
“…Davey () indicated that e‐rater and human scores are imperfectly related owing to this issue. Other models, such as the cumulative‐logit model (Haberman & Sinharay, ), should be explored and used in future research. Finally, human rating quality in the training sample matters: human scores need to be carefully monitored during the automated scoring process.…”
Section: Discussion
confidence: 99%
“…Because the G and GOPSI models typically deal with a group of 12 or more prompts, they require smaller sample sizes per prompt than the PS models (Williamson, 2009). Fewer than 100 essays per prompt might be adequate for constructing a G model of a large family of prompts (Haberman & Sinharay, 2008). In contrast, the PS models require a greater volume of training essays.…”
Section: Potential Contributions To the Literature
confidence: 98%
“…Much effort has been made to develop e-rater scoring models that can be applied to produce reliable essay scores (e.g., Attali, 2007; Bridgeman, Trapani, & Attali, 2012; Haberman & Sinharay, 2008). To date, there are primarily three types of scoring models: the Generic Model (G Model), the Generic with Operational Prompt Specific Intercept Model (GOPSI Model), and the Prompt Specific Model (PS Model).…”
Section: The E-rater Automated Essay Scoring Models
confidence: 99%
“…Traditionally, e-rater models are built on a subset of the essays and evaluated on the remaining essays. Researchers have also proposed constructing a scoring model in a jackknife or n-fold cross-validation fashion (Haberman & Sinharay, 2008). This way, not only is the sample size for model calibration greatly increased, but the variation in the entire data set can also be reflected in both model calibration and evaluation.…”
Section: E-rater Automated Essay Scoring
confidence: 99%
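The n-fold cross-validation scheme described above can be illustrated with a minimal sketch. This is not the actual e-rater system: the feature matrix, the linear scoring model, and all data below are hypothetical stand-ins, used only to show how every essay ends up with an out-of-sample predicted score while all of the data contributes to calibration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 essays, 5 writing features plus an intercept,
# and simulated human scores from an assumed linear relationship.
n_essays, n_features = 200, 5
X = np.column_stack([np.ones(n_essays), rng.normal(size=(n_essays, n_features))])
human = X @ rng.normal(size=n_features + 1) + rng.normal(scale=0.8, size=n_essays)

# n-fold cross-validation: fit on all folds but one, predict the held-out fold.
k = 10
folds = np.array_split(rng.permutation(n_essays), k)
pred = np.empty(n_essays)
for hold in folds:
    train = np.setdiff1d(np.arange(n_essays), hold)
    beta = np.linalg.lstsq(X[train], human[train], rcond=None)[0]
    pred[hold] = X[hold] @ beta

# Every essay now carries a prediction made by a model that never saw it,
# so the full sample informs both calibration and evaluation.
```

Agreement between `pred` and `human` (e.g., a correlation) can then be reported as an honest out-of-sample estimate of scoring quality.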
“…PS (PRESS) models were built for each of the 63 prompts separately, applying a leave-one-out sampling approach. Models were evaluated using indices derived from the PRESS statistic (Kutner, Nachtsheim, Neter, & Li, 2005, p. 360; see Guilford & Fruchter, 1973; Haberman & Sinharay, 2008; and Weisberg, 1985, for exceptions). In this study, we used a total of four PRESS-derived indices.…”
Section: PS (PRESS) Model Calibration and Evaluation
confidence: 99%
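The PRESS statistic mentioned in this excerpt is the sum of squared leave-one-out (deleted) residuals of a linear model. As a hedged sketch on made-up data (not the study's prompts or features), the following shows the standard identity that PRESS can be computed from one fit via the hat-matrix leverages, without actually refitting the model n times, and checks it against explicit leave-one-out refits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 observations, 3 predictors plus an intercept.
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n)

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y

# PRESS = sum over i of (e_i / (1 - h_ii))^2, the deleted residuals squared.
press = np.sum((resid / (1 - np.diag(H))) ** 2)

# Sanity check: refit with each observation held out in turn.
loo_resid = []
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_resid.append(y[i] - X[i] @ beta)
press_explicit = np.sum(np.square(loo_resid))
```

The two quantities agree exactly (up to floating point), which is why PRESS-derived indices are cheap to compute even for large prompt families.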