2008
DOI: 10.1002/j.2333-8504.2008.tb02118.x
Sample‐size Requirements for Automated Essay Scoring

Abstract: Sample‐size requirements were considered for automated essay scoring in cases in which the automated essay score estimates the score provided by a human rater. Analysis considered both cases in which an essay prompt is examined in isolation and those in which a family of essay prompts is studied. In typical cases in which content analysis is not employed and in which the only object is to score individual essays to provide feedback to the examinee, it appears that several hundred essays are sufficient. For app…

Cited by 5 publications (5 citation statements). References 11 publications.
“…Davey () indicated that e‐rater and human scores are imperfectly related owing to this issue. Other models, such as the cumulative‐logit model (Haberman & Sinharay, ), should be explored and used in future research. Finally, human rating quality in the training sample matters: human scores need to be carefully monitored during the automated scoring process.…”
Section: Discussion
confidence: 99%
“…Because the G and GOPSI models typically deal with a group of 12 or more prompts, they require smaller sample sizes per prompt than the PS models (Williamson, 2009). Fewer than 100 essays per prompt might be adequate for constructing a G model of a large family of prompts (Haberman & Sinharay, 2008). In contrast, the PS models require a greater volume of training essays.…”
Section: Potential Contributions To the Literature
confidence: 98%
“…Much effort has been made to develop e-rater scoring models that can be applied to produce reliable essay scores (e.g., Attali, 2007; Bridgeman, Trapani, & Attali, 2012; Haberman & Sinharay, 2008). To date, there are primarily three types of scoring models: the Generic Model (G Model), the Generic with Operational Prompt Specific Intercept Model (GOPSI Model), and the Prompt Specific Model (PS Model).…”
Section: The E-rater Automated Essay Scoring Models
confidence: 99%
“…Traditionally, e-rater models are built on a subset of the essays and evaluated on the remaining essays. Researchers have also proposed constructing a scoring model in a jackknife or n-fold cross-validation fashion (Haberman & Sinharay, 2008). This way, not only is the sample size for model calibration greatly increased, but the variation in the entire data set can also be reflected in both model calibration and evaluation.…”
Section: E-rater Automated Essay Scoring
confidence: 99%
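The n-fold cross-validation scheme described above can be illustrated with a minimal sketch. This is not the actual e-rater system: the feature matrix, the linear scoring model, and all data below are hypothetical stand-ins, used only to show how every essay ends up with an out-of-sample predicted score while all of the data contributes to calibration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 essays, 5 writing features plus an intercept,
# and simulated human scores from an assumed linear relationship.
n_essays, n_features = 200, 5
X = np.column_stack([np.ones(n_essays), rng.normal(size=(n_essays, n_features))])
human = X @ rng.normal(size=n_features + 1) + rng.normal(scale=0.8, size=n_essays)

# n-fold cross-validation: fit on all folds but one, predict the held-out fold.
k = 10
folds = np.array_split(rng.permutation(n_essays), k)
pred = np.empty(n_essays)
for hold in folds:
    train = np.setdiff1d(np.arange(n_essays), hold)
    beta = np.linalg.lstsq(X[train], human[train], rcond=None)[0]
    pred[hold] = X[hold] @ beta

# Every essay now carries a prediction made by a model that never saw it,
# so the full sample informs both calibration and evaluation.
```

Agreement between `pred` and `human` (e.g., a correlation) can then be reported as an honest out-of-sample estimate of scoring quality.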
“…PS (PRESS) models were built for each of the 63 prompts separately, applying a leave-one-out sampling approach. Models were evaluated using indices derived from the PRESS statistic (Kutner, Nachtsheim, Neter, & Li, 2005, p. 360; see Guilford & Fruchter, 1973; Haberman & Sinharay, 2008; and Weisberg, 1985, for exceptions). In this study, we used a total of four PRESS-derived indices.…”
Section: PS (PRESS) Model Calibration and Evaluation
confidence: 99%
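The PRESS statistic mentioned in this excerpt is the sum of squared leave-one-out (deleted) residuals of a linear model. As a hedged sketch on made-up data (not the study's prompts or features), the following shows the standard identity that PRESS can be computed from one fit via the hat-matrix leverages, without actually refitting the model n times, and checks it against explicit leave-one-out refits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 observations, 3 predictors plus an intercept.
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n)

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y

# PRESS = sum over i of (e_i / (1 - h_ii))^2, the deleted residuals squared.
press = np.sum((resid / (1 - np.diag(H))) ** 2)

# Sanity check: refit with each observation held out in turn.
loo_resid = []
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_resid.append(y[i] - X[i] @ beta)
press_explicit = np.sum(np.square(loo_resid))
```

The two quantities agree exactly (up to floating point), which is why PRESS-derived indices are cheap to compute even for large prompt families.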