2012
DOI: 10.1080/15305058.2011.645973
Comparison of e-rater® Automated Essay Scoring Model Calibration Methods Based on Distributional Targets

Abstract: This article describes two separate, related studies that provide insight into the effectiveness of e-rater score calibration methods based on different distributional targets. In the first study, we developed and evaluated a new type of e-rater scoring model that was cost-effective and applicable under conditions of absent human rating and small candidate volume. This new model type, called the Scale Midpoint Model, outperformed an existing e-rater scoring model that is often adopted by certain e-rater system…
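The abstract names the Scale Midpoint Model without spelling out its construction. As a hedged, purely illustrative sketch of what calibrating scores to a distributional target such as the scale midpoint can look like in general (the function, the target standard deviation, and the 1-6 scale below are assumptions, not the model defined in the article):

```python
# Hedged illustration only: standardize a feature composite and linearly
# rescale it so its distribution is centered on a chosen target (here, the
# scale midpoint). This is NOT the article's Scale Midpoint Model.
import numpy as np

def calibrate_to_target(composite, target_mean, target_sd):
    """Linear transform so the scored distribution has the target mean and SD."""
    composite = np.asarray(composite, dtype=float)
    z = (composite - composite.mean()) / composite.std(ddof=1)
    return target_mean + target_sd * z

# Example: a feature composite rescaled to a 1-6 scale centered at the 3.5 midpoint
# (target SD chosen arbitrarily for illustration).
raw = np.array([0.12, 0.35, 0.28, 0.51, 0.44, 0.19])
print(calibrate_to_target(raw, target_mean=3.5, target_sd=1.0))
```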

Cited by 5 publications (3 citation statements); References 10 publications.

Citation statements:
“…The comparison was done using two criteria. The first criterion was human/machine agreement, which was indicated by four commonly used indices (i.e., Pearson correlation coefficient, quadratic-weighted kappa, exact percentage agreement, and standardized mean score difference; see Zhang, Williamson, Breyer, & Trapani, 2012, for computation details).…”
Section: Stratification Variables (mentioning; confidence: 99%)
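For readers unfamiliar with the four agreement indices named in this excerpt, here is a sketch that computes them for a pair of human and machine score vectors; the rounding and pooled-SD conventions below are one common choice and may differ from the computation details given in Zhang, Williamson, Breyer, & Trapani (2012).

```python
# Four common human/machine agreement indices: Pearson correlation,
# quadratic-weighted kappa, exact percentage agreement, and standardized
# mean score difference.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_indices(human, machine):
    human = np.asarray(human, dtype=float)
    machine = np.asarray(machine, dtype=float)
    rounded = machine.round().astype(int)

    # Pearson correlation between human and (unrounded) machine scores.
    r = np.corrcoef(human, machine)[0, 1]

    # Quadratic-weighted kappa on integer score categories.
    qwk = cohen_kappa_score(human.astype(int), rounded, weights="quadratic")

    # Exact percentage agreement with the rounded machine score.
    exact = float(np.mean(human.astype(int) == rounded))

    # Standardized mean score difference (machine minus human) over a pooled SD.
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2)
    smd = (machine.mean() - human.mean()) / pooled_sd

    return {"pearson_r": r, "qw_kappa": qwk,
            "exact_agreement": exact, "std_mean_diff": smd}

# Toy example on a 1-6 holistic scale.
print(agreement_indices([3, 4, 4, 5, 2, 3], [3.2, 3.8, 4.4, 4.9, 2.3, 3.6]))
```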
“…The impact of other weighting schemes than those investigated in this study, such as feature reliability–based weights (under study by Attali), in reducing the undue influence of a few features on the final score may potentially reduce the score differences at the subgroup level. Another potential avenue for investigation is the impact of variations in sample size for the training set as well as for the representation of the different demographic subgroups in the training sample (as studied by Zhang, Williamson, Breyer, & Trapani, 2012) on feature weights and consequent model performance at the subgroup level.…”
Section: Results (mentioning; confidence: 99%)
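As a hedged illustration of the subgroup-level check this excerpt alludes to, the sketch below computes the standardized human/machine score difference within each demographic subgroup; the column names and toy data are placeholders, not the cited studies' layout.

```python
# Standardized machine-minus-human score difference within each subgroup.
# "human", "machine", and "subgroup" are hypothetical column names.
import pandas as pd

def subgroup_smd(df, human_col="human", machine_col="machine", group_col="subgroup"):
    rows = []
    for group, g in df.groupby(group_col):
        sd = g[human_col].std(ddof=1)  # human-rating SD as the reference spread
        smd = (g[machine_col].mean() - g[human_col].mean()) / sd
        rows.append({"subgroup": group, "n": len(g), "smd": smd})
    return pd.DataFrame(rows)

scores = pd.DataFrame({
    "human":    [3, 4, 2, 5, 3, 4, 4, 3],
    "machine":  [3.4, 4.1, 2.6, 4.7, 2.9, 4.5, 4.2, 3.6],
    "subgroup": ["A", "A", "A", "A", "B", "B", "B", "B"],
})
print(subgroup_smd(scores))
```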
“…The operational e-rater scoring model is constructed by regressing human ratings on the features, which results in a multiple linear regression that can be applied to generate scores that a human rater would assign to a given essay. Although continuous research efforts have been allocated to develop new types of scoring models, with an intention to enhance the automated score validity (e.g., Ben-Simon & Bennett, 2007; Zhang, Williamson, Breyer, & Trapani, 2012), to date, only two types of models have been established for operational practice in large-scale assessments: the generic model (G model) and the prompt-specific model (PS model).…”
Section: E-rater Automated Essay Scoring (mentioning; confidence: 99%)
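The excerpt above describes the operational model as a multiple linear regression of human ratings on e-rater's features. A minimal sketch of that step with hypothetical feature names and toy data (the operational feature set, any weight constraints, and the G/PS training-sample distinction are not reproduced here):

```python
# Regress human ratings on essay features with ordinary least squares, then
# apply the fitted weights to score a new essay. Feature names and values are
# hypothetical placeholders, not e-rater's operational feature set.
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["grammar", "usage", "mechanics", "development"]

# Training data: one row of feature values per essay, plus its human rating.
X_train = np.array([
    [0.2, 0.1, 0.3, 0.5],
    [0.6, 0.5, 0.7, 0.8],
    [0.4, 0.3, 0.4, 0.6],
    [0.8, 0.7, 0.9, 0.9],
    [0.3, 0.2, 0.5, 0.4],
    [0.7, 0.6, 0.8, 0.7],
])
y_train = np.array([2, 4, 3, 5, 3, 4])  # human ratings on the reporting scale

model = LinearRegression().fit(X_train, y_train)
print(dict(zip(feature_names, model.coef_)))  # fitted feature weights

# Scoring an unseen essay with the same weights. A generic (G) model would be
# trained on essays pooled across prompts; a prompt-specific (PS) model on
# essays from a single prompt.
new_essay = np.array([[0.5, 0.4, 0.6, 0.7]])
print(model.predict(new_essay))
```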