2013
DOI: 10.1002/j.2333-8504.2013.tb02325.x

The Impact of Sampling Approach on Population Invariance in Automated Scoring of Essays

Abstract: Many testing programs use automated scoring to grade essays. One issue in automated essay scoring that has not been examined adequately is population invariance and its causes. The primary purpose of this study was to investigate the impact of sampling in model calibration on population invariance of automated scores. This study analyzed scores produced by the e‐rater® scoring engine using a GRE® assessment data set. Results suggested that the equal allocation stratification by language sampling approach performed…
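The equal-allocation stratification-by-language sampling approach named in the abstract amounts to drawing the same number of examinees from each language group when building the calibration sample. A minimal sketch of that idea, assuming a hypothetical list of examinee records with a language field (not the study's actual calibration code):

```python
# Illustrative sketch of equal-allocation stratified sampling by language group.
# The data structure and field names are hypothetical, not the study's code.
import random
from collections import defaultdict

def equal_allocation_sample(examinees, stratum_of, per_stratum, seed=0):
    """Draw the same number of examinees from every stratum (e.g., language group)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for examinee in examinees:
        strata[stratum_of(examinee)].append(examinee)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))  # a stratum may be smaller than its quota
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical usage for building a calibration sample:
# calibration = equal_allocation_sample(examinees,
#                                       stratum_of=lambda ex: ex["language"],
#                                       per_stratum=500)
```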

Cited by 15 publications (11 citation statements)
References 17 publications

“…At the same time, there were clear differences among countries/territories, with lower difficulty associated with English-native speaking countries and higher levels of difficulty evident in some non-English speaking countries/territories. This result is consistent with the findings of other studies (e.g., Bridgeman et al. 2012; Zhang 2013b), and might be caused by smaller variation in English-language writing proficiency among examinees from those nonnative English-speaking countries/territories. Considerably less variation across test countries/territories was found for the MSE, which could reflect the fact that it measures a somewhat different aspect of agreement from R-squared.…”
Section: Discussion (supporting)
confidence: 93%
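The contrast this passage draws between R-squared and MSE can be made concrete: within a subgroup whose human scores vary little, R-squared (which is normalized by that within-group score variance) can drop even when the raw human-machine discrepancies captured by MSE stay about the same. A minimal sketch under that reading, with hypothetical arrays of human scores, machine scores, and country/territory labels (not the cited studies' code):

```python
# Per-group R-squared and MSE of machine scores against human scores.
# Hypothetical inputs; illustrates why low within-group score variance can
# depress R-squared while leaving MSE largely unchanged.
import numpy as np

def agreement_by_group(human, machine, group):
    human, machine, group = (np.asarray(x) for x in (human, machine, group))
    results = {}
    for g in np.unique(group):
        h, m = human[group == g], machine[group == g]
        mse = np.mean((m - h) ** 2)                  # absolute size of discrepancies
        ss_res = np.sum((h - m) ** 2)
        ss_tot = np.sum((h - h.mean()) ** 2)         # shrinks when the group is homogeneous
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
        results[g] = {"R2": float(r2), "MSE": float(mse)}
    return results
```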
“…Therefore, it is still necessary to investigate the human-machine score agreement and find other criterion-related validity evidence for AES systems. Second, even though many studies suggested that AES systems could not evaluate the higher-order aspects of writing proficiency because they could not read and understand like human raters (Attali, 2015; Attali & Burstein, 2006; Attali, Lewis, & Steier, 2012; Weigle, 2011; Zhang, 2013), there is a need for more empirical evidence. Third, most of the AES systems that researchers have examined were developed by institutions in America and are unavailable in China, a country with a huge number of EFL (English as a Foreign Language) learners.…”
Section: Literature Review (mentioning)
confidence: 99%
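Human-machine score agreement in AES research is typically summarized with statistics such as exact and adjacent agreement rates and quadratic weighted kappa. None of these are named in the quoted passage, so the sketch below is only a generic illustration of how such agreement might be computed from two hypothetical vectors of integer essay scores:

```python
# Generic agreement statistics between human and machine integer essay scores
# (illustrative; not tied to any particular study's evaluation code).
import numpy as np

def agreement_stats(human, machine, min_score, max_score):
    human, machine = np.asarray(human), np.asarray(machine)
    n = max_score - min_score + 1

    exact = float(np.mean(human == machine))                  # identical scores
    adjacent = float(np.mean(np.abs(human - machine) <= 1))   # within one score point

    # Quadratic weighted kappa: chance-corrected agreement with squared-distance weights.
    observed = np.zeros((n, n))
    for h, m in zip(human, machine):
        observed[h - min_score, m - min_score] += 1
    weights = np.array([[(i - j) ** 2 for j in range(n)] for i in range(n)],
                       dtype=float) / (n - 1) ** 2
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    qwk = 1.0 - (weights * observed).sum() / (weights * expected).sum()

    return {"exact": exact, "adjacent": adjacent, "qwk": float(qwk)}
```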
“…LSA scoring algorithms can be used to quantify individual differences in the extent to which respondents have used the correct words within an essay on a specific topic (Landauer, Laham, & Foltz, 2003), and analyses have shown that LSA is useful in assessing the quality of essays that have been written by respondents on specific topics (Landauer, Laham, & Foltz, 2000). LSA technologies have been adapted to support automated essay scoring in educational settings to provide writing instruction (Streeter, Bernstein, Foltz, & DeLand, 2011), as well as for high-stakes writing assessments including the SAT, GRE, and GMAT (Shermis, 2014; Zhang, 2013) and low-stakes writing assessments such as evaluating military leadership and medical diagnostic reasoning (Landauer et al., 2000; LaVoie, Cianciolo, & Martin, 2015; LaVoie et al., 2010). Analyses also demonstrate that LSA-generated scores often show agreement with subject matter experts (SMEs) that is on par with agreement between SMEs (Landauer et al., 2000, 2003; Shermis, 2014; Shermis, Burstein, Higgins, & Zechner, 2010).…”
Section: LSA Assessments (mentioning)
confidence: 99%
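A minimal sketch of the general LSA approach described here, assuming scikit-learn is available: essays are projected into a latent semantic space via truncated SVD of a TF-IDF term-document matrix, and a new essay is scored from the human scores of its most similar pre-scored essays. This illustrates the technique only; it is not the implementation behind e-rater or the systems cited above.

```python
# LSA-style essay scoring sketch (illustrative; not the cited systems' code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def fit_lsa_scorer(scored_essays, scores, n_components=100):
    """Build a latent semantic space from a pool of human-scored essays."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(scored_essays)                 # term-document matrix
    svd = TruncatedSVD(n_components=min(n_components, X.shape[1] - 1))
    Z = svd.fit_transform(X)                                    # essays in LSA space
    return vectorizer, svd, Z, np.asarray(scores, dtype=float)

def lsa_score(essay, vectorizer, svd, Z, scores, k=5):
    """Score a new essay from its k nearest pre-scored essays in LSA space."""
    z = svd.transform(vectorizer.transform([essay]))
    sims = cosine_similarity(z, Z).ravel()
    top = np.argsort(sims)[-k:]                                 # k most similar essays
    weights = np.clip(sims[top], 0.0, None)
    if weights.sum() == 0:
        return float(scores[top].mean())
    return float(np.average(scores[top], weights=weights))      # similarity-weighted score
```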
“…We evaluated the effectiveness of using Latent Semantic Analysis (LSA) to score open-ended short answer responses. LSA-based automated scoring is routinely used for large-scale scoring of essay responses on high- and low-stakes exams (Shermis, 2014; Zhang, 2013), and LSA has been used for short answer scoring (Streeter, Bernstein, Foltz, & DeLand, 2011). To improve the accuracy of short answer scoring, LSA has been restricted to test items with a constrained set of potential responses rather than items with unconstrained potential responses.…”
(mentioning)
confidence: 99%
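For constrained short-answer items of the kind described here, an LSA scorer can be pointed at a small set of reference responses instead of a large pool of scored essays. A self-contained, hypothetical example of that idea (the item, reference answers, and rubric are all invented for illustration):

```python
# Self-contained sketch for a constrained short-answer item (all data invented):
# score a response by its LSA-space similarity to a few reference answers.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

reference_answers = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Plants use sunlight, water, and carbon dioxide to make glucose and oxygen.",
    "Plants make food from sunlight.",
    "Plants grow in soil and need water.",
]
reference_scores = np.array([2.0, 2.0, 1.0, 0.0])    # invented 0-2 rubric
response = "Plants turn sunlight into sugar for energy."

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reference_answers + [response])
Z = TruncatedSVD(n_components=2).fit_transform(X)    # LSA space
sims = cosine_similarity(Z[-1:], Z[:-1]).ravel()     # response vs. references
weights = np.clip(sims, 0.0, None)
predicted = (np.average(reference_scores, weights=weights)
             if weights.sum() > 0 else reference_scores.mean())
print(round(float(predicted), 2))
```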