Validating human and automated scoring of essays against “True” scores

Cohen, Yoav; Levi, Effi; Ben-Simon, Anat

doi:10.1080/08957347.2018.1464450

Cited by 13 publications

(11 citation statements)

References 10 publications


“…In addition to the studies of agreement between the automated score and other indicators of writing proficiency, there were also studies of correlation between the automated score and the estimated true score, which was the average of the scores given by a group of different raters to an essay. The common finding of such studies was that the correlation between automated and human scores was nearly the same as inter-rater correlation (Attali, 2007; Attali & Burstein, 2006; Cohen, Levi, & Ben-Simon, 2018). Apart from such analysis of correlation, researchers also tried to create proper feature weights for the AES model, aiming to optimize the measurement properties and improve the reliability of AES scores (Attali, 2015; Bridgeman & Ramineni, 2017).…”

Section: Literature Review (mentioning)

confidence: 99%

An Evaluation of China’s Automated Scoring System Bingo English

Gao, Li, Gu et al. 2020

IJEL


The study evaluated the effectiveness of Bingo English, one of the representative automated essay scoring (AES) systems in China. Eighty-four essays from an English test held at a Chinese university were collected as the research materials. All the essays were scored both by two trained and experienced human raters and by Bingo English, and their linguistic features were quantified in terms of complexity, accuracy, fluency (CAF), content quality, and organization. After examining the agreement between human and automated scores, and the correlation of both sets of scores with the indicators of the essays’ linguistic features, the study found that Bingo English scores reflected the essays’ quality only in a general way and that the system should be used with caution.
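The comparison that this citing work attributes to Cohen, Levi, & Ben-Simon (2018), correlating automated scores with an estimated true score taken as the mean of several raters' scores, can be sketched in a few lines. This is a minimal illustration, not the method of any of the cited studies: the function names `pearson` and `estimated_true_scores` and all the numbers are hypothetical, chosen only to show the shape of the calculation.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def estimated_true_scores(ratings):
    """Estimate each essay's true score as the mean of its raters' scores."""
    return [mean(essay_ratings) for essay_ratings in ratings]


# Hypothetical data: five essays, each scored by three human raters.
ratings = [[3, 4, 3], [5, 5, 4], [2, 2, 3], [4, 4, 4], [1, 2, 1]]
# Hypothetical automated scores for the same five essays.
machine = [3.2, 4.8, 2.5, 4.1, 1.4]

true_est = estimated_true_scores(ratings)

# Correlation of the automated score with the estimated true score.
r_machine_true = pearson(machine, true_est)
# Inter-rater correlation between the first two human raters, for comparison.
r_inter_rater = pearson([r[0] for r in ratings], [r[1] for r in ratings])

print(f"machine vs. estimated true score: {r_machine_true:.3f}")
print(f"rater 1 vs. rater 2:              {r_inter_rater:.3f}")
```

With real data, the two printed correlations are the quantities the citation statement compares; the mean over raters is only an estimate of the true score and becomes more stable as the number of raters grows.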


Section: Coh-Metrix Features and Validation of Machine Scoring (mentioning)

confidence: 99%

“…This study follows the same foundations of validating the AES framework, meaning that machine- and human-produced scores were compared as a way of validating scoring accuracy. In other words, human ratings were considered the "gold standard" for evaluating AES scoring performance (Cohen, Levi, & Ben-Simon, 2018; Powers, Escoffery, & Duchnowski, 2015). In operational settings, human raters are trained to score by using a rubric and anchor essays, which help them align their rating processes with the score boundaries and designated writing construct.…”

Section: Validation of the Essay Scoring Model (mentioning)

confidence: 99%

“…Surface-level features (such as word frequencies and length) are in use by some AES systems (e.g., O’Leary et al., 2018), but they are often challenged by the educational community because their rationale and empirical relation with human scores and with other traits of writing quality are not well defined (Attali, 2013; Cohen, Levi, & Ben-Simon, 2018; Perelman, 2014). Recent advances in computational linguistics and natural language processing (NLP) have given rise to more rational methods for extracting features that go beyond surface-level characteristics.…”

Section: Introduction (mentioning)

confidence: 99%


Automated scoring of junior and senior high essays using Coh-Metrix features: Implications for large-scale language testing

Latifi, Gierl 2020

Language Testing


An automated essay scoring (AES) program is a software system that uses techniques from corpus and computational linguistics and machine learning to grade essays. In this study, we aimed to describe and evaluate particular language features of Coh-Metrix for a novel AES program that would score junior and senior high school students’ essays from their large-scale assessments. Specifically, we studied nine categories of Coh-Metrix features for developing prompt-specific AES scoring models for our sample. We developed the models by capitalizing on the nine features’ informativeness as a function of dimensionality reduction. We used a three-staged scoring framework. The machine scores were validated against a “gold standard” of ratings, that is, those assigned by two human raters. The nine language features reliably captured the construct of the students’ writing quality. We performed a secondary analysis to see how the scoring models performed in relation to other, already established AES systems, and there was no systematic pattern of scoring discrepancy. However, for essays with widely divergent human ratings, the scoring models were disadvantaged owing to the inherent unreliability of the human scores.


“…However, the usage of automated scoring systems depends on the obtained scores' being as similar as possible to human raters and their not having low reliability. Human raters are an important criterion for automated scoring systems (Cohen, Levi & Ben-Simon, 2018). Automated scoring results that have poor reliability and are incompatible with human raters may cause wrong decisions about individuals.…”

Section: Introduction (mentioning)

confidence: 99%

How Reliable Is It to Automatically Score Open-Ended Items? An Application in the Turkish Language

Uysal, Doğan 2021

Eğitimde Ve Psikolojide Ölçme Ve Değerlendirme Dergisi


The use of open-ended items, especially in large-scale tests, has created difficulties in scoring them. However, this problem can be overcome with an approach based on automated scoring of open-ended items. The aim of this study was to examine the reliability of the data obtained by scoring open-ended items automatically. One of the objectives was to compare different machine learning algorithms in automated scoring (support vector machines, logistic regression, multinomial Naive Bayes, long short-term memory, and bidirectional long short-term memory). The other objective was to investigate the change in the reliability of automated scoring when the proportion of data used to test the automated scoring system was varied (33%, 20%, and 10%). While examining the reliability of automated scoring, a comparison was made with the reliability of the data obtained from human raters. This study, which presented the first attempt at automated scoring of open-ended items in the Turkish language, used Turkish test data from the Academic Skills Monitoring and Evaluation (ABIDE) program administered by the Ministry of National Education. Cross-validation was used to test the system. As coefficients of agreement to show reliability, the study used the percentage of agreement, the quadratic-weighted Kappa, which is frequently used in automated scoring studies, and Gwet's AC1 coefficient, which is not affected by the prevalence problem in the distribution of data into categories. The results of the study showed that automated scoring algorithms could be utilized. The best algorithm for automated scoring was found to be bidirectional long short-term memory. Long short-term memory and multinomial Naive Bayes algorithms showed lower performance than support vector machines, logistic regression, and bidirectional long short-term memory algorithms. The coefficients of agreement at the 33% test data rate were slightly lower than those at the 10% and 20% test data rates, but were within the desired range.
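Two of the agreement coefficients named in this abstract, the percentage of agreement and the quadratic-weighted Kappa, can be computed directly from paired human and automated scores. The sketch below is a plain-Python illustration under stated assumptions: scores are integers on a known scale, the function names `percent_agreement` and `quadratic_weighted_kappa` are hypothetical, and the example scores are invented rather than taken from the study.

```python
def percent_agreement(a, b):
    """Share of items on which the two score lists agree exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic-weighted kappa for two integer score lists.

    Assumes at least two score categories and that the scores are not
    all in a single category (otherwise the denominator is zero).
    """
    k = max_score - min_score + 1
    n = len(a)

    # Observed confusion matrix: rows index scores in a, columns scores in b.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two sets of scores.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 0-3 scale for eight responses.
human = [0, 1, 2, 3, 2, 1, 3, 0]
machine = [0, 1, 2, 2, 2, 1, 3, 1]

print(f"percent agreement:        {percent_agreement(human, machine):.3f}")
print(f"quadratic-weighted kappa: {quadratic_weighted_kappa(human, machine, 0, 3):.3f}")
```

Unlike exact agreement, the quadratic weights penalize a disagreement more heavily the further apart the two scores are, so near misses cost less than large discrepancies.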
