2013
DOI: 10.1002/j.2333-8504.2013.tb02330.x
The Effects of Rater Severity and Rater Distribution on Examinees' Ability Estimation for Constructed‐response Items

Abstract: Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professiona…

Cited by 10 publications (13 citation statements)

References 11 publications
“…Unfortunately, the literature on effective quality control procedures using quality control tools on automated scores or long‐term monitoring on both human and automated scores is sparse. However, many studies have been conducted on human scoring and rater effects (DeCarlo, ; Donoghue, McClellan, & Gladkova, ; Engelhard, , ; Longford, ; Myford & Wolfe, ; Patz, Junker, Johnson, & Mariano, ; Wang & Yao, ; Wilson & Hoskens, ; Wolfe & Myford, ). The results from these studies indicate that biases of examinee ability estimates or systematic error may be caused by varying degrees of rater leniency or central tendency.…”
mentioning
confidence: 99%
“…The results from these studies indicate that biases of examinee ability estimates or systematic error may be caused by varying degrees of rater leniency or central tendency. Additionally, rater effects can increase these bias estimates and lower test reliability (Donoghue et al, ; Wang & Yao, ).…”
mentioning
confidence: 99%
“…A number of models exist that take into account rater effects. Examples include the many-faceted Rasch model (Linacre, 1989); the FACETS model (Lunz, Wright, & Linacre, 1990); an IRT model for multiple raters (Verhelst & Verstralen, 2001); the rater bundle model; the hierarchical rater model (Patz, Junker, Johnson, & Mariano, 2002) and its signal detection theory version (DeCarlo, 2010; DeCarlo, Kim, & Johnson, 2011); and Yao's rater model (Wang & Yao, 2013). These models are most useful when all the CR items of an assessment have been scored and merged with the multiple choice items (Sgammato & Donoghue, 2018).…”
Section: Subgroup/feature Biases
mentioning
confidence: 99%
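The citation statement above lists models that fold rater severity into ability estimation. As a rough illustration (not taken from the cited report), the following sketch computes rating-category probabilities under a many-facet Rasch model of the Linacre (1989) type, where the logit for each step is examinee ability minus item difficulty minus rater severity minus a step threshold. All parameter values here are made up for the example.

```python
import math

def mfrm_category_probs(theta, delta, severity, thresholds):
    """Category probabilities under a many-facet Rasch (rating scale) model.

    P(X = k) is proportional to exp(sum over h <= k of
    (theta - delta - severity - tau_h)), with the k = 0 term fixed at exp(0) = 1.
    theta: examinee ability; delta: item difficulty;
    severity: rater severity (positive = harsher scoring);
    thresholds: step thresholds tau_1..tau_m for an (m+1)-category scale.
    """
    logits = [0.0]
    cumulative = 0.0
    for tau in thresholds:
        cumulative += theta - delta - severity - tau
        logits.append(cumulative)
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A harsher rater (larger severity) shifts probability mass toward
# lower score categories for the same examinee and item.
lenient = mfrm_category_probs(theta=0.5, delta=0.0, severity=-0.5,
                              thresholds=[-1.0, 0.0, 1.0])
severe = mfrm_category_probs(theta=0.5, delta=0.0, severity=0.5,
                             thresholds=[-1.0, 0.0, 1.0])
```

This is the mechanism the cited studies point to: if severity differences across raters are ignored, the same examinee draws systematically different expected scores depending on which rater is assigned, which biases ability estimates.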