1990
DOI: 10.1177/016327879001300405
Judge Consistency and Severity Across Grading Periods

Abstract: The purpose of this research project was to confirm that differences in the severity of judges and the stringency of grading periods occur, regardless of the nature of the assessment or the examination materials used. Three rather different examinations that require judges were analyzed, using an extended Rasch model to determine whether differences in judge severity and grading-period stringency were observable for all three examinations. Significant variation in judge severity and some variation across gradi…

Cited by 83 publications (57 citation statements)
References 6 publications
“…Some researchers contend that the level of severity a rater exercises is a relatively stable effect that changes little over time and is not modifiable by training (Bernardin and Pence, 1980; Lunz and Stahl, 1990; Lunz, Stahl, and Wright, 1996; O'Neill and Lunz, 1996; O'Neill and Lunz, 2000; Raymond, Webb, and Houston, 1991). By contrast, other researchers argue that some raters' levels of severity can shift substantially from reading to reading (Lumley and McNamara, 1993; Myford, Marr, and Linacre, 1996), from essay topic to essay topic (Bridgeman, Morgan, and Wang, 1996; Weigle, 1999), and from day to day within the same reading (Bleistein and Maneckshana, 1995; Braun, 1988; Coffman and Kurfman, 1968; Morgan, 1998; Wilson and Case, 2000; Wood and Wilson, 1974).…”
Section: Variation In Rater Severity
confidence: 99%
“…Over the last several years, a number of performance assessment programs interested in examining and understanding sources of variability in their assessment systems have been experimenting with Linacre's (1999a) Facets computer program as a monitoring tool (see, for example, Heller, Sheingold, & Myford, 1998; Linacre, Engelhard, Tatum, & Myford, 1994; Lunz & Stahl, 1990; Myford & Mislevy, 1994; Paulukonis, Myford, & Heller, in press). In this study, we build on the pioneering efforts of researchers who are employing many-facet Rasch measurement to answer questions about complex rating systems for evaluating speaking and writing.…”
Section: Review Of The Literature
confidence: 99%
“…The sixth column (z-scores, or standardized fit statistics) shows the test version rater bias estimate at this phase. Bias is the difference between expected and observed ratings of the obtained data, which is then divided by its standard error to derive the z-score (Lunz & Stahl, 1990). The most preferable z value is 0, which indicates that the data match the expected model, and thus, no rater bias.…”
Section: Results
confidence: 99%
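The z-score computation quoted above can be sketched in a few lines: the bias is the sum of observed-minus-expected ratings, divided by its standard error (here taken as the square root of the summed model variances of the ratings). This is a minimal illustration, not the citing paper's implementation; the function name and all data below are hypothetical.

```python
import math

def rater_bias_z(observed, expected, variances):
    """Standardized rater bias: sum(obs - exp) / sqrt(sum of model variances)."""
    bias = sum(o - e for o, e in zip(observed, expected))
    standard_error = math.sqrt(sum(variances))
    return bias / standard_error

# Hypothetical ratings from one rater across five examinees:
obs = [4, 3, 5, 2, 4]            # observed ratings
exp = [3.6, 3.1, 4.4, 2.5, 3.8]  # model-expected ratings
var = [0.8, 0.9, 0.7, 0.85, 0.8] # model variance of each rating
z = rater_bias_z(obs, exp, var)  # a z near 0 indicates no detectable bias
```

A rater whose z-score departs substantially from 0 rates systematically more severely or leniently than the model expects.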
“…Considerable evidence of poor rater consistency has been reported in some research (e.g., Lunz & Stahl, 1990; Trace, Janssen, & Meier, 2017), and even if adequate consistency might have been reported in most research, it is mostly on the basis of correlations alone. That is, even a perfect correlation might ignore systematic variations among raters.…”
Section: Rater Behavior In Oral Performance Assessment
confidence: 99%