Interim and summative assessments are often used to make decisions about students' writing skills and instructional needs, but the extent to which different raters and score types might introduce bias against some groups of students is largely unknown. To evaluate this possibility, we analyzed interim writing assessments and state summative test data for 2,621 students in Grades 3-11. Both teachers familiar with the students and researchers unaware of students' identifying characteristics evaluated the interim assessments with analytic rubrics. Teachers assigned higher scores on the interim assessments than researchers did. Female students had higher scores than male students, and English learners (ELs), students eligible for free or reduced-price school lunch (FRL), and students eligible for special education (SPED) had lower scores than other students. These differences were smaller for researcher ratings than for teacher ratings. Across grade levels, interim assessment scores were similarly predictive of state rubric scores, scale scores, and proficiency designations across student groups. However, students identified as Hispanic, FRL, EL, or SPED had lower scale scores and a lower likelihood of reaching proficiency on the state exam. As a result, these students' risk of unsuccessful performance on the state exam would be greater than predicted from their interim assessment scores. These findings highlight the potential importance of masking student identities when evaluating writing to reduce scoring bias and suggest that the written composition portions of high-stakes writing examinations may be less biased against historically marginalized groups than the multiple-choice portions of these exams.
Impact and Implications
This study examined the extent to which there may be bias in evaluating students' writing performance in Grades 3-11, which has implications for the use of writing assessment scores to make instructional decisions. Findings suggest that scoring bias might exist in teacher ratings of written compositions for some groups of students (e.g., English learners, students in special education, and students eligible for free or reduced-price lunch), and group differences were apparent in the multiple-choice items on the state writing assessment. Bias in writing assessment might be reduced by using composition tests (as opposed to multiple-choice items) and by having raters who are well trained and masked to students' identities score those compositions.