Low Inter-Rater Reliability of a High Stakes Performance Assessment of Teacher Candidates

Lyness, Scott A.; Peterson, Kent D.; Yates, Kenneth A.

doi:10.3390/educsci11100648

Cited by 4 publications

(14 citation statements)

References 23 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Again, significant gaps can be observed across diverse psychological dimensions, for instance, in the assessment of personality disorders (Fiedler et al, 2004) and in the context of daily behavior (Vazire & Meehl, 2008). Multiple-rater reports of a target person show the same rating discrepancies as self-other rating comparisons that can be found, for example, in the selection of educators (Lyness et al, 2021), in performance ratings (Atkins & Wood, 2006;Fleenor et al, 1996), and in medical diagnoses (Nawka & Knoerding, 2012). Regarding these rating discrepancies (i.e., self-observer and observer-observer), a prominent question in personality assessment goes mostly unnoticed: Is high inter-rater reliability (or self-peer consistency) always desirable?…”

Section: Discrepancies Between Different Perspectives Dating Back To ...mentioning

confidence: 73%

Asymmetric sampling of personality (ASP): A novel approach to understanding validity limits in personality assessment and possible remedies

Grüning,

Lechner,

Le Mens

et al. 2024

Preprint

View full text Add to dashboard Cite

Assessing people's personalities using self-reports is complicated by three central problems: low predictability of behavior, discrepancies between self- and observer reports, and divergent target reports across observers. Going beyond existing research on common survey biases, we introduce an information sampling bias that can explain all three problems. In judgment and decision research, asymmetric sampling occurs when an individual can only gather a sample of information about a target object (e.g., environment, person) from his or her own experience. It follows that any personal sample is limited by the environment and, given behavioral baselines, is selective for certain experiences (e.g., based on positive affect or habits). We apply the sampling framework to personality assessment and show that, independent of motivational constraints, selective sampling alone invokes an asymmetric mental model of one's own (but also others') personality (e.g., extraversion) in which certain situations are over- or under-represented. We call this asymmetric sampling of personality (ASP). Asymmetric samples of experienced situations lack the generalizability to reliably predict behavior (as a tendency to behave consistently across situations). Moreover, differently biased situation samples can explain self-observer discrepancies in personality ratings as well as multiple observer divergences, calling for a revision of quality standards for inter-rater reliability. Understanding ASP provides a methodological solution to these three problems: it assists all respondents (i.e., self- and peer-observers) in symmetrically sampling situations that generally make it easy or difficult to express a particular personality trait. We discuss ASP in assessment contexts, as well as procedural extensions of the sampling framework, and suggest specific study paradigms for quantifying individual bias due to ASP.

show abstract

Section: Discrepancies Between Different Perspectives Dating Back To ...mentioning

confidence: 73%

Asymmetric sampling of personality (ASP): A novel approach to understanding validity limits in personality assessment and possible remedies

Grüning,

Lechner,

Le Mens

et al. 2024

Preprint

View full text Add to dashboard Cite

show abstract

“…Each of these five tasks includes two or three criteria (see Table 1), which are each scored on a 4-point ordinal scale, with 1 = Fail, 2 = Basic or Pass, 3 = Proficient, and 4 = Advanced. These ordinal scores are derived from the evaluation of two types of evidence in PACT submissions: artifacts (evidence that candidates submit to show teacher competence, e.g., lesson plans, videos, student work samples) and commentaries (written responses to standardized questions that provide context and rationales for the artifacts submitted) [12]. a Assessed throughout the teacher performance event.…”

Section: Methodsmentioning

confidence: 99%

“…After initial training, the evaluators were required to attend annual re-calibration events. For more details about PACT evaluator training and calibration, please see [12].…”

Section: Pact Evaluator Training Calibration and Scoringmentioning

confidence: 99%

“…I wanted to examine the IRR in our candidates' edTPA data, but when double-scored data were requested, access was not provided because the data are proprietary to Pearson Inc. In a prior study [12], IRR was assessed in the PACT during the last academic year (2014) of its use at our school. In the current study, the sample size has been expanded by over nine times, and the sample includes PACT data that were collected during the duration of its use at our institution, from 2010 to 2015, before its replacement by the edTPA.…”

Section: Introductionmentioning

confidence: 99%

“…In the current study, the sample size has been expanded by over nine times, and the sample includes PACT data that were collected during the duration of its use at our institution, from 2010 to 2015, before its replacement by the edTPA. Because of the high-stakes nature of the TPA (receiving or not receiving your credential), examining IRR data was important for several reasons: (a) IRR data in TPAs have been reported as percent agreement without correction for chance agreement, and (b) double-scored data are not frequently reported and are even harder to obtain now that they are often proprietary to corporations such as Pearson [12].…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

More Evidence of Low Inter-Rater Reliability of a High-Stakes Performance Assessment of Teacher Candidates

Lyness

2024

Education Sciences

View full text Add to dashboard Cite

From 2010 to 2015, our school of education used the Performance Assessment for California Teachers (PACT), a summative assessment designed to assess preservice teacher competence. Candidate portfolios were uploaded to an evaluation portal, and trained evaluators assigned a final score of Pass or Fail to the work samples. Three consensus estimates of inter-rater reliability of 181 candidate portfolios that were either double- or triple-scored were computed. Two chance-corrected estimates of inter-rater reliability (Cohen’s kappa and Gwet’s AC1) and percent agreement were computed and calculated within five content areas: elementary math, secondary history/social science, math, science, and English language arts. An initial Pass or Fail score was not more likely to be followed by either a Pass or Fail score given by a subsequent evaluator. Inter-rater reliability was interpreted as being low across all content areas that were examined. None of the percent agreement coefficients attained the minimum standard of 0.700 for consensus agreement. Increasing research access to proprietary double-scored data would lead to an increased understanding of, and perhaps improvement in, teacher performance assessments.

show abstract

The analysis of marking reliability through the approach of gauge repeatability and reproducibility (GR&R) study: a case of English-speaking test

Sureeyatanapas,

Panitanarak

et al. 2024

Lang Test Asia

View full text Add to dashboard Cite

Ensuring consistent and reliable scoring is paramount in education, especially in performance-based assessments. This study delves into the critical issue of marking consistency, focusing on speaking proficiency tests in English language learning, which often face greater reliability challenges. While existing literature has explored various methods for assessing marking reliability, this study is the first of its kind to introduce an alternative statistical tool, namely the gauge repeatability and reproducibility (GR&R) approach, to the educational context. The study encompasses both intra- and inter-rater reliabilities, with additional validation using the intraclass correlation coefficient (ICC). Using a case study approach involving three examiners evaluating 30 recordings of a speaking proficiency test, the GR&R method demonstrates its effectiveness in detecting reliability issues over the ICC approach. Furthermore, this research identifies key factors influencing scoring inconsistencies, including group performance estimation, work presentation order, rubric complexity and clarity, the student’s chosen topic, accent familiarity, and recording quality. Importantly, it not only pinpoints these root causes but also suggests practical solutions, thereby enhancing the precision of the measurement system. The GR&R method can offer significant contributions to stakeholders in language proficiency assessment, including educational institutions, test developers and policymakers. It is also applicable to other cases of performance-based assessments. By addressing reliability issues, this study provides insights to enhance the fairness and accuracy of subjective judgements, ultimately benefiting overall performance comparisons and decision making.

show abstract

Low Inter-Rater Reliability of a High Stakes Performance Assessment of Teacher Candidates

Cited by 4 publications

References 23 publications

Asymmetric sampling of personality (ASP): A novel approach to understanding validity limits in personality assessment and possible remedies

Asymmetric sampling of personality (ASP): A novel approach to understanding validity limits in personality assessment and possible remedies

More Evidence of Low Inter-Rater Reliability of a High-Stakes Performance Assessment of Teacher Candidates

The analysis of marking reliability through the approach of gauge repeatability and reproducibility (GR&R) study: a case of English-speaking test

Contact Info

Product

Resources

About