2021
DOI: 10.3390/educsci11100648

Low Inter-Rater Reliability of a High Stakes Performance Assessment of Teacher Candidates

Abstract: The Performance Assessment for California Teachers (PACT) is a high stakes summative assessment that was designed to measure pre-service teacher readiness. We examined the inter-rater reliability (IRR) of trained PACT evaluators who rated 19 candidates. As measured by Cohen’s weighted kappa, the overall IRR estimate was 0.17 (poor strength of agreement). IRR estimates ranged from −0.29 (worse than expected by chance) to 0.54 (moderate strength of agreement); all were below the standard of 0.70 for consensus agreement. […]
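
As a rough illustration of the statistic reported in the abstract, the following is a minimal sketch of how a weighted Cohen's kappa could be computed for two raters scoring the same submissions on a 1–4 ordinal rubric. The score vectors and the choice of linear weights are assumptions made for illustration only; they are not data or settings from the study.

```python
# Minimal sketch: weighted Cohen's kappa for two raters on a 1-4 ordinal rubric.
# The ratings below are made-up placeholders, not data from the PACT study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two trained evaluators for the same ten submissions
# (1 = Fail, 2 = Basic/Pass, 3 = Proficient, 4 = Advanced).
rater_a = [2, 3, 3, 2, 4, 1, 3, 2, 2, 3]
rater_b = [3, 3, 2, 2, 4, 2, 4, 2, 1, 3]

# Linear weights penalize disagreements in proportion to their ordinal distance;
# whether the study used linear or quadratic weights is not stated in this excerpt.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Weighted kappa: {kappa:.2f}")
```

Values near 0 indicate agreement no better than chance, which is why the overall estimate of 0.17 is characterized as poor and falls well short of the 0.70 standard cited in the abstract.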

Cited by 4 publications (14 citation statements)
References 23 publications
“…Again, significant gaps can be observed across diverse psychological dimensions, for instance, in the assessment of personality disorders (Fiedler et al, 2004) and in the context of daily behavior (Vazire & Meehl, 2008). Multiple-rater reports of a target person show the same rating discrepancies as self-other rating comparisons that can be found, for example, in the selection of educators (Lyness et al, 2021), in performance ratings (Atkins & Wood, 2006;Fleenor et al, 1996), and in medical diagnoses (Nawka & Knoerding, 2012). Regarding these rating discrepancies (i.e., self-observer and observer-observer), a prominent question in personality assessment goes mostly unnoticed: Is high inter-rater reliability (or self-peer consistency) always desirable?…”
Section: Discrepancies Between Different Perspectives Dating Back To ... (mentioning; confidence: 73%)
“…Each of these five tasks includes two or three criteria (see Table 1), which are each scored on a 4-point ordinal scale, with 1 = Fail, 2 = Basic or Pass, 3 = Proficient, and 4 = Advanced. These ordinal scores are derived from the evaluation of two types of evidence in PACT submissions: artifacts (evidence that candidates submit to show teacher competence, e.g., lesson plans, videos, student work samples) and commentaries (written responses to standardized questions that provide context and rationales for the artifacts submitted) [12]. [Table 1 note: assessed throughout the teacher performance event.]…”
Section: Methods (mentioning; confidence: 99%)
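
To make the rubric structure described in the excerpt above concrete, here is a small, hypothetical sketch of how one criterion-level score and its supporting evidence could be represented. Only the 1–4 ordinal levels and the artifact/commentary evidence types come from the cited description; the task and criterion names, file references, and field layout are invented placeholders.

```python
# Hypothetical sketch of a criterion-level PACT-style score record.
# Only the 1-4 ordinal scale and the artifact/commentary evidence types are
# taken from the cited description; all names below are invented.
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List

class RubricLevel(IntEnum):
    FAIL = 1
    BASIC_OR_PASS = 2
    PROFICIENT = 3
    ADVANCED = 4

@dataclass
class CriterionScore:
    task: str                    # e.g., "Planning" (placeholder task name)
    criterion: str               # e.g., "Instructional strategies" (placeholder)
    level: RubricLevel           # ordinal score assigned by the evaluator
    artifacts: List[str] = field(default_factory=list)     # lesson plans, videos, work samples
    commentaries: List[str] = field(default_factory=list)  # responses to standardized questions

score = CriterionScore(
    task="Planning",
    criterion="Instructional strategies",
    level=RubricLevel.PROFICIENT,
    artifacts=["lesson_plan.pdf", "classroom_video.mp4"],
    commentaries=["planning_commentary.docx"],
)
print(score.task, score.criterion, score.level.name, int(score.level))
```
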
“…After initial training, the evaluators were required to attend annual re-calibration events. For more details about PACT evaluator training and calibration, please see [12].…”
Section: PACT Evaluator Training, Calibration and Scoring (mentioning; confidence: 99%)