2021
DOI: 10.1097/acm.0000000000004028
Measuring the Effect of Examiner Variability in a Multiple-Circuit Objective Structured Clinical Examination (OSCE)

Abstract: Supplemental Digital Content is available in the text.

Cited by 22 publications (23 citation statements). References 34 publications.
“…This study confirms that examiner stringency is a very important influence on station-level scoring/grading (Tables 4 and 7), and that adjusting for this does impact on station-level scores (Table 6). These findings are consistent with a wide range of literature (Homer, 2020; McManus et al., 2006; Santen et al., 2021; Yeates et al., 2018, 2021), but our work suggests that acceptable levels of overall assessment reliability can be achieved provided the number of stations is large enough (Table 5), again consistent with other empirical and/or psychometric work (Bloch & Norman, 2012; Park, 2019). There is, however, a lot of residual variance at the station level, and these results suggest that a focus on exam-level, rather than station-level, performance of candidates is likely to be more meaningful in terms of good decision-making.…”
Section: Indicative Differences in Exam-Level Decisions (RQ3) (supporting)
confidence: 91%
“…It is well known that variation in examiner stringency is a threat to the validity of OSCE-type assessment outcomes (Bartman et al., 2013; Harasym et al., 2008; McManus et al., 2006; Yeates et al., 2018; Yeates & Sebok-Syer, 2017). In larger OSCEs, the assessment design means that candidates are usually grouped in parallel circuits and 'see' a specific set of examiners (Khan et al., 2013; Pell et al., 2010), which makes it very difficult to disentangle examiner effects from differences in candidate ability (Yeates et al., 2021; Yeates & Sebok-Syer, 2017). In a single administration of a small OSCE there might be a unique set of examiners for each cohort of candidates, but across different exam administrations the same issues of unwanted variation in scores due to examiner stringency arise.…”
Section: Introduction (mentioning)
confidence: 99%
“…We used secondary data analysis to address this aim, drawing on data from a recent study by Yeates et al.18 derived from a summative Year 3 undergraduate OSCE at Keele University Medical School. Students were studying for the MBChB qualification, a 5-year, predominantly undergraduate, course.…”
Section: Methods (mentioning)
confidence: 99%
“…Moreover, although the videos for each station were the same for all groups of examiners, the position of the embedded videos within the OSCE sequence varied between groups, with some examiners viewing a particular video early in the sequence whilst others viewed the same video late in the sequence of performances (i.e., half of the participating examiners scored videos A&B early in the sequence and videos C&D late, whilst the other half scored videos C&D early and videos A&B late). Consequently, as Yeates et al.'s18 comparisons were derived from the combined scores allocated to both early and late videos, the balanced nature of this variation in embedded video sequence would not be expected to have influenced their comparisons. Nonetheless, this variation in sequence position enables comparison of scores allocated to the same performance when scored either early or late in the assessment sequence.…”
Section: Methods (mentioning)
confidence: 99%