2020
DOI: 10.1007/s10459-020-09990-x

Re-conceptualising and accounting for examiner (cut-score) stringency in a ‘high frequency, small cohort’ performance test

Abstract: Variation in examiner stringency is an ongoing problem in many performance settings such as OSCEs, and is usually conceptualised and measured based on the scores/grades examiners award. Under borderline regression, the standard within a station is set using checklist/domain scores and global grades acting in combination. This complexity requires a more nuanced view of what stringency might mean when considering sources of variation in cut-scores across stations. This study uses data from 349 administrations of an …
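For readers unfamiliar with the standard-setting method the abstract refers to, the following is a minimal sketch of a borderline regression cut-score for a single station: checklist/domain scores are regressed on global grades, and the predicted checklist score at the "borderline" grade becomes the station cut-score. The numeric grade coding, function name, and data below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a borderline regression (BRM) cut-score for one OSCE station.
# Assumption: global grades are coded numerically (e.g. 0 = fail, 1 = borderline,
# 2 = pass, 3 = good, 4 = excellent); the data below are invented for illustration.
import numpy as np

def borderline_regression_cut_score(checklist_scores, global_grades, borderline_grade=1.0):
    """Regress checklist scores on global grades and return the predicted
    checklist score at the borderline grade (the station cut-score)."""
    slope, intercept = np.polyfit(global_grades, checklist_scores, deg=1)
    return slope * borderline_grade + intercept

# One row per candidate in a single administration of the station.
grades = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 4])
scores = np.array([8, 11, 12, 15, 16, 18, 19, 22, 23, 24])

print(f"Station cut-score: {borderline_regression_cut_score(scores, grades):.1f}")
```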

Cited by 6 publications (13 citation statements)
References: 28 publications
“…This study confirms that examiner stringency is a very important influence on station-level scoring/grading (Tables 4 and 7), and that adjusting for this does impact on station-level scores (Table 6). These findings are consistent with a wide range of literature (Homer, 2020; McManus et al., 2006; Santen et al., 2021; Yeates et al., 2018, 2021), but our work suggests that acceptable levels of overall assessment reliability can be achieved provided the number of stations is large enough (Table 5), again consistent with other empirical and/or psychometric work (Bloch & Norman, 2012; Park, 2019). There is, however, a lot of residual variance at the station level, and these results suggest that a focus on exam-level, rather than station-level, performance of candidates is likely to be more meaningful in terms of good decision-making.…”
Section: Indicative Differences in Exam-Level Decisions (RQ3) · supporting
confidence: 91%
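The point in this citation statement that reliability becomes acceptable "provided the number of stations is large enough" can be illustrated with the standard Spearman-Brown projection, treating stations as parallel components. The sketch below uses a hypothetical single-station reliability and is not the cited papers' own analysis.

```python
# Standard Spearman-Brown projection of exam reliability as the number of
# stations grows. The single-station reliability value is hypothetical and
# only illustrates why station-level noise matters less at exam level.

def spearman_brown(single_station_reliability: float, n_stations: int) -> float:
    """Projected reliability of an exam composed of n parallel stations."""
    r, k = single_station_reliability, n_stations
    return (k * r) / (1 + (k - 1) * r)

r_single = 0.15  # hypothetical reliability of a single station
for n in (6, 12, 18, 24):
    print(f"{n:2d} stations -> projected exam reliability {spearman_brown(r_single, n):.2f}")
```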
“…We argue that the statistical methods used here are valuable in quantifying error and its impact on the exam overall, but we can never be truly confident that all sources of error have been captured and accounted for properly. This in turn implies that adjusting candidate-level scores and using these for actual decision-making is hard to justify, as has been argued elsewhere (Homer, 2020).…”
Section: Study Limitations and Final Conclusion · mentioning
confidence: 99%
“…As DRIFT effects might result in additional station fails for some students, this could produce unwarranted failure for some candidates. If determined to be of sufficient importance in some instances, this effect could be mitigated by either adjusting students' station-level scores or the station-level pass mark [43] […] exams over a programme. Importantly, the small (and inconsistently observed) magnitude of the effect we have found in this study may be considered insufficiently important to warrant alterations of this nature, given that other effects (such as the number of OSCE stations [44]) are known to have a greater influence on the reliability of the test.…”
Section: Practical Implications · mentioning
confidence: 99%
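As a purely illustrative aside on the two mitigation options mentioned in this statement, the sketch below shows the arithmetic of removing an estimated drift effect either from a candidate's station score or from the station pass mark. The drift estimate and all values are hypothetical, and this is not code from the cited study.

```python
# Hedged sketch of the two mitigation routes named above: compensate for an
# estimated drift effect at the score level or at the pass-mark level.
# The drift estimate (e.g. from modelling scores against administration order)
# is assumed to exist already; all numbers here are invented.

def adjust_score_for_drift(raw_score: float, estimated_drift: float) -> float:
    """Option 1: remove the estimated drift from the candidate's station score."""
    return raw_score - estimated_drift

def adjust_pass_mark_for_drift(pass_mark: float, estimated_drift: float) -> float:
    """Option 2: leave scores untouched and shift the station pass mark instead."""
    return pass_mark + estimated_drift

drift = 0.6   # hypothetical leniency drift inflating later candidates' scores
print(adjust_score_for_drift(15.2, drift))      # 14.6: score brought back down
print(adjust_pass_mark_for_drift(14.0, drift))  # 14.6: or raise the bar instead
```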