Regular performance assessment is an integral part of (high-) risk industries. Past research shows, however, that in many fields, inter-rater reliabilities tend to be moderate to low. This study was designed to investigate the variability of performance assessment in a naturalistic setting in aviation. A modified think-aloud protocol served as the research design for investigating the reasoning that pairs of pilots use to assess the performance of an airline captain in a high-risk situation. Standard protocol analysis and interaction analysis methods were applied to the transcribed verbal protocols. The analyses confirm high variability in performance assessment and reveal the good, albeit fuzzy, justifications that assessor pairs use to ground their assessments. A fuzzy logic model yields predicted ratings that approximate the actual ratings well. Implications for the practice of performance assessment are provided.
Relevance to industry
Given that a low performance assessment can lead to re-examination and a change in employment status, many industries aim to achieve consistency in identifying true performance levels. In view of the complexity of flying a modern aircraft, variability in performance assessment may be the norm, and high inter-rater reliability may never be achievable. However, if the variability in performance assessment is a real phenomenon, as reported here, then practitioners and researchers might have to test whether it can be used positively (e.g., as an opportunity for fruitful discussions during training situations that improve the resilience of flight crews).