Background
Entry into specialty training was determined by a National Assessment Centre (NAC) approach using a combination of a behavioural Multiple-Mini-Interview (MMI) and a written Situational Judgement Test (SJT). First, we asked whether interviewers using the MMI could make reliable and valid decisions about candidates' non-cognitive characteristics for selection into general practice specialty training. Second, we explored the concurrent validity of the MMI with the SJT.

Methods
A variance components analysis estimated the reliability and sources of measurement error. Further modelling estimated the optimal configurations for future MMI iterations. We calculated the relationship of the MMI with the SJT.

Results
Data were available from 1382 candidates, 254 interviewers, six MMI questions, five alternate forms of a 50-item SJT, and 11 assessment centres. For a single MMI question and one assessor, 28% of the variance between scores was due to candidate-to-candidate variation. Interviewer subjectivity, in particular the varying views that interviewers had of particular candidates, accounted for 40% of the variance in scores. The generalisability coefficient for a six-question MMI was 0.7; achieving 0.8 would require ten questions. A disattenuated correlation with the SJT (r = 0.35), and in particular a raw score correlation with the subdomain related to clinical knowledge (r = 0.25), provided evidence of construct and concurrent validity. Fewer than two per cent of candidates would have failed the MMI.

Conclusion
The MMI is a moderately reliable method of assessment in the context of a National Assessment Centre approach. The largest source of error relates to aspects of interviewer subjectivity, suggesting that enhanced interviewer training would be beneficial. MMIs need to be sufficiently long to allow precise comparison of candidates for ranking purposes.
To justify long-term, sustainable use of the MMI in a postgraduate assessment centre approach, more theoretical work is required to understand how written and performance-based tests of non-cognitive attributes can be combined in a way that achieves acceptable generalisability and validity.
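
The reported reliability figures can be cross-checked with a short calculation. The sketch below assumes that the projection from one question to several follows the Spearman-Brown form, with one assessor per question and the 28% candidate variance treated as the single-question generalisability coefficient. This is an illustrative simplification, not necessarily the model the authors fitted, and the function names are hypothetical.

```python
def projected_g(single_question_g: float, n_questions: int) -> float:
    """Spearman-Brown projection of the generalisability coefficient
    for an MMI averaged over n_questions (assumed form, see lead-in)."""
    return n_questions * single_question_g / (1 + (n_questions - 1) * single_question_g)


def questions_needed(single_question_g: float, target_g: float) -> float:
    """Number of questions (possibly fractional) needed to reach target_g
    under the same Spearman-Brown assumption."""
    return target_g * (1 - single_question_g) / (single_question_g * (1 - target_g))


# Single-question coefficient: 28% of score variance attributed to candidates.
g1 = 0.28

print(round(projected_g(g1, 6), 2))         # six-question MMI
print(round(projected_g(g1, 10), 2))        # ten-question MMI
print(round(questions_needed(g1, 0.8), 1))  # questions required for 0.8
```

Under these assumptions the projection gives roughly 0.70 for six questions and about 0.80 for ten, which is consistent with the values stated in the Results.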