Background
The German quality assurance programme for evaluating work capacity is based on a peer review that evaluates the quality of medical experts’ reports. Low reliability is thought to be due to systematic differences among peers. To address this, we developed a curriculum for a standardized peer-training (SPT). This study investigates whether the SPT increases the inter-rater reliability of social medical physicians participating in a cross-institutional peer review.

Methods
Forty physicians from 16 regional German Pension Insurances took part in the SPT. The three-day training course consists of nine educational objectives recorded in a training manual. The SPT is split into a basic module, which provides basic information about the peer review, and an advanced module for small groups of up to 12 peers who practise peer review using medical reports. Feasibility was tested by assessing the selection, comprehensibility and subjective usefulness of the contents delivered, the trainers’ delivery and the design of the training materials. The effectiveness of SPT was determined by evaluating peer concordance using three anonymised medical reports assessed by each peer. Percentage agreement and Fleiss’ kappa (κm) were calculated. Concordance was compared with review results from a previous unstructured, non-standardized peer-training programme (control condition) performed by 19 peers from 12 German Pension Insurance departments. The control condition focused exclusively on the application of peer review in small groups. No specific training materials, methods or trainer instructions were used.

Results
Peer-training was shown to be feasible. The level of subjective confidence in handling the peer review instrument varied between 70 and 90%. Average percentage agreement for the main outcome criterion was 60.2%, resulting in a κm of 0.39.
By comparison, the average percentage agreement was 40.2% and the κm was 0.12 for the control condition.

Conclusion
Concordance for the main criterion was relevantly, but not significantly (p = 0.2), higher for SPT than for the control condition. Fleiss’ kappa coefficient showed that peer concordance after SPT was higher than expected by chance. Nevertheless, a κm of 0.39 for the main criterion indicated only fair inter-rater reliability, considerably lower than the conventional standard of 0.7 for adequate reliability.
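
For readers unfamiliar with the concordance measures named in the Methods, the following is a minimal sketch of how percentage agreement and Fleiss’ kappa can be computed when several peers rate the same reports. It uses the textbook formula for Fleiss’ kappa and assumes that percentage agreement means the mean proportion of agreeing rater pairs per report; the study's exact definition may differ. The function name `fleiss_agreement`, the category labels and the toy ratings are hypothetical and are not data from this study.

```python
import math
from collections import Counter


def fleiss_agreement(ratings):
    """Return (observed agreement, chance agreement, Fleiss' kappa).

    `ratings` holds one list per rated report; each inner list contains the
    category label chosen by each peer (same number of peers per report).
    """
    if not ratings:
        raise ValueError("at least one rated report is required")
    n_reports = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every report needs the same number (>= 2) of ratings")

    # How often each category was chosen for each report.
    counts = [Counter(report) for report in ratings]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing rater pairs, averaged over reports.
    pairs = n_raters * (n_raters - 1)
    p_i = [(sum(n * n for n in c.values()) - n_raters) / pairs for c in counts]
    p_bar = sum(p_i) / n_reports

    # Chance agreement from the overall share of each category.
    total = n_reports * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    # Kappa is undefined when only one category was ever used.
    kappa = (p_bar - p_e) / (1 - p_e) if p_e < 1 else math.nan
    return p_bar, p_e, kappa


# Hypothetical toy data: 3 reports, each rated by 4 peers on a 3-level scale.
toy_ratings = [
    ["A", "A", "A", "A"],
    ["A", "A", "B", "C"],
    ["C", "C", "C", "B"],
]
p_bar, p_e, kappa = fleiss_agreement(toy_ratings)
print(f"percentage agreement: {100 * p_bar:.1f}%")
print(f"Fleiss' kappa: {kappa:.3f}")
```

Kappa rescales the observed agreement by the agreement expected from the category frequencies alone, which is why a percentage agreement of around 60% can correspond to a kappa well below 0.7.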