Cohen's κ is the most important and most widely accepted measure of interrater reliability when the outcome of interest is measured on a nominal scale. Estimates of Cohen's κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics, and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen's κ to describe the typical interrater reliability estimate across multiple studies, to quantify between-study variation, and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen's κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.

Keywords: Cohen's κ · Inter-rater reliability · Meta-analysis · Generalizability

In classical test theory, proposed by Spearman (1904), an observed score X is expressed as the true score T plus a random error of measurement e, i.e., X = T + e. Reliability is defined as the squared correlation between observed scores and true scores (Lord and Novick 1968). It indicates the extent to which scores produced by a particular measurement procedure are consistent and reproducible (Thorndike 2005). Reliability is an unobserved property of scores obtained from a sample on a particular test, not an inherent property of the test (Thompson 2002; Thompson and Vacha-Haase 2000; Vacha-Haase 1998; Vacha-Haase et al. 2002). Therefore, it is never appropriate to claim in a research article that a test is reliable or unreliable; instead, researchers should state that the scores are reliable or unreliable. Reliability estimates usually vary from one study to another due to differences in study characteristics, including study settings, test properties, and subject characteristics. A test that yields reliable scores for one group of subjects in one setting may fail to yield reliable scores for a different group of subjects in another setting. Understanding the generalizability of score reliability and the factors that affect it is therefore an important methodological issue.
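To make the quantity under meta-analysis concrete, the following is a minimal sketch of how Cohen's κ is computed for two raters classifying the same subjects on a nominal scale: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the raters' marginal distributions. The function name, rating data, and category labels below are hypothetical, introduced only for illustration.

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters on a nominal scale.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is chance agreement based on the marginals.
    """
    categories = sorted(set(ratings_a) | set(ratings_b))
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    table = np.zeros((k, k))
    for a, b in zip(ratings_a, ratings_b):
        table[index[a], index[b]] += 1       # rows: rater A, cols: rater B
    n = table.sum()
    p_o = np.trace(table) / n                # observed agreement
    p_e = (table.sum(axis=1) @ table.sum(axis=0)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters classify 10 subjects into 3 categories.
rater_a = ["I", "I", "II", "II", "III", "I", "II", "III", "III", "I"]
rater_b = ["I", "II", "II", "II", "III", "I", "I", "III", "III", "I"]
print(cohens_kappa(rater_a, rater_b))  # p_o = 0.8, p_e = 0.34, kappa ~ 0.70
```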
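Since the proposed meta-analytic framework is only summarized above, the sketch below shows one standard random-effects approach to pooling study-level estimates, the DerSimonian-Laird method. It is offered purely as an illustration of pooling and heterogeneity estimation, not as the framework proposed in this paper, which may differ in its weighting, variance estimation for κ, and moderator handling. The κ estimates and within-study variances are invented for illustration; in practice the sampling variance of each study's κ must be estimated separately.

```python
import numpy as np

def dersimonian_laird(y, v):
    """Random-effects pooling of study-level estimates y with variances v."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                                   # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)            # Cochran's Q statistic
    k = len(y)
    # Method-of-moments estimate of between-study variance, truncated at 0.
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se, tau2

# Hypothetical kappa estimates and within-study variances from five studies.
kappas = [0.62, 0.55, 0.71, 0.48, 0.66]
variances = [0.004, 0.006, 0.003, 0.008, 0.005]
pooled, se, tau2 = dersimonian_laird(kappas, variances)
print(f"pooled kappa = {pooled:.3f}, SE = {se:.3f}, tau^2 = {tau2:.4f}")
```

Here tau^2 quantifies between-study variation in κ, the quantity a moderator analysis would then try to explain.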