Background

Femoral head fractures are rare but potentially disabling injuries, and classifying them accurately and consistently can help surgeons make sound treatment choices. However, there is no consensus as to which classification of these fractures is the most advantageous; parameters that might inform this choice include universality (the proportion of fractures that can be classified) as well as interobserver and intraobserver reproducibility.

Questions/purposes

(1) Which classification achieves the best universality (defined as the proportion of fractures that can be classified)? (2) Which classification delivers the highest intraobserver and interobserver reproducibility in the clinical CT assessment of femoral head fractures? (3) Based on the answers to those two questions, which classifications are the most applicable for clinical practice and research?

Methods

Between January 2011 and January 2023, 254 patients with femoral head fractures who had CT scans (CT is routine at our institution for patients who have experienced severe hip trauma) were potentially eligible for inclusion in this study, which was performed at a large Level I trauma center in China. Of those, 9% (23 patients) were excluded because of poor-quality CT images, unclosed physes, pathologic fractures, or acetabular dysplasia, leaving 91% (231 patients with 231 hips) for analysis. Of these patients, 19% (45) were female, and the mean age at the time of injury was 40 ± 17 years. All fractures were independently classified by four observers according to the Pipkin, Brumback, AO/Orthopaedic Trauma Association (OTA), Chiron, and New classifications. Each observer repeated the classifications 1 month later to allow us to ascertain intraobserver reproducibility. To evaluate the universality of each classification, we calculated the percentage of hips that could be classified using the definitions offered in that classification.
The kappa (κ) value was calculated to determine interobserver and intraobserver agreement. We then compared the classifications based on the combination of universality and interobserver and intraobserver reproducibility to determine which classifications might be recommended for clinical and research use.

Results

The universalities of the classifications were 99% (228 of 231, Pipkin), 43% (99 of 231, Brumback), 94% (216 of 231, AO/OTA), 99% (228 of 231, Chiron), and 100% (231 of 231, New). Interobserver agreement was judged as almost perfect (κ 0.81 [95% CI 0.78 to 0.84], Pipkin), moderate (κ 0.51 [95% CI 0.44 to 0.59], Brumback), fair (κ 0.28 [95% CI 0.18 to 0.38], AO/OTA), substantial (κ 0.79 [95% CI 0.76 to 0.82], Chiron), and substantial (κ 0.63 [95% CI 0.58 to 0.68], New). Intraobserver agreement was judged as almost perfect (κ 0.89 [95% CI 0.83 to 0.96]), substantial (κ 0.72 [95% CI 0.69 to 0.75]), moderate (κ 0.51 [95% CI 0.43 to 0.58]), almost perfect (κ 0.87 [95% CI 0.82 to 0.91]), and substantial (κ 0.78 [95% CI 0.59 to 0.97]), respectively. Based on these findings, we determin...
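For readers unfamiliar with the agreement statistic used above, the following is a minimal sketch of unweighted Cohen's kappa for two raters, together with the Landis and Koch verbal benchmarks (slight, fair, moderate, substantial, almost perfect) applied in the Results. It is illustrative only: the example labels are hypothetical, and a multi-rater study such as this one would typically use a multi-rater generalization (for example, Fleiss' kappa) rather than the pairwise form shown here.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance-expected agreement from each rater's marginal label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Landis & Koch (1977) benchmarks for interpreting kappa."""
    if kappa < 0:
        return "poor"
    for cutoff, label in [(0.20, "slight"), (0.40, "fair"),
                          (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= cutoff:
            return label
    return "almost perfect"

# Hypothetical example: two observers assigning Pipkin types to 8 fractures
a = ["I", "II", "II", "III", "IV", "I", "II", "III"]
b = ["I", "II", "I",  "III", "IV", "I", "II", "IV"]
k = cohen_kappa(a, b)  # observed 0.75, expected 0.25 → κ ≈ 0.667
print(round(k, 3), landis_koch(k))  # → 0.667 substantial
```

By this scale, the study's interobserver κ of 0.28 for AO/OTA reads as "fair," while 0.81 for Pipkin reads as "almost perfect."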