The lack of standardized criteria for measuring therapeutic response is a major obstacle to the development of new therapeutic agents for chronic graft-versus-host disease (cGVHD). National Institutes of Health (NIH) consensus criteria for evaluating therapeutic response were published in 2006. We report the results of four consecutive pilot trials that evaluated the feasibility of these response criteria and estimated their inter-rater reliability and minimum detectable change.
Hematology-oncology clinicians with limited experience in applying the NIH cGVHD response criteria (n=34) participated in a 2.5-hour training session on response evaluation in cGVHD. Feasibility and inter-rater reliability between subspecialty cGVHD experts and this panel of clinician raters were examined in a sample of 25 children and adults with cGVHD. The minimum detectable change was calculated using the standard error of measurement.
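As a rough illustration of how a minimum detectable change can be derived from the standard error of measurement, the sketch below uses the conventional formulas SEM = SD × sqrt(1 − reliability) and MDC = z × sqrt(2) × SEM. The 95% confidence multiplier (z = 1.96), the function names, and the input values are assumptions chosen for illustration. The abstract does not state which reliability coefficient or confidence level the authors used.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), with reliability given as a coefficient in [0, 1]."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def minimum_detectable_change(sem, z=1.96):
    """MDC = z * sqrt(2) * SEM.

    The sqrt(2) term reflects measurement error in both of the two
    assessments being compared; z = 1.96 is an assumed 95% confidence level.
    """
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs for illustration only; these are not values from the study.
sem = standard_error_of_measurement(sd=10.0, reliability=0.75)
mdc = minimum_detectable_change(sem)
print(f"SEM = {sem:.2f}, MDC95 = {mdc:.2f}")
```

Under these assumptions, a change between two assessments smaller than the MDC cannot be distinguished from measurement error.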
Clinicians’ impressions of the brief training session, the photo atlas, and the response criteria documentation tools were generally favorable. Performing and documenting the full set of response evaluations required a median of 21 minutes (range 12 to 60 minutes) per rater. The Schirmer tear test required the greatest time of any single test (median 9 minutes). Overall, inter-rater agreement for skin and oral manifestations was modest; however, in the third and fourth trials, agreement between clinicians and experts for all dimensions except movable sclerosis approached satisfactory values. In the final two trials, the threshold for defining change exceeding measurement error was 19–22% body surface area (BSA) for erythema, 18–26% BSA for movable sclerosis, 17–21% BSA for nonmovable sclerosis, and 2.1–2.6 points on the 15-point NIH Oral cGVHD scale. Agreement between clinician-expert pairs was moderate to substantial for the measures of functional capacity and for the gastrointestinal and global cGVHD rating scales.
These results suggest that the NIH response criteria are feasible to use. The reliability estimates are encouraging because they were observed after a single 2.5-hour training session given at multiple transplant centers, with no opportunity for iterative training and calibration. Further research is needed to evaluate inter- and intra-rater reliability in larger samples and to evaluate these response criteria as predictors of outcomes in clinical trials.