Objective. To determine the level of agreement on disease flare severity (distinguishing severe, moderate, and mild flare and persistent disease activity) in a large paper‐patient exercise involving 988 individual cases of systemic lupus erythematosus.

Methods. A total of 988 individual lupus case histories were assessed by 3 individual physicians. Complete agreement about the degree of flare (or persistent disease activity) was obtained in 451 cases (46%), and these provided the reference standard for the second part of the study. This component used 3 flare activity instruments: the British Isles Lupus Assessment Group (BILAG) 2004, the Safety of Estrogens in Lupus Erythematosus National Assessment (SELENA) flare index (SFI), and the revised SELENA flare index (rSFI). The 451 patient case histories were distributed to 18 pairs of physicians, carefully randomized in a manner designed to ensure a fair case mix and equal distribution of flare according to severity.

Results. The 3‐physician assessment of flare matched the level of flare assigned by each of the 3 indices in 67% of cases for BILAG 2004, 72% for SFI, and 70% for rSFI. The corresponding weighted kappa coefficients for each instrument were 0.82, 0.59, and 0.74, respectively. We undertook a detailed analysis of the discrepant cases, and several factors emerged, including a tendency to score moderate flares as severe and persistent activity as flare, especially when the SFI and rSFI instruments were used. Overscoring was also driven by scoring treatment change as flare, even if there were no new or worsening clinical features.

Conclusion. Given the complexity of assessing lupus flare, we were encouraged by the overall results. However, the problem of capturing lupus flare accurately is not completely solved.
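
For readers unfamiliar with the weighted kappa statistic quoted in the Results, the following is a minimal sketch of how such a coefficient can be computed for two sets of ordinal ratings. It is illustrative only: the abstract does not state which weighting scheme was used, so linear weights are assumed by default here; the ordering of the four categories (persistent activity, mild, moderate, severe) is likewise an assumption; and the function name `weighted_kappa` and the example ratings are hypothetical, not study data.

```python
def weighted_kappa(rater_a, rater_b, n_categories, weights="linear"):
    """Weighted Cohen's kappa for two equal-length lists of ordinal ratings.

    Ratings are integers in 0..n_categories-1. The weighting scheme
    ("linear" or "quadratic") is an assumption; the abstract does not
    say which one the authors used.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")

    n = len(rater_a)
    k = n_categories

    # Cross-tabulate the two sets of ratings.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Accumulate weighted observed and chance-expected disagreement.
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            distance = abs(i - j) / (k - 1)
            w = distance if weights == "linear" else distance ** 2
            obs_disagreement += w * observed[i][j] / n
            exp_disagreement += w * row_totals[i] * col_totals[j] / (n * n)

    if exp_disagreement == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical example (not study data):
# 0 = persistent activity, 1 = mild, 2 = moderate, 3 = severe flare.
reference = [0, 1, 2, 3, 2, 1, 0, 3]
instrument = [1, 1, 3, 3, 2, 1, 0, 3]
kappa = weighted_kappa(reference, instrument, n_categories=4)
```

Because the weights grow with the distance between categories, scoring a moderate flare as severe is penalized less than scoring persistent activity as a severe flare, which is why a weighted coefficient suits ordered severity grades.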