To ensure adequate reliability (i.e., internal consistency), it is common in studies using event-related brain potentials (ERPs) to exclude participants for having too few trials. This practice is particularly relevant for error-related ERPs, such as the error-related negativity (ERN), where the number of recorded ERN trials is not entirely under the researcher's control. Furthermore, there is a widespread practice of inferring reliability from published psychometric research, which assumes that internal consistency is a universal property of the ERN. The present preregistered reliability generalization study examined whether there is heterogeneity in internal consistency estimates of ERN scores and whether contextual factors moderate reliability. A total of 189 internal consistency estimates from 68 samples nested within 43 studies (n = 4,499 total participants) were analyzed. There was substantial heterogeneity in ERN score internal consistency, which was partially moderated by the type of paradigm (e.g., Stroop, flanker), the clinical status of the sample, the ocular artifact correction procedure, the measurement sensors (single vs. cluster), and the approach to scoring and estimating reliability, suggesting that contextual factors impact internal consistency at the individual study level. Age, sex, year of publication, artifact rejection procedure, acquisition system, sample type (undergraduate vs. community), and length of the mean amplitude window did not significantly moderate reliability. Notably, the overall estimated reliability of ERN scores was below established standards. Recommendations for improving ERN score reliability are provided, but the routine failure of most ERN studies to report internal consistency represents a substantial barrier to understanding the factors that impact reliability.