It has become increasingly clear that the COVID-19 epidemic is characterized by overdispersion, whereby the majority of transmission is driven by a minority of infected individuals. Such a strong departure from the homogeneity assumptions of traditional well-mixed compartment models is usually hypothesized to be the result of short-term super-spreader events, such as an individual's extreme rate of virus shedding at peak infectivity while attending a large gathering without appropriate mitigation. However, heterogeneity can also arise through long-term, or persistent, variations in individual susceptibility or infectivity. Here, we show how to incorporate persistent heterogeneity into a wide class of epidemiological models, and derive a non-linear dependence of the effective reproduction number R_e on the susceptible population fraction S. Compared to overdispersion, persistent heterogeneity has three important consequences: (1) it results in a major modification of the early epidemic dynamics; (2) it significantly suppresses the herd immunity threshold; (3) it significantly reduces the final size of the epidemic. We estimate the social and biological contributions to persistent heterogeneity using data on real-life face-to-face contact networks and on the age variation of the incidence rate during the COVID-19 epidemic, and show that empirical data from the COVID-19 epidemic in New York City (NYC), Chicago, and all 50 US states provide a consistent characterization of the level of persistent heterogeneity. Our estimates suggest that the hardest-hit areas, such as NYC, are close to the persistent heterogeneity herd immunity threshold following the first wave of the epidemic, thereby limiting the spread of infection to other regions during a potential second wave of the epidemic. Our work implies that persistent heterogeneity, in addition to overdispersion, acts to limit the scale of pandemics.