Compared to a neutral model, purifying selection distorts the structure of genealogies and hence alters the patterns of sampled genetic variation. Although these distortions may be common in nature, our understanding of how we expect purifying selection to affect patterns of molecular variation remains incomplete. Genealogical approaches such as coalescent theory have proven difficult to generalize to situations involving selection at many linked sites, unless selection pressures are extremely strong. Here, we introduce an effective coalescent theory (a "fitness-class coalescent") to describe the structure of genealogies in the presence of purifying selection at many linked sites. We use this effective theory to calculate several simple statistics describing the expected patterns of variation in sequence data, both at the sites under selection and at linked neutral sites. Our analysis combines a description of the allele frequency spectrum in the presence of purifying selection with the structured coalescent approach of Kaplan et al. (1988), to trace the ancestry of individuals through the distribution of fitnesses within the population. We also derive our results using a more direct extension of the structured coalescent approach of Hudson and Kaplan (1994). We find that purifying selection leads to patterns of genetic variation that are related but not identical to those expected in a neutrally evolving population whose size has varied in a specific way in the past.
Purifying selection acting simultaneously at many linked sites ("background selection") can substantially alter the patterns of molecular variation at these sites and at linked neutral sites (Hill and Robertson 1966; Kaplan et al. 1988; Kaplan 1994, 1995; McVean and Charlesworth 2000; Gordo et al. 2002; O'Fallon et al. 2010; Seger et al. 2010). In recent years, evidence from sequence data points to the general importance of these selective forces among many linked variants in microbial and viral populations and on short distance scales in the genomes of sexual organisms (Comeron et al. 2008; Hahn 2008; Seger et al. 2010). In these situations, existing theory does not fully explain patterns of molecular evolution (Hahn 2008).
It is difficult to incorporate negative selection at many linked sites into genealogical frameworks such as coalescent theory, because these frameworks typically rely on characterizing the space of possible genealogical trees before considering the possibility of mutations at various locations on these trees. When selection operates, the probabilities of particular trees cannot be defined independently of the mutations, and the approach breaks down (Tavare 2004; Wakeley 2009).
Despite this difficulty, a number of productive approaches have been developed to predict how negative selection influences patterns of molecular variation and to infer selection pressures from data. Charlesworth et al. (1993) introduced the background selection model and showed that strong purifying selection reduces the effective population size relevant for...