Abstract: We examine the discrete distributional form that arises from the "classical occupancy problem," which looks at the behavior of the number of occupied bins when we allocate a given number of balls uniformly at random to a given number of bins. We review the mass function and moments of the classical occupancy distribution and derive exact and asymptotic results for the mean, variance, skewness and kurtosis. We develop an algorithm to compute a cubic array of log-probabilities from the classical occupancy distri…
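As a rough illustration of the mass function described in the abstract, the sketch below computes log-probabilities for the number of occupied bins by adding one ball at a time. It is not the paper's cubic-array algorithm, only a minimal one-dimensional recursion; the function name `occupancy_log_pmf` and its parameters are hypothetical.

```python
import math


def occupancy_log_pmf(n_balls, n_bins):
    """Return a list lp where lp[k] = log P(K = k), k = 0..n_bins.

    K is the number of occupied bins after n_balls balls are dropped
    uniformly at random into n_bins bins (hypothetical helper).
    """
    neg_inf = float("-inf")
    # With zero balls, zero bins are occupied with probability one.
    log_p = [0.0] + [neg_inf] * n_bins
    for _ in range(n_balls):
        new = [neg_inf] * (n_bins + 1)
        for k in range(1, n_bins + 1):
            terms = []
            if log_p[k] > neg_inf:
                # The new ball lands in one of the k occupied bins.
                terms.append(log_p[k] + math.log(k / n_bins))
            if log_p[k - 1] > neg_inf:
                # The new ball lands in one of the n_bins - (k - 1) empty bins.
                terms.append(log_p[k - 1] + math.log((n_bins - k + 1) / n_bins))
            if terms:
                # Log-sum-exp keeps the update stable in log space.
                top = max(terms)
                new[k] = top + math.log(sum(math.exp(t - top) for t in terms))
        log_p = new
    return log_p
```

Exponentiating the returned values should give probabilities that sum to one, with mean m(1 - (1 - 1/m)^n) for n balls and m bins.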
“…Given n_g, the number of parental lineages n_p follows the modified occupancy distribution (also known as the Arfwedson distribution) [Wakeley, 2009, O'Neill, 2019, Johnson et al, 2005]:…”
Section: S14 Distribution Of Number Of Sampled Lineages (mentioning)
confidence: 99%
“…The occupancy distribution requires exchangeability of the alleles, which is satisfied by the condition that all parental alleles are derived. Combining the two distributions together through equation 16, we get: We did not find an analytical expression for this sum, but it can be computed efficiently using methods presented in [O'Neill, 2019]. Figure S9A (dotted line) shows the distribution of the number of contributing parental lineages for several selection coefficients with n_o = 200.…”
Section: S14 Distribution Of Number Of Sampled Lineages (mentioning)
confidence: 99%
“…The distribution over the number of gametes, n_g, is given by the negative binomial, parameterized by the number of successes, n_o, and the probability of a successful draw, 1 - s. Given n_g, the number of parental lineages n_p follows the modified occupancy distribution (also known as the Arfwedson distribution) [Wakeley, 2009, O’Neill, 2019, Johnson et al, 2005]: where S_2(n_g, n_p) is a Stirling number of the second kind, which is the number of ways to partition n_g gametes into n_p parents (see Johnson et al [2005] section 10.4 for a thorough treatment). The occupancy distribution requires exchangeability of the alleles, which is satisfied by the condition that all parental alleles are derived. Combining the two distributions together through equation 16, we get: …”
Section: Additional Figures (mentioning)
confidence: 99%
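The combination described in the statement above (a negative binomial over the number of gametes, mixed with an occupancy distribution over parental lineages) can be sketched numerically as a truncated sum. This is only an illustration under stated assumptions: the negative binomial is taken to count the total draws n_g needed to reach n_o successes with success probability 1 - s, the number of available parental alleles `n_bins` is a hypothetical parameter, and the function names are invented here. It is not the method of the cited papers.

```python
import math


def neg_binomial_log_pmf(g, n_o, s):
    """log P(n_g = g) for g total draws until n_o successes (assumed convention).

    Each draw succeeds with probability 1 - s, with 0 < s < 1.
    """
    if g < n_o:
        return float("-inf")
    # log C(g - 1, n_o - 1) via log-gamma.
    log_choose = math.lgamma(g) - math.lgamma(n_o) - math.lgamma(g - n_o + 1)
    return log_choose + n_o * math.log(1 - s) + (g - n_o) * math.log(s)


def parental_lineage_pmf(n_o, s, n_bins, tol=1e-12, max_g=100000):
    """Approximate P(n_p = k) for k = 0..n_bins by summing over n_g.

    n_bins (number of parental alleles) is a hypothetical parameter.
    The sum stops once the negative binomial mass covered exceeds 1 - tol.
    """
    occ = [1.0] + [0.0] * n_bins  # occupancy pmf after zero balls
    mix = [0.0] * (n_bins + 1)
    covered = 0.0
    for g in range(1, max_g + 1):
        # Add one ball: stay at k (hit an occupied bin) or move from k - 1.
        occ = [
            (occ[k] * k + (occ[k - 1] * (n_bins - k + 1) if k > 0 else 0.0)) / n_bins
            for k in range(n_bins + 1)
        ]
        w = math.exp(neg_binomial_log_pmf(g, n_o, s))
        if w > 0.0:
            for k in range(n_bins + 1):
                mix[k] += w * occ[k]
            covered += w
        if covered >= 1.0 - tol:
            break
    return mix
```

For example, `parental_lineage_pmf(200, 0.05, 2000)` would return the approximate distribution of n_p for n_o = 200 under these assumptions; the mass left out by truncation is at most `tol` unless `max_g` is reached first.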
“…The dotted line indicates regimes where the Gaussian approximation is likely inaccurate. In both panels, N = 1000. We did not find an analytical expression for this sum, but it can be computed efficiently using methods presented in [O’Neill, 2019]. Figure S9A (dotted line) shows the distribution of the number of contributing parental lineages for several selection coefficients with n_o = 200.…”
Section: Additional Figures (mentioning)
confidence: 99%
“…Given n_g, the number of parental lineages n_p follows the modified occupancy distribution (also known as the Arfwedson distribution) [Wakeley, 2009, O’Neill, 2019, Johnson et al, 2005]: where S_2(n_g, n_p) is a Stirling number of the second kind, which is the number of ways to partition n_g gametes into n_p parents (see Johnson et al [2005] section 10.4 for a thorough treatment). The occupancy distribution requires exchangeability of the alleles, which is satisfied by the condition that all parental alleles are derived.…”
The fate of mutations and the genetic load of populations depend on the relative importance of genetic drift and natural selection. In addition, the accuracy of numerical models of evolution depends on the strength of both selection and drift: strong selection breaks the assumptions of the nearly neutral model, and drift coupled with large sample sizes breaks Kingman's coalescent model.
Thus, the regime with strong selection and large sample sizes, relevant to the study of pathogenic variation, appears particularly daunting. Surprisingly, we find that the interplay of drift and selection in that regime can be used to define asymptotically closed recursions for the distribution of allele frequencies that are accurate well beyond the strong selection limit.
Selection becomes more analytically tractable when the sample size n is larger than twice the population-scaled selection coefficient: n >= 2Ns (4Ns in diploids), that is, when the expected number of coalescent events in the sample is larger than the number of selective events. We construct the relevant transition matrices, show how they can be used to accurately compute distributions of allele frequencies, and show that the distribution of deleterious allele frequencies is sensitive to details of the evolutionary model.
We examine the negative occupancy distribution and the coupon-collector distribution, both of which arise as distributions relating to hitting times in the extended occupancy problem. These distributions constitute a full solution to a generalised version of the coupon collector problem, by describing the behaviour of the number of items we need to collect to obtain a full collection or a partial collection of any size. We examine the properties of these distributions and show how they can be computed and approximated. We give some practical guidance on the feasibility of computing large blocks of values from the distributions, and when approximation is required.
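To make the "partial collection of any size" idea concrete, the sketch below uses the standard textbook result for the expected number of uniform random draws needed to collect k distinct items out of m equally likely types. It is a sum of geometric waiting times, not the distributions or algorithms developed in the abstract's paper; the name `expected_draws` is hypothetical.

```python
def expected_draws(n_types, k):
    """Expected number of uniform draws to see k distinct items out of n_types.

    After i distinct items have been seen, the wait for a new one is
    geometric with success probability (n_types - i) / n_types.
    """
    if not 0 <= k <= n_types:
        raise ValueError("k must satisfy 0 <= k <= n_types")
    return sum(n_types / (n_types - i) for i in range(k))
```

With `n_types = 6` and `k = 6` (a full collection from a fair die), this gives 6 × (1 + 1/2 + ... + 1/6) = 14.7 draws on average; a partial collection with `k = 3` needs only 1 + 6/5 + 6/4 = 3.7.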