Estimating the Prediction Function and the Number of Unseen Species in Sampling with Replacement

Boneh, Shahar; Boneh, A.; Caron, R. J.

doi:10.2307/2669633

Cited by 17 publications

(23 citation statements)

References 0 publications


“…By using the estimator of Boneh et al (5) and applying the suggested bias reduction procedure, we calculated that there are 68 unseen species in addition to the 347 found, for a total estimate of 415 subgingival species. Table 3 shows the number of additional species that one would expect to identify by examining various numbers of additional clones.…”

Section: Results (mentioning)

confidence: 99%

“…N k is the number of species seen k times. Bias correction and other details of the estimator are beyond the scope of the current discussion, and the reader is referred to the publication of Boneh et al (5).…”

Section: Subject Populations (I) Refractory Periodontitis (mentioning)

confidence: 99%
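
The quantity N k quoted above (the number of species seen k times) can be tabulated directly from a list of observations. The following is a minimal sketch, not the estimator of Boneh et al.; the function name `frequency_counts` and the toy clone labels are assumptions made for illustration only.

```python
from collections import Counter


def frequency_counts(observations):
    """Return {k: N_k}, where N_k is the number of species seen exactly k times."""
    per_species = Counter(observations)         # species -> number of times seen
    return dict(Counter(per_species.values()))  # k -> number of species seen k times


# Hypothetical toy sample: each entry is the species assigned to one clone.
clones = ["a", "b", "a", "c", "d", "d", "d", "e"]
n_k = frequency_counts(clones)

print(sorted(n_k.items()))                    # [(1, 3), (2, 1), (3, 1)]
print(sum(n_k.values()))                      # number of distinct species seen: 5
print(sum(k * c for k, c in n_k.items()))     # total number of clones: 8
```

The two sums are useful sanity checks: the N k values add up to the number of species observed, and the k-weighted sum recovers the sample size.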

“…In addition to ecological studies of biological diversity, similar methods have been used to estimate the number of words known but not used by Shakespeare and other authors (12). In this study, the number of unseen species that were missed was calculated with an improved estimator as proposed by Boneh et al (5). The estimator is based on a continuous time model of parallel independent Poisson processes.…”

Section: Subject Populations (I) Refractory Periodontitis (mentioning)

confidence: 99%


Bacterial Diversity in Human Subgingival Plaque

Paster, Boches, Galvin, et al. 2001. J Bacteriol.


The purpose of this study was to determine the bacterial diversity in the human subgingival plaque by using culture-independent molecular methods as part of an ongoing effort to obtain full 16S rRNA sequences for all cultivable and not-yet-cultivated species of human oral bacteria. Subgingival plaque was analyzed from healthy subjects and subjects with refractory periodontitis, adult periodontitis, human immunodeficiency virus periodontitis, and acute necrotizing ulcerative gingivitis. 16S ribosomal DNA (rDNA) bacterial genes from DNA isolated from subgingival plaque samples were PCR amplified with all-bacterial or selective primers and cloned into Escherichia coli. The sequences of cloned 16S rDNA inserts were used to determine species identity or closest relatives by comparison with sequences of known species. A total of 2,522 clones were analyzed. Nearly complete sequences of approximately 1,500 bases were obtained for putative new species. About 60% of the clones fell into 132 known species, 70 of which were identified from multiple subjects. About 40% of the clones were novel phylotypes. Of the 215 novel phylotypes, 75 were identified from multiple subjects. Known putative periodontal pathogens such as Porphyromonas gingivalis, Bacteroides forsythus, and Treponema denticola were identified from multiple subjects, but typically as a minor component of the plaque as seen in cultivable studies. Several phylotypes fell into two recently described phyla previously associated with extreme natural environments, for which there are no cultivable species. A number of species or phylotypes were found only in subjects with disease, and a few were found only in healthy subjects. The organisms identified only from diseased sites deserve further study as potential pathogens. Based on the sequence data in this study, the predominant subgingival microbial community consisted of 347 species or phylotypes that fall into 9 bacterial phyla. 
Based on the 347 species seen in our sample of 2,522 clones, we estimate that there are 68 additional unseen species, for a total estimate of 415 species in the subgingival plaque. When organisms found on other oral surfaces such as the cheek, tongue, and teeth are added to this number, the best estimate of the total species diversity in the oral cavity is approximately 500 species, as previously proposed.


“…0). Good and Toulmin (1956) derived a prediction formula, but their estimator lacks some theoretical properties of the prediction function (Boneh et al 1998) and may take negative values or become extremely large if m* > n; see Chao and Shen (2004) for examples.…”

Section: Analytic Approach (mentioning)

confidence: 99%
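
The instability described in this statement can be seen in a small numerical sketch. It assumes the standard alternating-series form of the Good-Toulmin estimator, U(t) = -Σ_k (-t)^k N_k, where N_k is the number of species seen k times and t is the ratio of additional to existing samples (m*/n in the notation of the quote). The function name and the frequency counts below are hypothetical and chosen only for illustration.

```python
def good_toulmin(n_k, t):
    """Good-Toulmin prediction of the number of new species in t*n further samples.

    n_k maps k -> N_k (number of species seen exactly k times).
    Uses the alternating series U(t) = -sum_k (-t)**k * N_k.
    """
    return -sum(((-t) ** k) * count for k, count in n_k.items())


# Hypothetical frequency counts: 10 singletons, 4 doubletons,
# 2 species seen three times, and 1 species seen eight times.
n_k = {1: 10, 2: 4, 3: 2, 8: 1}

print(good_toulmin(n_k, 0.5))  # 4.24609375, a plausible value for t <= 1
print(good_toulmin(n_k, 2))    # -236, negative once t > 1
print(good_toulmin(n_k, 3))    # -6513, the magnitude grows rapidly with t
```

The term for the species seen eight times is weighted by t to the eighth power, so a single abundant species dominates the sum as soon as t exceeds 1.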

Coverage‐based rarefaction and extrapolation: standardizing samples by completeness rather than size

Chao, Jost 2012. Ecology.


Abstract. We propose an integrated sampling, rarefaction, and extrapolation methodology to compare species richness of a set of communities based on samples of equal completeness (as measured by sample coverage) instead of equal size. Traditional rarefaction or extrapolation to equal-sized samples can misrepresent the relationships between the richnesses of the communities being compared because a sample of a given size may be sufficient to fully characterize the lower diversity community, but insufficient to characterize the richer community. Thus, the traditional method systematically biases the degree of differences between community richnesses. We derived a new analytic method for seamless coverage-based rarefaction and extrapolation. We show that this method yields less biased comparisons of richness between communities, and manages this with less total sampling effort. When this approach is integrated with an adaptive coverage-based stopping rule during sampling, samples may be compared directly without rarefaction, so no extra data is taken and none is thrown away. Even if this stopping rule is not used during data collection, coverage-based rarefaction throws away less data than traditional size-based rarefaction, and more efficiently finds the correct ranking of communities according to their true richnesses. Several hypothetical and real examples demonstrate these advantages.
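
To make the notion of sample coverage concrete, the sketch below uses the classical Good-Turing estimate, coverage ≈ 1 - f1/n, where f1 is the number of species seen exactly once and n is the sample size. This is a simpler stand-in chosen for illustration and is not claimed to be the estimator derived in the paper; the function name and the two toy samples are assumptions.

```python
from collections import Counter


def turing_coverage(observations):
    """Good-Turing estimate of sample coverage: 1 - (singletons / sample size)."""
    n = len(observations)
    if n == 0:
        raise ValueError("coverage is undefined for an empty sample")
    counts = Counter(observations)
    f1 = sum(1 for c in counts.values() if c == 1)  # species seen exactly once
    return 1.0 - f1 / n


# Two hypothetical samples of equal size (10 individuals each).
low_diversity = ["a"] * 6 + ["b"] * 4
high_diversity = ["a", "a", "b", "c", "d", "e", "f", "g", "h", "h"]

print(turing_coverage(low_diversity))   # 1.0: no singletons, sample looks complete
print(turing_coverage(high_diversity))  # about 0.4: most species were seen only once
```

Both samples have the same size, yet their estimated completeness differs sharply, which is the situation the abstract describes for equal-sized comparisons.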


“…As argued in ref. 33, it is often useful for species estimators to be monotone and concave in the extrapolation ratio t, which, however, need not be satisfied by linear estimators such as Good−Toulmin or SGT estimators. In SI Appendix, section 6, we propose a simple modification of the SGT estimator that is both monotone and concave, which retains the good empirical performance of the original estimator.…”

Section: Methods (mentioning)

confidence: 99%

Optimal prediction of the number of unseen species

Orlitsky, Suresh 2016. Proc. Natl. Acad. Sci. U.S.A.


Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42−58], uses n samples to predict the number U of hitherto unseen species that would be observed if t · n new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In seminal works, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(1-2):45−63] constructed an intriguing estimator that predicts U for all t ≤ 1. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435−447] proposed a modification that empirically predicts U even for some t > 1, but without provable guarantees. We derive a class of estimators that provably predict U all of the way up to t ∝ log n. We also show that this range is the best possible and that the estimator's mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron−Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.

species estimation | extrapolation model | nonparametric statistics

Species estimation is an important problem in numerous scientific disciplines. Initially used to estimate ecological diversity (1-4), it was subsequently applied to assess vocabulary size (5, 6), database attribute variation (7), and password innovation (8). Recently, it has found a number of bioscience applications, including estimation of bacterial and microbial diversity (9-12), immune receptor diversity (13), complexity of genomic sequencing (14), and unseen genetic variations (15).

All approaches to the problem incorporate a statistical model, with the most popular being the "extrapolation model" introduced by Fisher, Corbet, and Williams (16) in 1943. It assumes that n independent samples $X^n \triangleq X_1, \ldots, X_n$ were collected from an unknown distribution p, and calls for estimating U, the number of hitherto unseen symbols that would be observed if m additional samples $X_{n+1}^{n+m} \triangleq X_{n+1}, \ldots, X_{n+m}$ were collected from the same distribution.

In 1956, Good and Toulmin (17) predicted U by a fascinating estimator that has since intrigued statisticians and a broad range of scientists alike (18). For example, in the Stanford University Statistics Department brochure (19), published in the early 1990s and slightly abbreviated here, Bradley Efron credited the problem and its elegant solution with kindling his interest in statistics. As we shall soon see, Efron, along with Ronald Thisted, ...
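
The target quantity U in the extrapolation model can be illustrated by simulation: draw n samples, draw m = t · n further samples from the same distribution, and count the symbols that appear only in the second batch. The sketch below computes the true U for one simulated draw and is not any of the estimators discussed in the paper. The uniform distribution, the function names, and the parameter values are all assumptions made for illustration.

```python
import random


def unseen_in_new_samples(first, additional):
    """Count distinct symbols that occur in `additional` but not in `first`."""
    return len(set(additional) - set(first))


def simulate_u(n, t, support_size, seed=0):
    """Draw n samples, then int(t * n) more, and return the realized U.

    Assumes (for illustration only) a uniform distribution over
    `support_size` symbols labelled 0 .. support_size - 1.
    """
    rng = random.Random(seed)
    m = int(t * n)
    first = [rng.randrange(support_size) for _ in range(n)]
    additional = [rng.randrange(support_size) for _ in range(m)]
    return unseen_in_new_samples(first, additional)


# Hand-checkable case: "c" and "d" are new, "b" was already seen.
print(unseen_in_new_samples(["a", "b", "a"], ["b", "c", "d", "c"]))  # 2

# One simulated realization of U for n = 100 samples and t = 1.
print(simulate_u(n=100, t=1.0, support_size=500))
```

An estimator in this setting sees only the first n samples and must predict the value that `unseen_in_new_samples` would return once the additional batch is collected.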
