Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42–58], uses n samples to predict the number U of hitherto unseen species that would be observed if t · n new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In a seminal work, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(1-2):45–63] constructed an intriguing estimator that predicts U for all t ≤ 1. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435–447] proposed a modification that empirically predicts U even for some t > 1, but without provable guarantees. We derive a class of estimators that provably predict U all the way up to t ∝ log n. We also show that this range is the best possible and that the estimators' mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron–Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.

species estimation | extrapolation model | nonparametric statistics

Species estimation is an important problem in numerous scientific disciplines. Initially used to estimate ecological diversity (1-4), it was subsequently applied to assess vocabulary size (5, 6), database attribute variation (7), and password innovation (8).
Recently, it has found a number of bioscience applications, including estimation of bacterial and microbial diversity (9-12), immune receptor diversity (13), complexity of genomic sequencing (14), and unseen genetic variations (15).

All approaches to the problem incorporate a statistical model, with the most popular being the "extrapolation model" introduced by Fisher, Corbet, and Williams (16) in 1943. It assumes that n independent samples $X^n \triangleq X_1, \ldots, X_n$ were collected from an unknown distribution p, and calls for estimating $U \triangleq |\{X_{n+1}^{n+m}\} \setminus \{X^n\}|$, the number of hitherto unseen symbols that would be observed if m additional samples $X_{n+1}^{n+m} \triangleq X_{n+1}, \ldots, X_{n+m}$ were collected from the same distribution.

In 1956, Good and Toulmin (17) predicted U by a fascinating estimator that has since intrigued statisticians and a broad range of scientists alike (18). For example, in the Stanford University Statistics Department brochure (19), published in the early 1990s and slightly abbreviated here, Bradley Efron credited the problem and its elegant solution with kindling his interest in statistics. As we shall soon see, Efron, along with Ronald Thisted, ...
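
To make the extrapolation model described above concrete, the following minimal Python sketch computes U directly from an existing sample and a new sample, and then predicts it from the existing sample alone. The function names `unseen_count` and `good_toulmin` are hypothetical, chosen for illustration. The prediction uses the commonly cited form of the Good–Toulmin estimator, $-\sum_{i \ge 1} (-t)^i \Phi_i$, where $\Phi_i$ is the number of symbols seen exactly i times and t = m/n; that formula is not stated in the text above and is supplied here as an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def unseen_count(old: Sequence[Hashable], new: Sequence[Hashable]) -> int:
    """U: number of distinct symbols in `new` that never appear in `old`."""
    return len(set(new) - set(old))


def good_toulmin(old: Sequence[Hashable], t: float) -> float:
    """Predict U for m = t * n new samples using only the n old samples.

    Assumed form (not given in the text): -sum_{i>=1} (-t)^i * phi_i,
    where phi_i is the number of symbols seen exactly i times in `old`.
    """
    multiplicities = Counter(old)                   # symbol -> times seen
    prevalences = Counter(multiplicities.values())  # i -> phi_i
    return -sum((-t) ** i * phi for i, phi in prevalences.items())


if __name__ == "__main__":
    old = ["a", "b", "a", "c"]   # n = 4 existing samples
    new = ["a", "d", "e", "d"]   # m = 4 new samples, so t = 1
    print("true U:", unseen_count(old, new))
    print("predicted U:", good_toulmin(old, t=len(new) / len(old)))
```

On such a tiny sample the prediction is only illustrative. The alternating powers of t in the sum also hint at why the unmodified estimator is used for t ≤ 1: for larger t the high-order terms can grow rapidly.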