Abstract-The description of a novel type of m-gram language model is given. The model offers, via a nonlinear recursive procedure, a computation- and space-efficient solution to the problem of estimating probabilities from sparse data. This solution compares favorably to other proposed methods. While the method has been developed for and successfully implemented in the IBM Real Time Speech Recognizers, its generality makes it applicable in other areas where the problem of estimating probabilities from sparse data arises.

Sparseness of data is an inherent property of any real text, and it is a problem one always encounters while collecting frequency statistics on words and word sequences (m-grams) from a text of finite size. This means that even for a very large data collection, the maximum likelihood estimation method does not allow us to adequately estimate probabilities of rare but nevertheless possible word sequences: many sequences occur only once ("singletons"); many more do not occur at all. The inadequacy of the maximum likelihood estimator and the necessity of estimating the probabilities of m-grams which did not occur in the text constitute the essence of the problem.

The main idea of the proposed solution is to reduce the unreliable probability estimates given by the observed frequencies and to redistribute the "freed" probability "mass" among m-grams which never occurred in the text. The reduction is achieved by replacing maximum likelihood estimates for m-grams having low counts with renormalized Turing's estimates [1], and the redistribution is done via recursive utilization of lower-level conditional distributions. We found Turing's method attractive because of its simplicity and its characterization as the optimal empirical Bayes estimator of a multinomial probability. Robbins [2] introduces the empirical Bayes methodology, and Nadas [3] gives various derivations of Turing's formula.

Let $N$ be the sample text size and let $n_r$ be the number of words (m-grams) which occurred in the text exactly $r$ times, so that

$$N = \sum_r r\,n_r. \qquad (1)$$

Turing's estimate $P_T$ for the probability of a word (m-gram) which occurred in the sample $r$ times is

$$P_T = \frac{r^*}{N}, \qquad \text{where} \qquad r^* = (r+1)\,\frac{n_{r+1}}{n_r}.$$

We call the procedure of replacing a count $r$ with a modified count $r'$ "discounting" and the ratio $r'/r$ a discount coefficient $d_r$. When $r' = r^*$, we have Turing's discounting.

Let us denote the m-gram $w_1, \ldots, w_m$ as $w_1^m$ and the number of times it occurred in the sample text as $c(w_1^m)$. Then the maximum likelihood estimate is
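To make the mechanics of Turing's estimate and the discount coefficients concrete, the following is a minimal Python sketch, not part of the paper: it builds the count-of-counts table $n_r$ from a toy set of bigram counts and evaluates $P_T = r^*/N$ and $d_r = r^*/r$. The helper names (`count_of_counts`, `turing_estimate`), the toy counts, and the fallback to the maximum likelihood estimate when $n_r$ or $n_{r+1}$ is zero are illustrative assumptions, not the paper's procedure.

```python
from collections import Counter

def count_of_counts(counts):
    """Return the table n_r: how many distinct m-grams occurred exactly r times."""
    return Counter(counts.values())

def turing_estimate(r, n, N):
    """Turing's estimate P_T = r*/N for an m-gram observed r times,
    with the adjusted count r* = (r + 1) * n_{r+1} / n_r.

    Falls back to the maximum likelihood estimate r/N when n_r or n_{r+1}
    is zero (an assumption made for this sketch only; the paper applies
    the discount to low counts and treats the remaining cases separately).
    """
    if n.get(r, 0) == 0 or n.get(r + 1, 0) == 0:
        return r / N
    r_star = (r + 1) * n[r + 1] / n[r]
    return r_star / N

# Toy bigram counts standing in for c(w_1^m) collected from a sample text.
counts = {("the", "cat"): 3, ("a", "dog"): 2,
          ("the", "dog"): 1, ("a", "cat"): 1, ("the", "mouse"): 1}
N = sum(counts.values())        # sample size; equals sum_r r * n_r, Eq. (1)
n = count_of_counts(counts)     # here n_1 = 3, n_2 = 1, n_3 = 1

for r in sorted(set(counts.values())):
    p_t = turing_estimate(r, n, N)
    d_r = p_t * N / r           # discount coefficient d_r = r*/r
    print(f"r = {r}:  P_T = {p_t:.3f}  d_r = {d_r:.2f}")
```

With counts this small the resulting coefficients are not meaningful in themselves; the example only shows how $n_r$, $r^*$, $P_T$, and $d_r$ are related.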