“…For example, Essen & Steinbiss (1992) report that in a 75%-25% split of the million-word LOB corpus, 12% of the bigrams in the test partition did not occur in the training portion. For trigrams, the sparse data problem is even more severe: for instance, researchers at IBM (Brown, DellaPietra, deSouza, Lai, & Mercer, 1992) examined a training corpus consisting of almost 366 million English words, and discovered that one can expect 14.7% of the word triples in any new English text to be absent from the training sample.…”
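The kind of measurement quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in either cited study: the function names `ngrams` and `unseen_rate`, the whitespace tokenization, and the toy sentence are all assumptions made for the example. It also assumes the rate is counted over n-gram occurrences (tokens) in the test portion, which the excerpt does not specify.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_rate(train: Sequence[str], test: Sequence[str], n: int) -> float:
    """Fraction of n-gram occurrences in `test` that never appear in `train`.

    Assumption: occurrences (tokens) are counted, not distinct types.
    """
    seen = set(ngrams(train, n))
    test_ngrams = ngrams(test, n)
    if not test_ngrams:
        return 0.0  # test text too short to contain any n-gram
    unseen = sum(1 for gram in test_ngrams if gram not in seen)
    return unseen / len(test_ngrams)


if __name__ == "__main__":
    # Toy corpus (hypothetical); a real study would use a large corpus.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    cut = int(0.75 * len(corpus))  # 75%-25% split, as in the first example
    train, test = corpus[:cut], corpus[cut:]
    print("unseen bigram rate: ", unseen_rate(train, test, 2))
    print("unseen trigram rate:", unseen_rate(train, test, 3))
```

On this toy split the trigram rate is higher than the bigram rate, which mirrors the qualitative point of the passage: longer n-grams are more likely to be missing from the training data.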