Models of English text

Teahan, William J.; Cleary, John G.

doi:10.1109/dcc.1997.581953

Cited by 21 publications

(33 citation statements)

References 15 publications


“…For encoding the vocabulary output file, standard order 1 byte-based PPM is quite effective. For the symbols output file, where the symbol numbers can get quite large for some languages, a similar technique to word-based PPM [4] works well with the alphabet size being unbounded. Another finding is that an order 4 model works best among the experimented languages.…”

Section: Preprocessing and Postprocessing (mentioning)

confidence: 99%

“…Variants of the PPM algorithm (such as PPMC and PPMD) are distinguished by the escape mechanism used to back off to lower order models when new symbols are encountered in the context. PPM has also been applied successfully to many natural language processing (NLP) applications such as cryptology, language identification, and text correction [4], [5].…”

Section: Prediction By Partial Matching (PPM) (mentioning)

confidence: 99%
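
The escape mechanism mentioned in the statement above can be illustrated with a minimal sketch. It assumes the PPMC-style estimate, in which a context that has been followed by `total` symbols, `distinct` of them different, gives a seen symbol the probability count / (total + distinct) and gives the escape the probability distinct / (total + distinct). The class name `SimplePPMC`, the helper `train`, and the parameters `max_order` and `alphabet_size` are hypothetical names chosen for illustration. The sketch omits exclusions and arithmetic coding, so it is not the implementation used in any of the cited papers.

```python
from collections import Counter, defaultdict


class SimplePPMC:
    """Toy PPM-style predictor with a PPMC-like escape (no exclusions)."""

    def __init__(self, max_order=2, alphabet_size=256):
        self.max_order = max_order
        self.alphabet_size = alphabet_size  # assumed size of the order -1 alphabet
        # counts[context] -> Counter of the symbols that followed that context
        self.counts = defaultdict(Counter)

    def update(self, history, symbol):
        """Record `symbol` after every context of order 0..max_order."""
        for order in range(self.max_order + 1):
            if order <= len(history):
                context = history[len(history) - order:]
                self.counts[context][symbol] += 1

    def probability(self, history, symbol):
        """Estimate P(symbol | history), backing off via escapes."""
        prob = 1.0
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = history[len(history) - order:]
            seen = self.counts.get(context)
            if not seen:
                continue  # context never observed: move to a shorter one
            total = sum(seen.values())
            distinct = len(seen)
            if symbol in seen:
                return prob * seen[symbol] / (total + distinct)
            # Symbol is new in this context: pay the escape probability
            prob *= distinct / (total + distinct)
        # Order -1 fallback: every symbol of the alphabet is equally likely
        return prob / self.alphabet_size


def train(model, text):
    """Feed `text` to the model one symbol at a time."""
    for i, ch in enumerate(text):
        model.update(text[:i], ch)


model = SimplePPMC(max_order=2)
train(model, "abracadabra")
p_seen = model.probability("abra", "c")   # "c" has followed the context "ra"
p_novel = model.probability("abra", "z")  # "z" forces escapes down to order -1
```

A symbol already seen in the longest context is predicted there directly, while a novel symbol accumulates one escape factor per order before reaching the uniform fallback, which is why `p_novel` is much smaller than `p_seen`.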

“…One method first described in [4] that they found effective for English text is to substitute bigraphs with a single further unique symbol (essentially expanding the alphabet). This bigraph substitution method (described in more detail in section 2), however, was only applied to English ASCII text and its effectiveness for other languages, and other encoding schemes (such as UTF-8) has not been explored previously.…”

Section: Universal Text Preprocessing For Data Compression (mentioning)

confidence: 99%
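
The bigraph substitution idea in the statement above, together with the postprocessing step that reverses it, can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure from the cited papers: the function names, the `max_entries` limit, and the choice of Unicode private-use code points (starting at U+E000) as the "further unique symbols" are all hypothetical, and the input text is assumed to contain no private-use characters of its own.

```python
from collections import Counter


def build_bigraph_table(text, max_entries=100, first_code=0xE000):
    """Map the most frequent bigraphs to single private-use symbols."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    table = {}
    for rank, (bigraph, _) in enumerate(counts.most_common(max_entries)):
        table[bigraph] = chr(first_code + rank)
    return table


def substitute(text, table):
    """Preprocessing: greedy left-to-right replacement of known bigraphs."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def restore(encoded, table):
    """Postprocessing: expand each substituted symbol back to its bigraph."""
    inverse = {symbol: bigraph for bigraph, symbol in table.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)


sample = "the theme of the thesis"
table = build_bigraph_table(sample, max_entries=5)
encoded = substitute(sample, table)
decoded = restore(encoded, table)  # equals `sample` again
```

The encoded string would then be passed to the compressor in place of the original text, and `restore` would be applied after decompression. The table itself has to be available to the decoder, for example as a fixed table or as part of the compressed output.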


Preprocessing for PPM: Compressing Utf-8 Encoded Natural Language Text

J. Teahan, M. Alhawiti

2015

IJCSIT


KEYWORDS: Preprocessing, PPM, UTF-8, Encoding.

BACKGROUND

Prediction by Partial Matching (PPM)

One of the most powerful text compression techniques is Prediction by Partial Match (PPM), which was first introduced by Cleary and Witten [1]. A series of improvements have been applied to the original PPM algorithm, such as the PPMC version by Moffat [2] and PPM* by Cleary & Teahan [3]. The PPM text compression algorithm applies a statistical approach; it simply uses the set of previous symbols to predict the upcoming symbol in the stream. Variants of the PPM algorithm (such as PPMC and PPMD) are distinguished by the escape mechanism used to back off to lower order models when new symbols are encountered in the context. PPM has also been applied successfully to many natural language processing (NLP) applications such as cryptology, language identification, and text correction [4], [5].

Abel and Teahan [6] presented several universal text preprocessing techniques that they applied prior to the application of various standard text compression algorithms. They found that in many cases the compression performance was significantly improved by applying the text preprocessing techniques. In order to recover the original file during decoding, the decompression algorithm was applied first, and then postprocessing was performed that reversed the effect of the preprocessing stage.

Universal text preprocessing for data compression


“…Therefore, we report in this paper results on the use of PPM on natural language texts as well as results on the Calgary Corpus, a standard corpus used to compare text compression algorithms. PPM has achieved excellent results in various natural language processing applications such as language identification and segmentation, text categorisation, cryptology, and optical character recognition (OCR) [7].…”

Section: Prediction By Partial Matching (mentioning)

confidence: 99%

“…Then the probability for all symbols or characters will be estimated and encoded by 1/|A|, where |A| is the size of the alphabet in the contexts. The experiments show the maximum order that usually gets good compression rates for English is five [1][8][7]. For Arabic text, the experiments show that at order seven the PPM algorithm gives a good compression rate [9].…”

Section: Prediction By Partial Matching (mentioning)

confidence: 99%