KEYWORDS
Preprocessing, PPM, UTF-8, Encoding.
BACKGROUND
Prediction by Partial Matching (PPM)
One of the most powerful text compression techniques is Prediction by Partial Matching (PPM), first introduced by Cleary and Witten [1]. Several improvements have since been made to the original algorithm, such as the PPMC variant by Moffat [2] and PPM* by Cleary and Teahan [3]. PPM takes a statistical approach: it uses the preceding symbols in the stream as a context to predict the upcoming symbol. Variants of the algorithm (such as PPMC and PPMD) are distinguished by the escape mechanism used to back off to lower-order models when a symbol is encountered that has not been seen before in the current context. PPM has also been applied successfully to many natural language processing (NLP) applications, such as cryptology, language identification, and text correction [4], [5].
Abel and Teahan [6] presented several universal text preprocessing techniques that they applied before running various standard text compression algorithms. They found that in many cases the preprocessing significantly improved compression performance. To recover the original file during decoding, the decompression algorithm was applied first, and postprocessing then reversed the effect of the preprocessing stage.
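To make the escape mechanism described above concrete, the following minimal Python sketch estimates the probability a PPM-style model assigns to a symbol by backing off from the longest context to shorter ones. It assumes the PPMC escape estimate (escape probability d/(n+d), where n is the number of symbols seen in the context and d the number of distinct ones), a static model trained in one pass, and a 256-symbol alphabet for the final order -1 fallback. It omits exclusions, update exclusion, and the arithmetic coder that a full implementation would use. The function names `train` and `ppmc_probability` are illustrative only and do not come from the cited papers.

```python
from collections import Counter, defaultdict


def train(text, max_order):
    """Count, for every context of length 0..max_order, the symbols that followed it."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(text):
        for order in range(max_order + 1):
            if i - order < 0:
                break  # not enough preceding symbols for this order
            context = text[i - order:i]  # the empty string is the order-0 context
            counts[context][symbol] += 1
    return counts


def ppmc_probability(counts, history, symbol, max_order, alphabet_size=256):
    """Probability of `symbol` after `history`, backing off with PPMC-style escapes."""
    probability = 1.0
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]
        seen = counts.get(context)
        if not seen:
            continue  # context never observed: move to a shorter one at no cost
        total = sum(seen.values())   # n: symbols observed in this context
        distinct = len(seen)         # d: distinct symbols observed in this context
        if symbol in seen:
            return probability * seen[symbol] / (total + distinct)
        # Symbol is novel in this context: pay the escape probability and back off.
        probability *= distinct / (total + distinct)
    # Order -1: every symbol of the (assumed) alphabet is equally likely.
    return probability / alphabet_size


counts = train("abracadabra", max_order=2)
print(ppmc_probability(counts, "br", "a", max_order=2))  # predicted in the order-2 context
print(ppmc_probability(counts, "br", "c", max_order=2))  # needs two escapes, found at order 0
```

In this toy model, "a" after the context "br" is predicted directly at order 2 with probability 2/3, whereas "c" has never followed "br" or "r", so the model escapes twice before finding it at order 0, giving (1/3)(1/3)(1/16) = 1/144. A symbol that needs more escapes receives a lower probability and therefore costs more bits to encode.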
Universal text preprocessing for data compression