We propose several context-based methods for text categorization. One method, a small modification to the PPM compression-based model that is known to significantly degrade compression performance, counter-intuitively has the opposite effect on categorization performance. Another method, called the C-measure, simply counts the presence of higher-order character contexts, and outperforms all other approaches investigated.
MOTIVATION

The purpose of this paper is to report on the performance of various context-based methods for text categorization. This work is part of a larger project incorporating biometrics for authentication, user modelling/profiling, and user interface optimisations. Hence we seek a general classifier that is able to cope well with a variety of different data streams, including user interactions, mouse movements, and textual input. An important component of authentication is authorship attribution, a multi-class text categorization problem, so the work described here draws upon and extends the work described in [2].

The traditional machine learning approach to text categorization is either word- or feature-based [1]. However, we expect to encounter data streams very different from the textual data encountered in standard text categorization tasks. In these situations, the notion of a word may simply be meaningless, and the definition of what constitutes a useful feature is both non-obvious and problematic. A character-based approach, on the other hand, neatly sidesteps these issues by treating all information as important, and the standard Markov assumption of restricting to fixed-order contexts can be applied to reduce the computational overheads and the statistical inaccuracies that arise from insufficient training data (a minimal sketch of such fixed-order contexts follows at the end of this section). Results published in [2] also strongly indicate that a character-based approach may in fact be superior to the more traditional approach for authorship attribution tasks.

In the next sections, we describe two context-based methods: one based on relative entropy, the other a simple context-counting method. Experimental results are subsequently presented, with discussion in the final section.
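To make the character-based view concrete, the following minimal Python sketch (our illustration, not part of the original paper; the function name and toy input are ours) extracts fixed-order character contexts from an arbitrary stream. No tokenizer or hand-crafted feature set is required, which is why the same machinery applies to non-textual streams.

```python
from collections import Counter

def char_contexts(text, order):
    """Count (context, next-symbol) pairs for a fixed-order Markov view."""
    counts = Counter()
    for i in range(order, len(text)):
        context = text[i - order:i]  # the `order` preceding characters
        counts[(context, text[i])] += 1
    return counts

# Works unchanged on prose, serialized mouse traces, key-press logs, etc.
print(char_contexts("the theme", 2).most_common(3))
```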
PPM-BASED LANGUAGE MODELS

The main idea behind using relative entropy for classification is that it is well grounded in information theory. To guess the correct class of a document, one uses the formula

$$\hat{M} = \arg\min_{M} H(T \mid M)$$

where H(T | M) is some approximation to the relative entropy of text T with respect to the model M. Compression models encode a document according to a given model, and can provide an upper estimate of the relative entropy. With PPM, a well-performing text compression scheme, individual symbols are encoded using the context provided by the immediately preceding symbols [3]:

$$H(T \mid M) = -\sum_{i=1}^{n} \log p(x_i \mid \mathrm{context}_i)$$

where T = x1 x2 . . . xn and contexti = x1, x2, . . . , xi−1. In practice, PPM uses a Markov approximation and assumes a fixed-order context. PPM uses an approximate blending technique called exclusions, based on the escape mechanism, which backs off to lower-order contexts when a symbol has not been seen in the current context.
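As a concrete illustration of the argmin formula above, the following Python sketch (ours, not the paper's code) approximates H(T | M) with a fixed-order character model and classifies a document by choosing the class model giving the shortest code length. The smoothing here is a simplified PPM method A escape chain without exclusions; a full PPM implementation would be more involved. All names and the toy training data are hypothetical.

```python
import math

UNIFORM = 1.0 / 256  # order "-1" fallback: uniform over a byte-sized alphabet

class CharModel:
    """Fixed-order character model with a simplified PPMA-style escape chain
    (no exclusions), used to approximate H(T | M) as a total code length."""

    def __init__(self, order=2):
        self.order = order
        self.counts = {}  # context string -> {symbol: count}

    def train(self, text):
        for i in range(len(text)):
            for k in range(self.order + 1):  # record orders 0..order
                if i - k < 0:
                    break
                ctx = text[i - k:i]
                stats = self.counts.setdefault(ctx, {})
                stats[text[i]] = stats.get(text[i], 0) + 1

    def prob(self, ctx, sym):
        # Walk from the longest available context down to the empty one,
        # paying an escape probability of 1/(n+1) at each seen context
        # (PPM method A); an unseen context escapes for free.
        p_escape = 1.0
        for k in range(min(self.order, len(ctx)), -1, -1):
            c = ctx[len(ctx) - k:]
            stats = self.counts.get(c)
            if stats:
                total = sum(stats.values())
                if sym in stats:
                    return p_escape * stats[sym] / (total + 1)
                p_escape *= 1.0 / (total + 1)
        return p_escape * UNIFORM  # nothing matched: uniform fallback

    def code_length(self, text):
        # Approximate H(T | M): negative log-probability of the text, in bits.
        bits = 0.0
        for i in range(len(text)):
            ctx = text[max(0, i - self.order):i]
            bits += -math.log2(self.prob(ctx, text[i]))
        return bits

def classify(doc, models):
    """argmin over class models of the approximated H(T | M)."""
    return min(models, key=lambda label: models[label].code_length(doc))

# Toy usage: one model per class, trained on tiny hypothetical samples.
train_data = {
    "english": "the quick brown fox jumps over the lazy dog " * 20,
    "pseudo":  "zq xj vrk wpl qzj krv " * 20,
}
models = {}
for label, sample in train_data.items():
    m = CharModel(order=2)
    m.train(sample)
    models[label] = m

print(classify("the lazy brown dog", models))  # expected: 'english'
```

The design point the sketch makes explicit is that compression-style models need no feature engineering: training is just counting contexts, and classification is just comparing code lengths.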