We propose several context-based methods for text categorization. One method, a small modification to the PPM compression-based model that is known to significantly degrade compression performance, counter-intuitively has the opposite effect on categorization performance. Another method, called the C-measure, simply counts the presence of higher-order character contexts, and outperforms all other approaches investigated.
MOTIVATION

The purpose of this paper is to report on the performance of various context-based methods for text categorization. This work is part of a larger project incorporating biometrics for authentication, user modelling/profiling, and user interface optimisations. Hence we seek a general classifier that is able to cope well with a variety of different data streams, including user interactions, mouse movements, and textual input. An important component of authentication is authorship attribution, a multi-class text categorization problem, so the work described here draws upon and extends the work described in [2].

The traditional machine learning approach to text categorization is either word- or feature-based [1]. However, we expect to encounter data streams very different from the textual data encountered in standard text categorization tasks. In these situations, the notion of a word may simply be meaningless, and the definition of what constitutes a useful feature is both non-obvious and problematic. A character-based approach, on the other hand, neatly sidesteps these issues by treating all information as important, and the standard Markov assumption of restricting to fixed-order contexts can be applied to reduce the computational overheads and the statistical inaccuracies that arise from insufficient training data (a minimal sketch of such fixed-order contexts follows at the end of this section). Results published in [2] also strongly indicate that a character-based approach may in fact be superior to the more traditional approach for authorship attribution tasks.

In the next sections, we describe two context-based methods: one based on relative entropy, the other a simple context-counting method. Experimental results are subsequently presented, with discussion in the final section.
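To make the character-based view concrete, the following minimal Python sketch (our illustration, not part of the original paper; the function name and toy input are ours) extracts fixed-order character contexts from an arbitrary stream. No tokenizer or hand-crafted feature set is required, which is why the same machinery applies to non-textual streams.

```python
from collections import Counter

def char_contexts(text, order):
    """Count (context, next-symbol) pairs for a fixed-order Markov view."""
    counts = Counter()
    for i in range(order, len(text)):
        context = text[i - order:i]  # the `order` preceding characters
        counts[(context, text[i])] += 1
    return counts

# Works unchanged on prose, serialized mouse traces, key-press logs, etc.
print(char_contexts("the theme", 2).most_common(3))
```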
PPM-BASED LANGUAGE MODELS

The main idea behind using relative entropy for classification is that it is well grounded in information theory. To guess the correct class of a document, one uses the formula

$$\hat{M} = \arg\min_{M} H(T \mid M)$$

where H(T | M) is some approximation to the relative entropy of text T with respect to the model M. Compression models encode a document according to a given model, and can provide an upper estimate of the relative entropy. With PPM, a well-performing text compression scheme, individual symbols are encoded using the context provided by the immediately preceding symbols [3]:

$$H(T \mid M) = -\sum_{i=1}^{n} \log p(x_i \mid \mathrm{context}_i)$$

where T = x1 x2 . . . xn and contexti = x1, x2, . . . , xi−1. In practice, PPM uses a Markov approximation and assumes a fixed-order context. PPM uses an approximate blending technique called exclusions, based on the escape mechanism, which backs off to lower-order contexts when a symbol has not been seen in the current context.
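As a concrete illustration of the argmin formula above, the following Python sketch (ours, not the paper's code) approximates H(T | M) with a fixed-order character model and classifies a document by choosing the class model giving the shortest code length. The smoothing here is a simplified PPM method A escape chain without exclusions; a full PPM implementation would be more involved. All names and the toy training data are hypothetical.

```python
import math

UNIFORM = 1.0 / 256  # order "-1" fallback: uniform over a byte-sized alphabet

class CharModel:
    """Fixed-order character model with a simplified PPMA-style escape chain
    (no exclusions), used to approximate H(T | M) as a total code length."""

    def __init__(self, order=2):
        self.order = order
        self.counts = {}  # context string -> {symbol: count}

    def train(self, text):
        for i in range(len(text)):
            for k in range(self.order + 1):  # record orders 0..order
                if i - k < 0:
                    break
                ctx = text[i - k:i]
                stats = self.counts.setdefault(ctx, {})
                stats[text[i]] = stats.get(text[i], 0) + 1

    def prob(self, ctx, sym):
        # Walk from the longest available context down to the empty one,
        # paying an escape probability of 1/(n+1) at each seen context
        # (PPM method A); an unseen context escapes for free.
        p_escape = 1.0
        for k in range(min(self.order, len(ctx)), -1, -1):
            c = ctx[len(ctx) - k:]
            stats = self.counts.get(c)
            if stats:
                total = sum(stats.values())
                if sym in stats:
                    return p_escape * stats[sym] / (total + 1)
                p_escape *= 1.0 / (total + 1)
        return p_escape * UNIFORM  # nothing matched: uniform fallback

    def code_length(self, text):
        # Approximate H(T | M): negative log-probability of the text, in bits.
        bits = 0.0
        for i in range(len(text)):
            ctx = text[max(0, i - self.order):i]
            bits += -math.log2(self.prob(ctx, text[i]))
        return bits

def classify(doc, models):
    """argmin over class models of the approximated H(T | M)."""
    return min(models, key=lambda label: models[label].code_length(doc))

# Toy usage: one model per class, trained on tiny hypothetical samples.
train_data = {
    "english": "the quick brown fox jumps over the lazy dog " * 20,
    "pseudo":  "zq xj vrk wpl qzj krv " * 20,
}
models = {}
for label, sample in train_data.items():
    m = CharModel(order=2)
    m.train(sample)
    models[label] = m

print(classify("the lazy brown dog", models))  # expected: 'english'
```

The design point the sketch makes explicit is that compression-style models need no feature engineering: training is just counting contexts, and classification is just comparing code lengths.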