Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2004
DOI: 10.1145/1008992.1009129

Context-based methods for text categorisation

Abstract: We propose several context-based methods for text categorization. One method, a small modification to the PPM compression-based model which is known to significantly degrade compression performance, counter-intuitively has the opposite effect on categorization performance. Another method, called C-measure, simply counts the presence of higher order character contexts, and outperforms all other approaches investigated.

MOTIVATION: The purpose of this paper is to report on the performance of various context based …

citations
Cited by 6 publications
(3 citation statements)
references
References 4 publications
“…This value is derived from counts of character strings of a fixed length that are common to the documents being compared.[34] The method achieves improved results over the R-measure technique. For their part, the Italian researchers have been working on the creation of intermediate, 'artificial texts', based on typical sequences detected in the original sources by the LZ77 compression algorithm.…”
Section: C) Other Techniques: a Résumé (mentioning)
confidence: 91%
“…C-measure [36] makes use of k-mers. For a query sequence $Q_{1 \ldots m}$ it checks its $m - k + 1$ k-mers (i.e., $Q_{1 \ldots k}$, $Q_{2 \ldots k+1}$, .…”
Section: Sen = Tp / (Tp + Fn) (mentioning)
confidence: 99%
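
The k-mer counting described in the statement above can be illustrated with a minimal Python sketch. It assumes the simplest reading of the quoted descriptions: slide a window of length k over the query, and count how many of its m − k + 1 k-mers also occur in the training text. The function names (`kmers`, `c_measure`, `classify`), the default k = 5, and the pick-the-highest-count decision rule are illustrative assumptions, not details taken from the paper.

```python
def kmers(text, k):
    """Return the set of all length-k substrings (k-mers) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def c_measure(query, training, k):
    """Count the query's k-mers that also occur in the training text.

    A query of length m has m - k + 1 k-mers; each one present in the
    training text adds 1 to the score. (Assumed reading of the quoted
    description, not the authors' reference implementation.)
    """
    if len(query) < k:
        return 0
    training_kmers = kmers(training, k)
    return sum(
        1
        for i in range(len(query) - k + 1)
        if query[i:i + k] in training_kmers
    )


def classify(query, corpora, k=5):
    """Pick the label whose training text shares the most k-mers with the query.

    corpora maps a category label to its concatenated training text
    (a hypothetical layout chosen for this sketch).
    """
    return max(corpora, key=lambda label: c_measure(query, corpora[label], k))
```

For example, `c_measure("abcab", "xxabcxx", 3)` examines the three 3-mers `abc`, `bca`, and `cab`, of which only `abc` occurs in the training text.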
“…PPM uses a variable order Markov language modeling approach based on estimating probabilities of upcoming symbols on a finite prior context. An alternative to the probabilistic approach is to use simple count-based measures such as R-Measure [5] which is the normalized sum of the length of common substrings between the training and test streams, and C-Measure [6], a sum of the number of common substrings of a fixed length.…”
Section: Jscat - Text Stream Categorization System (mentioning)
confidence: 99%
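
For contrast with the fixed-length count, the R-Measure mentioned in the statement above can be sketched as follows. This is one possible reading of "normalized sum of the length of common substrings": for each position in the test text, take the length of the longest substring starting there that also occurs in the training text, sum those lengths, and divide by the largest value the sum can reach, l(l + 1)/2 for a test text of length l. The normalization constant and the helper names (`longest_prefix_match`, `r_measure`) are assumptions made for this sketch; the quoted text does not specify them.

```python
def longest_prefix_match(suffix, training):
    """Length of the longest prefix of suffix that occurs in training.

    Binary search is valid here: if a prefix of length n occurs in the
    training text, every shorter prefix occurs there too.
    """
    lo, hi = 0, len(suffix)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if suffix[:mid] in training:
            lo = mid
        else:
            hi = mid - 1
    return lo


def r_measure(test, training):
    """Normalized sum of longest-match lengths over all test positions.

    Returns a value in [0, 1]. The l * (l + 1) / 2 normalizer is an
    assumption: it is the sum obtained when the whole test text occurs
    verbatim in the training text.
    """
    l = len(test)
    if l == 0:
        return 0.0
    total = sum(longest_prefix_match(test[i:], training) for i in range(l))
    return 2.0 * total / (l * (l + 1))
```

Unlike the fixed-length count, this score rewards long shared runs: a test text copied verbatim from the training text scores 1.0, and one sharing no characters with it scores 0.0.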