Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003
DOI: 10.1145/860435.860456
A repetition based measure for verification of text collections and for text categorization

Cited by 60 publications (32 citation statements)
References 10 publications
“…Sanderson and Guenter (2006) described the use of several sequence kernels based on character n-grams of variable length, and the best results for short English texts were achieved when examining sequences of up to 4-grams. Moreover, various Markov models of variable order have been proposed for handling character-level information (Khmelev & Teahan, 2003a; Marton et al., 2005). Finally, Zhang and Lee (2006) constructed a suffix tree representing all possible character n-grams of variable length and then extracted groups of character n-grams as features.…”
Section: Character Features
Citation type: mentioning (confidence: 99%)
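
A rough sketch of the variable-length character n-gram features described in the statement above. The function name and the max_n default are illustrative assumptions, not taken from any of the cited papers; it simply enumerates and counts all character n-grams up to length 4, the kind of feature set a sequence-kernel or suffix-tree method would operate on.

    from collections import Counter

    def char_ngrams(text, max_n=4):
        """Count all character n-grams of length 1..max_n (sketch only)."""
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        return counts

    # Example: the five most frequent variable-length n-grams of a toy string.
    profile = char_ngrams("the cat sat on the mat")
    print(profile.most_common(5))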
“…A quite particular case of using character information is that of compression-based approaches (Benedetto, Caglioti, & Loreto, 2002; Khmelev & Teahan, 2003a; Marton et al., 2005). The main idea is to use the compression model acquired from one text to compress another text, usually based on off-the-shelf compression programs.…”
Section: Character Features
Citation type: mentioning (confidence: 99%)
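
A minimal sketch of the compression-based idea summarized above, assuming zlib as a stand-in for the off-the-shelf compressors used in the cited works; the function names and the normalized-compression-distance formulation are illustrative choices, not the cited papers' exact method.

    import zlib

    def _csize(text: str) -> int:
        """Compressed size of a text in bytes (zlib at maximum compression)."""
        return len(zlib.compress(text.encode("utf-8"), 9))

    def compression_distance(a: str, b: str) -> float:
        """Normalized compression distance: texts that compress well together
        (one text's model helps compress the other) score closer to 0."""
        ca, cb, cab = _csize(a), _csize(b), _csize(a + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    # Texts sharing an author/style should tend to yield smaller distances.
    print(compression_distance("the quick brown fox", "the quick brown dog"))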
“…This may be accounted for by using a probabilistic distance measure such as K-L divergence between Markov model probability distributions of the texts (Juola 1998; Khmelev 2001; Khmelev and Tweedie 2002; Juola & Baayen 2003; Sanderson and Guenter 2006), possibly implicitly in the context of compression methods (Kukushkina et al. 2001; Benedetto et al. 2002; Khmelev and Teahan 2003; Marton et al. 2005).…”
Section: Multivariate Analysis Approach
Citation type: mentioning (confidence: 99%)
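
A hedged sketch of the K-L divergence idea mentioned above, using order-1 character Markov models with add-one smoothing; the model order, smoothing scheme, and per-state averaging of the divergence are assumptions for illustration and vary across the cited studies.

    import math
    from collections import Counter, defaultdict

    def bigram_model(text, alphabet):
        """Order-1 character Markov model P(next | prev) with add-one smoothing."""
        counts = defaultdict(Counter)
        for prev, nxt in zip(text, text[1:]):
            counts[prev][nxt] += 1
        model = {}
        for prev in alphabet:
            total = sum(counts[prev].values()) + len(alphabet)
            model[prev] = {nxt: (counts[prev][nxt] + 1) / total for nxt in alphabet}
        return model

    def markov_kl(p_text, q_text):
        """Average per-state K-L divergence between two character bigram models."""
        alphabet = sorted(set(p_text) | set(q_text))
        p, q = bigram_model(p_text, alphabet), bigram_model(q_text, alphabet)
        kl = 0.0
        for prev in alphabet:
            kl += sum(p[prev][c] * math.log(p[prev][c] / q[prev][c]) for c in alphabet)
        return kl / len(alphabet)

    print(markov_kl("abracadabra", "abracadabra"))   # 0.0 for identical texts
    print(markov_kl("abracadabra", "zzzyyyxxxwww"))  # larger for dissimilar texts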
“…it offers a large pool of texts of unquestioned authorship that cover a variety of topics and it has already been used by previous studies [24, 18, 31]. We selected the top 10 authors with respect to the amount of texts belonging to the topic class CCAT (about corporate and industrial news) to minimize the topic factor for distinguishing between texts.…”
Section: Corpus and Settings
Citation type: mentioning (confidence: 99%)