2003
DOI: 10.1093/llc/18.4.423

Ngram and Bayesian Classification of Documents for Topic and Authorship

Abstract: Large, real-world data sets have been investigated in the context of authorship attribution. Ngram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) × 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both ngram and naive Bayes classifiers were used to classify along both the authorship and topic (movie) axes. Both approaches yielded similar results, and authorship was as accurately detected,…
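The abstract's setup — n-gram statistics feeding a naive Bayes classifier, applied along both the authorship and the topic axis — can be sketched as a minimal multinomial naive Bayes over character bigrams. This is an illustrative reconstruction, not the paper's implementation: the class and helper names (`NaiveBayes`, `ngrams`) and the add-one smoothing are assumptions.

```python
from collections import Counter, defaultdict
import math

def ngrams(text, n=2):
    """Character n-grams of a string (the paper's exact features are not reproduced)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing over n-gram counts."""

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram frequency table
        self.priors = Counter(labels)       # label -> number of training docs
        for doc, label in zip(docs, labels):
            self.counts[label].update(ngrams(doc))
        self.vocab = set()
        for c in self.counts.values():
            self.vocab |= set(c)
        return self

    def predict(self, doc):
        def log_posterior(label):
            c = self.counts[label]
            total = sum(c.values())
            lp = math.log(self.priors[label] / sum(self.priors.values()))
            for g in ngrams(doc):
                # Add-one smoothing keeps unseen n-grams from zeroing the product.
                lp += math.log((c[g] + 1) / (total + len(self.vocab)))
            return lp
        return max(self.counts, key=log_posterior)
```

The same classifier can be trained with author names as labels (authorship axis) or movie titles as labels (topic axis); only the label column changes.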

Cited by 52 publications (26 citation statements)
References 28 publications
“…Chaski (2001) described a writing sample database comprising texts of 92 people on 10 common subjects (e.g., a letter of apology to your best friend, a letter to your insurance company, etc.). Clement and Sharp (2003) reported a corpus of movie reviews comprising 5 authors who review the same 5 movies. Another corpus comprising various genres was described by Baayen, van Halteren, Neijt, and Tweedie (2002) and was also used by Baayen (2005) and van Halteren (2007).…”
Section: Discussion
confidence: 99%
“…Several features described in Section 2 are claimed to capture only stylistic information (e.g., function words). However, the application of stylometric features to topic-identification tasks has revealed the potential of these features to indicate content information as well (Clement & Sharp, 2003; Mikros & Argiri, 2007). It seems that low-level features like character n-grams are very successful for representing texts for stylistic purposes (Keselj et al., 2003; Stamatatos, 2006b; Grieve, 2007).…”
Section: Discussion
confidence: 99%
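The character n-gram approach cited above (Keselj et al., 2003) attributes a text to the author whose n-gram profile is least dissimilar. A minimal sketch follows; the function names (`profile`, `dissimilarity`, `attribute`), the profile size, and the use of trigrams are illustrative assumptions, not the cited implementation.

```python
from collections import Counter

def profile(text, n=3, size=100):
    """Relative frequencies of the `size` most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}

def dissimilarity(p1, p2):
    """Relative-distance measure in the spirit of Keselj et al. (2003)."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        # Normalize each squared difference by the mean frequency,
        # so rare and common n-grams contribute on a comparable scale.
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d

def attribute(text, author_profiles, n=3):
    """Assign the text to the author with the nearest profile."""
    p = profile(text, n)
    return min(author_profiles, key=lambda a: dissimilarity(p, author_profiles[a]))
```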
“…Comprehensiveness is measured as the length of post (Lu et al 2010). Sentiments expressed in the posts are classified as positive or negative using a Bayesian classifier (Clement and Sharp 2003) that is trained over word n-grams of order 5. The order was selected using cross-validation over a sentiment classification dataset (Pang and Lee 2004).…”
Section: Reading Set
confidence: 99%
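The order-selection step described in the quote above — choosing n-gram order 5 by cross-validation — amounts to plain k-fold cross-validation over candidate orders. The helper below is a generic sketch under that assumption; `k_fold_accuracy` and its callback signatures are hypothetical names, not the cited system's API.

```python
import random

def k_fold_accuracy(docs, labels, train_fn, predict_fn, k=5, seed=0):
    """Mean held-out accuracy over k folds.

    train_fn(train_docs, train_labels) -> model
    predict_fn(model, doc) -> predicted label
    """
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]     # round-robin split into k folds
    accuracies = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = train_fn([docs[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, docs[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

Selecting the n-gram order is then a loop over candidate orders (e.g., 1 through 5), keeping the order whose classifier scores highest under this function.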
“…What is the relative degree to which these linguistic and non-linguistic characteristics represent topicality and predict the utility of a document to a system user? Term roots may carry one or more meanings or topics, and the addition of contextual or supporting information, such as suffixes, part-of-speech tags, and larger contexts, can contribute to the topicality, therefore improving document ordering (Bossong, 1989; Clement & Sharp, 2003; Losee, 2001). How does one measure how many of one feature or type of feature is equivalent in ordering power to another chosen feature or type of feature? The Relative Feature Utility may be used to empirically analyze the ordering effects of term stemming, the length of natural language phrases, the effect of using different part-of-speech labels, and various information retrieval or filtering assumptions.…”
Section: Introduction
confidence: 99%