2004
DOI: 10.1023/b:inrt.0000011209.19643.e2

Augmenting Naive Bayes Classifiers with Statistical Language Models

Abstract: We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations; a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers…

* Most research was conducted while the authors were at the School of Computer Science, University of Waterloo, Canada.
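To make the chain-augmented idea concrete, here is a minimal sketch: each class trains its own n-gram language model over its training documents, and a new document is assigned to the class maximizing log prior plus log likelihood under that class's model. The bigram order, add-alpha smoothing, and all names here are illustrative assumptions, not the paper's exact configuration.

import math
from collections import defaultdict

class BigramClassConditionalLM:
    def __init__(self, alpha=0.5):
        self.alpha = alpha                 # add-alpha smoothing constant (assumed)
        self.bigram = defaultdict(int)     # (prev, word) counts
        self.unigram = defaultdict(int)    # prev-word counts
        self.vocab = set()

    def train(self, docs):
        for doc in docs:
            tokens = ["<s>"] + doc.split()
            for prev, word in zip(tokens, tokens[1:]):
                self.bigram[(prev, word)] += 1
                self.unigram[prev] += 1
                self.vocab.update((prev, word))

    def log_prob(self, doc):
        tokens = ["<s>"] + doc.split()
        v = len(self.vocab) + 1
        lp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            # smoothed conditional probability P(word | prev)
            num = self.bigram[(prev, word)] + self.alpha
            den = self.unigram[prev] + self.alpha * v
            lp += math.log(num / den)
        return lp

def train_can(labeled_docs):
    """labeled_docs: list of (text, label). Returns per-class models and log priors."""
    by_class = defaultdict(list)
    for text, label in labeled_docs:
        by_class[label].append(text)
    models, priors = {}, {}
    total = len(labeled_docs)
    for label, docs in by_class.items():
        lm = BigramClassConditionalLM()
        lm.train(docs)
        models[label] = lm
        priors[label] = math.log(len(docs) / total)
    return models, priors

def classify(text, models, priors):
    # argmax over classes of log P(class) + log P(text | class)
    return max(models, key=lambda c: priors[c] + models[c].log_prob(text))

The local Markov dependence enters through the conditioning on the previous token in log_prob; dropping that conditioning recovers the standard (unigram) naive Bayes decision rule.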

Cited by 211 publications (156 citation statements)
References: 43 publications
“…More recently, Graham et al (2005) and Zheng et al (2006) used neural networks on a wide variety of features. Other studies used k-nearest neighbor (Kjell et al 1995; Hoorn et al 1999; Zhao & Zobel 2005), Naive Bayes (Kjell 1994a; Hoorn et al 1999; Peng et al 2004), rule learners (Holmes & Forsyth 1995; Holmes 1998; Argamon et al 1998; Koppel & Schler 2003; Abbasi & Chen 2005; Zheng et al 2006), support vector machines (De Vel et al 2001; Diederich et al 2003; Koppel & Schler 2003; Abbasi & Chen 2005; Koppel et al 2005; Zheng et al 2006), Winnow (Koppel et al 2002; Argamon et al 2003; Koppel et al 2006a), and Bayesian regression (Madigan et al 2006; Argamon et al 2008). Further details regarding these studies can be found in the Appendix.…”
Section: Machine Learning Approach
confidence: 99%
“…An alternative way to automatically define the function word set is to extract the most frequent words in a corpus [24,29]. There are also attempts to use word n-grams to exploit contextual information [27,7]. However, this process considerably increases the dimensionality of the problem and has not produced encouraging results so far.…”
Section: Previous Work
confidence: 99%
“…Such powerful machine learning algorithms can effectively cope with high dimensional and sparse data. Another approach is to apply a generative model, like a naïve Bayes model [27]. Yet another approach is to estimate the similarity between two texts [4,17].…”
Section: Previous Work
confidence: 99%
“…Researchers [2,14,15] have previously used language models for document classification and such an approach was essentially Bayesian. We too adopt a Bayesian approach but, in common with most IR applications, apply models that are unigram in that they consider each term independently and do not take account of the preceding tokens.…”
Section: Document Generation
confidence: 99%
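For contrast with the chain-augmented sketch above, the unigram Bayesian view described in this last citing passage scores each term independently of the preceding tokens. A minimal sketch, with hypothetical class counts and add-alpha smoothing chosen for illustration:

import math
from collections import Counter

def unigram_log_prob(doc, class_counts, alpha=1.0):
    """Add-alpha smoothed log P(doc | class) under a unigram model."""
    total = sum(class_counts.values())
    vocab_size = len(class_counts) + 1
    lp = 0.0
    for word in doc.split():
        lp += math.log((class_counts[word] + alpha) /
                       (total + alpha * vocab_size))
    return lp

# hypothetical per-class term counts built from training documents
spam_counts = Counter("buy now cheap offer buy".split())
ham_counts = Counter("meeting agenda project notes".split())

doc = "cheap offer now"
pred = max([("spam", spam_counts), ("ham", ham_counts)],
           key=lambda kv: unigram_log_prob(doc, kv[1]))[0]

Because each term is scored on its own, word order carries no weight here; that is exactly the independence assumption the CAN model relaxes by conditioning on preceding tokens.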