2013
DOI: 10.18637/jss.v052.i06

The textcat Package for n-Gram Based Text Categorization in R

Abstract: Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gr…
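As a rough illustration of the Cavnar and Trenkle (1994) idea named in the abstract, the sketch below builds a ranked character n-gram profile for a text and compares two profiles with an "out-of-place" rank distance. It is written in plain Python and is not the textcat implementation: the function names, the underscore padding, the n-gram range 1 to 5, and the profile size of 400 are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=400):
    """Map each of the `top` most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (an assumption of this sketch)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by decreasing count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed maximum penalty."""
    max_penalty = len(cat_profile)
    return sum(
        abs(rank - cat_profile[gram]) if gram in cat_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def categorize(text, category_profiles):
    """Return the category whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(category_profiles, key=lambda c: out_of_place(doc, category_profiles[c]))


# Toy training data (hypothetical; real profiles are built from much larger corpora).
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt über den faulen hund",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
```

With profiles this small the classifier is only reliable on text close to the training samples; the point is the shape of the computation, not its accuracy.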

Cited by 50 publications (29 citation statements)
References 7 publications
“…The information theory and statistics literature contain numerous measures to compare frequency distributions (see Jurafsky & Martin, 2009). In previous unpublished work, we found that using the Kullback–Leibler divergence (Kullback & Leibler, 1951) as the comparison metric outperforms the “out-of-place” distance metric of Cavnar and Trenkle and the Kullback-Leibler metric is applied using textcat (Hornik, Rauch, Buchta, & Feinerer, 2012). Although this metric is not strictly speaking a “distance” (for instance Kullback–Leibler is not symmetric), in this context the metrics behave intuitively like distances, so we refer to the output of the metric as a distance.…”
Section: Methods | mentioning
confidence: 99%
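The asymmetry noted in the statement above can be checked on a toy example. The sketch below is a generic Kullback–Leibler computation over two small frequency distributions, not the code used by the citing authors or by textcat; the function name and the assumption that every n-gram has strictly positive probability in the second distribution are choices made for this illustration (the quote does not say how zero frequencies were handled).

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)


# Two hypothetical relative-frequency distributions over the same n-grams.
p = {"th": 0.9, "sch": 0.1}
q = {"th": 0.5, "sch": 0.5}

forward = kl_divergence(p, q)   # how poorly q describes p
backward = kl_divergence(q, p)  # how poorly p describes q
print(forward, backward)        # the two values differ, so KL is not a true distance
```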
“…This task demonstrates how our approach performs on real event based sequences (non-time series) rather than artificially generated data. The outcomes are compared with the text mining algorithm "TextCat" (Hornik et al., 2013). …”
Section: Figure 4: the Procedures Of Event Group Based Ts Classificati… | mentioning
confidence: 99%
“…98% of all treaty texts in the original dataset were found. These texts underwent some cleaning, and then the textcat package in R was used to construct a matrix of Jensen-Shannon divergences between the n-gram frequency distributions of all MFA texts (Hornik et al., 2013). Jensen-Shannon was chosen over competitors for its twin advantages of being symmetric and finite (ranging between 0 and 1).…”
Section: Mfa Content Similarity Network -Bb Network | mentioning
confidence: 99%
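The two properties cited in the statement above, symmetry and a finite range of 0 to 1, can be seen in a small generic sketch of the Jensen-Shannon divergence with base-2 logarithms. This is not the citing authors' code and does not call textcat; the function names and toy distributions are assumptions for illustration.

```python
import math


def _kl(p, q):
    """D(p || q) in bits; terms with p[k] == 0 contribute nothing."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two frequency distributions."""
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    # The mixture m is positive wherever p or q is, so both KL terms stay finite.
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def js_matrix(distributions):
    """Pairwise divergence matrix, as a list of rows."""
    return [[js_divergence(a, b) for b in distributions] for a in distributions]


# Hypothetical n-gram distributions for three short texts.
docs = [
    {"th": 0.7, "he": 0.3},
    {"th": 0.4, "he": 0.4, "an": 0.2},
    {"sch": 1.0},  # shares no n-grams with the first two texts
]
matrix = js_matrix(docs)
```

Texts with no n-grams in common reach the upper bound of 1, while a text compared with itself gives 0, which is why the measure is convenient for building a similarity network.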