The textcat Package for n-Gram Based Text Categorization in R

Feinerer, Ingo; Buchta, Christian; Geiger, Wilhelm; Rauch, J.; Mair, Patrick; Hornik, Kurt

doi:10.18637/jss.v052.i06

Cited by 50 publications

(29 citation statements)

References 7 publications

“…The information theory and statistics literature contain numerous measures to compare frequency distributions (see Jurafsky & Martin, 2009). In previous unpublished work, we found that using the Kullback–Leibler divergence (Kullback & Leibler, 1951) as the comparison metric outperforms the “out-of-place” distance metric of Cavnar and Trenkle and the Kullback-Leibler metric is applied using textcat (Hornik, Rauch, Buchta, & Feinerer, 2012). Although this metric is not strictly speaking a “distance” (for instance Kullback–Leibler is not symmetric), in this context the metrics behave intuitively like distances, so we refer to the output of the metric as a distance.…”

Section: Methods (mentioning)

confidence: 99%
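The two comparison metrics named in the statement above can be illustrated with a minimal Python sketch. This is not the textcat implementation (textcat is an R package) and not the citing authors' code. The function names, the trigram size, the `top=300` profile cut-off, and the additive smoothing constant `eps` are all assumptions made for illustration.

```python
from collections import Counter
import math


def ngram_counts(text, n=3):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def out_of_place(counts_a, counts_b, top=300):
    """Rank-based "out-of-place" distance in the style of Cavnar and Trenkle.

    Each n-gram in the first profile is penalised by how far its rank differs
    from its rank in the second profile; n-grams missing from the second
    profile receive a fixed maximum penalty (assumed here to equal `top`).
    """
    ranked_a = [gram for gram, _ in counts_a.most_common(top)]
    rank_b = {gram: r for r, (gram, _) in enumerate(counts_b.most_common(top))}
    return sum(
        abs(r - rank_b[gram]) if gram in rank_b else top
        for r, gram in enumerate(ranked_a)
    )


def kl_divergence(counts_p, counts_q, eps=1e-6):
    """Kullback-Leibler divergence KL(P || Q) over the union vocabulary.

    Additive smoothing with `eps` (an assumption of this sketch) keeps the
    value finite when an n-gram occurs in only one of the two texts.
    """
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + eps * len(vocab)
    total_q = sum(counts_q.values()) + eps * len(vocab)
    divergence = 0.0
    for gram in vocab:
        p = (counts_p[gram] + eps) / total_p
        q = (counts_q[gram] + eps) / total_q
        divergence += p * math.log(p / q)
    return divergence


english = ngram_counts("the cat sat on the mat")
german = ngram_counts("der hund schlaft tief")
print(out_of_place(english, german))
print(kl_divergence(english, german), kl_divergence(german, english))
```

Swapping the arguments of `kl_divergence` generally changes the result, which is the asymmetry the statement refers to when it notes that the metric is not strictly a distance.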

A computational language approach to modeling prose recall in schizophrenia

Rosenstein

Diaz-Asper

Foltz

et al. 2014

Cortex

Many cortical disorders are associated with memory problems. In schizophrenia, verbal memory deficits are a hallmark feature. However, the exact nature of this deficit remains elusive. Modeling aspects of language features used in memory recall have the potential to provide means for measuring these verbal processes. We employ computational language approaches to assess time-varying semantic and sequential properties of prose recall at various retrieval intervals (immediate, 30 min and 24 h later) in patients with schizophrenia, unaffected siblings and healthy unrelated control participants. First, we model the recall data to quantify the degradation of performance with increasing retrieval interval and the effect of diagnosis (i.e., group membership) on performance. Next we model the human scoring of recall performance using an n-gram language sequence technique, and then with a semantic feature based on Latent Semantic Analysis. These models show that automated analyses of the recalls can produce scores that accurately mimic human scoring. The final analysis addresses the validity of this approach by ascertaining the ability to predict group membership from models built on the two classes of language features. Taken individually, the semantic feature is most predictive, while a model combining the features improves accuracy of group membership prediction slightly above the semantic feature alone as well as over the human rating approach. We discuss the implications for cognitive neuroscience of such a computational approach in exploring the mechanisms of prose recall.

“…This task demonstrates how our approach performs on real event based sequences (non-time series) rather than artificially generated data. The outcomes are compared with the text mining algorithm -"TextCat" (Hornik et al 2013). …”

Section: Figure 4: the Procedures Of Event Group Based Ts Classificati… (mentioning)

confidence: 99%

An Event Group Based Classification Framework for Multi-variate Sequential Data

Sun

Stirling

2017

AJIS

Decision tree algorithms were not traditionally considered for sequential data classification, mostly because feature generation needs to be integrated with the modelling procedure in order to avoid a localisation problem. This paper presents an Event Group Based Classification (EGBC) framework that utilises an X-of-N (XoN) decision tree algorithm to avoid the feature generation issue during the classification on sequential data. In this method, features are generated independently based on the characteristics of the sequential data. Subsequently an XoN decision tree is utilised to select and aggregate useful features from various temporal and other dimensions (as event groups) for optimised classification. This leads the EGBC framework to be adaptive to sequential data of differing dimensions, robust to missing data and accommodating to either numeric or nominal data types. The comparatively improved outcomes from applying this method are demonstrated on two distinct areas: a text based language identification task and a honeybee dance behaviour classification problem. A further motivating industrial problem, hot metal temperature prediction, is also considered with the EGBC framework in order to address significant real-world demands.

“…98% of all treaty texts in the original dataset were found. These texts underwent some cleaning, and then the textcat package in R was used to construct a matrix of Jensen-Shannon divergences between the n-gram frequency distributions of all MFA texts (Hornik et al, 2013). Jensen-Shannon was chosen over competitors for its twin advantages of being symmetric and finite (ranging between 0 and 1).…”

Section: Mfa Content Similarity Network -Bb Network (mentioning)

confidence: 99%
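The pairwise matrix described in the statement above can be sketched in Python as follows. This is an illustrative reconstruction, not the textcat code and not the citing authors' pipeline. The function names and the trigram size are assumptions, and the base-2 logarithm is chosen so that the divergence falls between 0 and 1 as the statement says.

```python
from collections import Counter
import math


def ngram_distribution(text, n=3):
    """Relative frequencies of overlapping character n-grams in a string."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence between two n-gram distributions.

    With base-2 logarithms the value lies in [0, 1]: 0 for identical
    distributions and 1 for distributions that share no n-grams.
    """
    divergence = 0.0
    for gram in set(p) | set(q):
        p_g = p.get(gram, 0.0)
        q_g = q.get(gram, 0.0)
        mid = 0.5 * (p_g + q_g)  # mixture distribution M = (P + Q) / 2
        if p_g > 0:
            divergence += 0.5 * p_g * math.log2(p_g / mid)
        if q_g > 0:
            divergence += 0.5 * q_g * math.log2(q_g / mid)
    return divergence


def js_matrix(texts, n=3):
    """Square matrix of pairwise Jensen-Shannon divergences."""
    dists = [ngram_distribution(text, n) for text in texts]
    return [[js_divergence(a, b) for b in dists] for a in dists]


# Placeholder strings, not treaty texts from the cited study.
matrix = js_matrix(["the cat sat on the mat", "the cat sat on a hat", "xyzxyzxyz"])
for row in matrix:
    print([round(value, 3) for value in row])
```

Unlike the Kullback-Leibler divergence, no smoothing is needed here: the mixture term is positive wherever either distribution is, so every logarithm is finite.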