2017
DOI: 10.1177/1536867x1801700406
Text Mining with n-gram Variables

Abstract: Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of…
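The abstract describes the bag-of-words representation: each document becomes a row of counts, one variable per n-gram. A minimal Python sketch of that idea follows. It is illustrative only, not the Stata ngram command itself, and all function and variable names are made up for this example:

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-grams (contiguous word sequences) in a text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# A tiny corpus: each document will become one row of n-gram counts.
docs = ["the cat sat on the mat", "the dog sat on the log"]
n = 2

# Vocabulary: every bigram seen anywhere in the corpus.
vocab = sorted({g for d in docs for g in ngrams(d, n)})

# One count variable per n-gram, one row per document ("bag of words").
for d in docs:
    counts = Counter(ngrams(d, n))
    print([counts.get(g, 0) for g in vocab])
```

Each printed row corresponds to the hundreds or thousands of count variables the abstract mentions; statistical models are then fit to these counts.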

Cited by 38 publications (19 citation statements). References 17 publications.
“…For instance, according to the results of [26], n-grams work better on shorter texts, since the presence of a word in a short text carries more weight than in a long one. That is, a word loses its significance in a long text.…”
Section: N-gram (mentioning)
confidence: 99%
“…As a result, we can say that the system is a Markov process of order n, where the previous n messages form a state that influences the next one. Sequences of n consecutive messages are often called “n-grams”, and their analysis is common in sequence-modelling domains such as Natural Language Processing (NLP) [37], [38], [39]. The most straightforward way to use this property is to perform a history search: every time we want to make a prediction, we take the previous n messages and search the entire training dataset for the message that most commonly occurs after this n-gram.…”
Section: E. Benchmark (mentioning)
confidence: 99%
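The history-search baseline this quote describes is simple enough to sketch. The following Python fragment is a reconstruction under the quote's stated assumptions (a training sequence of discrete messages); predict_next and all other names are hypothetical, not taken from the cited paper:

```python
from collections import Counter

def predict_next(history, training, n):
    """Predict the next message by history search: find every occurrence of
    the last n messages of `history` in `training`, and return the message
    that most commonly follows that n-gram."""
    context = tuple(history[-n:])
    followers = Counter(
        training[i + n]
        for i in range(len(training) - n)
        if tuple(training[i:i + n]) == context
    )
    if not followers:
        return None  # the n-gram never occurs in the training data
    return followers.most_common(1)[0][0]

# Example with messages encoded as strings.
training = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
print(predict_next(["x", "a", "b"], training, n=2))  # -> "c" ("a b" is followed by "c" twice)
```

Scanning the whole training set on every query costs O(len(training)) per prediction; the quote presents this as the most straightforward method, not the most efficient one.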
“…This technique can be easily employed for Western languages. More details on the n-gram approach to text mining can be found in computer science books (Büttcher, Clarke, & Cormack, 2010, chapter 3) and are also described in Schonlau, Guenther, and Sucholutsky (2017).…”
Section: Turning Text Data Into N-gram Variables (mentioning)
confidence: 99%