2017
DOI: 10.1177/1536867x1801700406
Text Mining with n-gram Variables

Abstract: Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of…
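The abstract describes the bag-of-words representation: each document becomes a row of counts, one variable per n-gram. A minimal Python sketch of that idea follows. It is illustrative only, not the Stata ngram command itself, and all function and variable names are made up for this example:

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-grams (contiguous word sequences) in a text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# A tiny corpus: each document will become one row of n-gram counts.
docs = ["the cat sat on the mat", "the dog sat on the log"]
n = 2

# Vocabulary: every bigram seen anywhere in the corpus.
vocab = sorted({g for d in docs for g in ngrams(d, n)})

# One count variable per n-gram, one row per document ("bag of words").
for d in docs:
    counts = Counter(ngrams(d, n))
    print([counts.get(g, 0) for g in vocab])
```

Each printed row corresponds to the hundreds or thousands of count variables the abstract mentions; statistical models are then fit to these counts.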

Cited by 38 publications (19 citation statements). References 17 publications.
“…For instance, according to the results of [26], n-grams work better on shorter texts, since the presence of a word in a short text carries more weight than in a long one. That is, a word loses its significance in a long text.…”
Section: N-gram (mentioning)
confidence: 99%
“…As a result, we can say that the system is a Markov process of order n, where the previous n messages form a state that influences the next one. Sequences of n consecutive messages are often called “n-grams”, and their analysis is common in sequence-modelling domains such as Natural Language Processing (NLP) [37], [38], [39]. The most straightforward way to use this property is to perform a history search: every time we want to make a prediction, we take the previous n messages and search the entire training dataset for the message that most commonly occurs after this n-gram.…”
Section: E. Benchmark (mentioning)
confidence: 99%
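The history-search baseline this quote describes is simple enough to sketch. The following Python fragment is a reconstruction under the quote's stated assumptions (a training sequence of discrete messages); predict_next and all other names are hypothetical, not taken from the cited paper:

```python
from collections import Counter

def predict_next(history, training, n):
    """Predict the next message by history search: find every occurrence of
    the last n messages of `history` in `training`, and return the message
    that most commonly follows that n-gram."""
    context = tuple(history[-n:])
    followers = Counter(
        training[i + n]
        for i in range(len(training) - n)
        if tuple(training[i:i + n]) == context
    )
    if not followers:
        return None  # the n-gram never occurs in the training data
    return followers.most_common(1)[0][0]

# Example with messages encoded as strings.
training = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
print(predict_next(["x", "a", "b"], training, n=2))  # -> "c" ("a b" is followed by "c" twice)
```

Scanning the whole training set on every query costs O(len(training)) per prediction; the quote presents this as the most straightforward method, not the most efficient one.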
“…This technique can be easily employed for Western languages. More details on the n-gram approach to text mining can be found in computer science books (Büttcher, Clarke, & Cormack, 2010, chapter 3) and are also described in Schonlau, Guenther, and Sucholutsky (2017).…”
Section: Turning Text Data Into N-gram Variables (mentioning)
confidence: 99%