Text Analysis in R

Welbers, Kasper; Atteveldt, Wouter van; Benoit, Kenneth

doi:10.1080/19312458.2017.1387238

Cited by 237 publications

(148 citation statements)

References 33 publications

Supporting

Mentioning

140

Contrasting

Unclassified


“…To address the first aim of this study, we created an e-cigarette topics dictionary using the software package Quanteda (Welbers, Van Atteveldt, & Benoit, 2017). This dictionary categorized Reddit submissions based on seven different e-cigarette-related topics: (1) "advice," which refers to submissions about seeking information; (2) "build your own," which refers to submissions about e-cigarette parts or kits that can be used to build vaping devices; (3) "buying/selling," which refers to submissions about e-cigarettes as merchandise; (4) "drugs," which refers to submissions about the use of vaping devices for illicit purposes, such as vaping marijuana; (5) "e-juice," which refers to submissions about e-liquid or e-liquid flavors; (6) "health/safety," which refers to submissions about the various health effects associated with e-cigarettes; and (7) "tobacco," which refers to submissions containing tobacco-related content, including mentions of combustible cigarettes or nicotine.…”

Section: Identifying Submission Topics (mentioning)

confidence: 99%
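The dictionary approach described in the statement above can be sketched in a few lines. The citing study used quanteda in R; the Python version below is only an illustrative stand-in, and the keyword lists, the `tokenize` and `classify` helpers, and the example submission are all hypothetical (the excerpt does not give the study's actual dictionary entries).

```python
import re
from collections import Counter

# Hypothetical keyword lists for illustration only; the study's actual
# dictionary entries are not given in the excerpt above.
TOPIC_DICTIONARY = {
    "advice": ["advice", "help", "recommend"],
    "buying/selling": ["sale", "price", "discount"],
    "e-juice": ["juice", "flavor", "liquid"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, dictionary=TOPIC_DICTIONARY):
    """Count, per topic, how many tokens match that topic's keywords."""
    counts = Counter(tokenize(text))
    return {
        topic: sum(counts[word] for word in words)
        for topic, words in dictionary.items()
    }


# A made-up submission; each topic receives the number of keyword hits.
scores = classify("Any advice on a cheap juice? Looking for a discount on mango flavor.")
print(scores)
```

A submission can match several topics at once under this scheme; whether to keep all matches or only the highest-scoring topic is a design choice the excerpt does not specify.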

Topic Clustering of E-Cigarette Submissions Among Reddit Communities: A Network Perspective

Barker

Rohde

2019

Health Educ Behav


E-cigarette use in the United States has significantly grown in recent years. Widespread diffusion of e-cigarette content across social media communities may be contributing to this growth. In this study, we (1) explored topics related to e-cigarettes and vaping on Reddit and (2) examined the extent to which these topics clustered across distinct communities. We analyzed a total of N = 79,783 Reddit submissions posted between March 2017 and February 2018 that mentioned at least one e-cigarette or vaping keyword. We created a dictionary to classify submissions into seven different topics related to e-cigarettes and vaping. Submissions were also categorized into one of six mutually exclusive communities identified using subreddit meta-data. Our results indicate that e-cigarette and vaping content on Reddit is primarily about the buying and selling of e-cigarette products. Other common topics included how to build vaping devices, e-juice, and e-cigarette advice. Network correlation analyses found that the distribution of our seven identified topics varied significantly among general e-cigarette, drugs, and research/news subreddit communities. Findings from this study add to a growing literature investigating e-cigarettes and vaping on social media and also contribute to network-level theories by linking communities on Reddit to the diffusion of various depictions of e-cigarettes and vaping.


“…Schubert et al [31] presented a novel methodology to model word significance and word affinity in a text and build the word cloud based on the derived dependency. Welbers et al [32] provided a summary of common steps and actions in a computational text analysis project and demonstrated how every step can be completed using the R statistical software.…”

Section: F Text Mining (mentioning)

confidence: 99%

“…This evaluation generates the weight of a term directly symmetric to its frequency in each document and inversely symmetric to its frequency to the set of documents. In a term-document matrix rows correspond to terms and columns correspond to documents in corpus [32]. Weighting of TDM is term frequency-inverse document frequency (TF-IDF) [14].…”

Section: Constructing TDM (mentioning)

confidence: 99%
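The weighting described in the statement above (a term's weight grows with its frequency in a document and shrinks with the number of documents that contain it) can be illustrated with a small term-document matrix. This is a minimal Python sketch, not the citing paper's implementation: the three example documents are invented, and the formula `tf * log(N / df)` is one common TF-IDF variant assumed here for illustration.

```python
import math

# Invented example documents (not from the cited study).
docs = [
    "the system shall log errors",
    "the system shall encrypt data",
    "users export data",
]
tokenized = [d.split() for d in docs]
terms = sorted({t for doc in tokenized for t in doc})

# Term-document matrix: one row per term, one column per document.
tdm = [[doc.count(term) for doc in tokenized] for term in terms]


def tfidf(matrix):
    """Weight each raw count by log(N / df), where df is the number of
    documents in which the term occurs (an assumed, common variant)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df)
        weighted.append([count * idf for count in row])
    return weighted


weights = dict(zip(terms, tfidf(tdm)))
print(weights["log"])  # occurs in one document only, so it gets the largest idf
print(weights["the"])  # occurs in two documents, so it is weighted lower
```

Terms that appear in every document would receive a weight of zero under this variant, which is why smoothed versions of the formula are also in use.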

Actionable Analytics on Software Requirement Specifications

Bamizadeh

Kumar

Kumar

et al.

2020

IJRTE


The volume of data, and the need to process it into useful information, has widened the scope of data mining and made it promising in recent years. Software intelligence (SI), as the future of mining software engineering data, presents theories and techniques to augment software decision making by using fact-based support systems. SI exposes software practitioners to up-to-date and relevant information to support their daily decision activities over the complete software development life cycle. Software documents contain important information for many software engineering tasks, and one such important document is the software requirement specification (SRS), which details the system and user requirements. Inexplicit, ambiguous, or imperfect requirements lead to a product that users do not accept. Constructing a strong software specification can be supported by building a semantic space, validating new specifications for completeness, categorizing software requirement specifications, and identifying significant concepts and related keywords. This paper proposes a knowledge management system for software document repositories using data analytics and demonstrates its creation and usage for a document set of software requirement specifications.


“…While data mining (DM) assumes that data is stored in a structured format, TM data needs no structured format. Thus, TM data requires the application of preprocessing operations to identify and extract features representative of natural language documents (Welbers, Van Atteveldt, & Benoit, 2017). Due to the importance of natural language processing in TM, the latter draws on the advances of other computer science disciplines, like data science, to achieve its objectives.…”

Section: Automatic Literature Review (mentioning)

confidence: 99%

“…For this reason, and according to the scope of this work, it was decided to create a single dataset based on the fields described in Figure 4 by fusion of the two results. This involved a normalisation process: the conversion of all text into lowercase, thus transforming all words into a uniform form (Welbers et al, 2017). All text preprocessing was performed using the "NLP" (Hornik, 2017) and "tm" (Feinerer & Hornik, 2017) R packages.…”

Section: Data Extraction and Pre-processing (mentioning)

confidence: 99%
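The effect of the lowercase normalisation mentioned in the statement above is easiest to see on word counts. The citing study did this with the "NLP" and "tm" R packages; the Python sketch below is only a stand-in, and the `normalise` helper and the example sentence are hypothetical.

```python
from collections import Counter


def normalise(text):
    """Convert all text to lowercase so case variants share one form."""
    return text.lower()


# Invented example text (not taken from the study's dataset).
raw = "Booking cancellation forecasts: Cancellation rates and booking policies"

counts_raw = Counter(raw.split())
counts_norm = Counter(normalise(raw).split())

# Before normalisation "Booking" and "booking" are counted separately;
# afterwards they collapse into a single term.
print(counts_raw["booking"], counts_norm["booking"])
```

Lowercasing is usually only the first step; punctuation removal and stemming would be needed to merge forms such as "forecasts:" and "forecast".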

Predictive models for hotel booking cancellation: a semi-automated analysis of the literature

António

Almeida

Nunes

2019

TMS


This study sought to combine data science tools and capabilities with human judgement and interpretation in order to demonstrate how semiautomatic analysis of the literature can contribute to identifying and synthesising research findings and topics about booking cancellation forecasting. The study also focused on recording in detail the analysis's full experimental procedure to encourage other researchers to conduct automated literature reviews in order to understand more fully the current tendencies in their field of study. The data were obtained through a keyword search in Scopus and Web of Science databases. The methodology presented not only diminishes human bias but also enhances data visualisation and text mining techniques' ability to facilitate abstraction, expedite analysis and improve literature reviews. The results show that, despite the importance of forecasting booking cancellations to understanding net demand and improving cancellation and overbooking policies, further research on this subject is needed.
