Getting Started with Topic Modeling and MALLET

Graham, Shawn; Weingart, Scott; Milligan, Ian

doi:10.46430/phen0017

Cited by 77 publications

(37 citation statements)

References 0 publications


“…As the words in a tweet are known, topics, which are latent variables, can be estimated through Gibbs sampling 29. We use the Mallet implementation of the LDA algorithm, adjusting one parameter (alpha=5) to favour fewer topics per tweet 30. All other parameters were kept at their default.…”

Section: Methods (mentioning)

confidence: 99%

Studying expressions of loneliness in individuals using twitter: an observational study

et al. 2019


Objectives: Loneliness is a major public health problem and an estimated 17% of adults aged 18–70 in the USA reported being lonely. We sought to characterise the (online) lives of people who mention the words ‘lonely’ or ‘alone’ in their Twitter timeline and correlate their posts with predictors of mental health.

Setting and design: From approximately 400 million tweets collected from Twitter in Pennsylvania, USA, between 2012 and 2016, we identified users whose Twitter posts contained the words ‘lonely’ or ‘alone’ and compared them to a control group matched by age, gender and period of posting. Using natural-language processing, we characterised the topics and diurnal patterns of users’ posts, their association with linguistic markers of mental health and if language can predict manifestations of loneliness. The statistical analysis, data synthesis and model creation were conducted in 2018–2019.

Primary outcome measures: We evaluated counts of language features in the users with posts including the words lonely or alone compared with the control group. These language features were measured by (a) open-vocabulary topics, (b) Linguistic Inquiry Word Count (LIWC) lexicon, (c) linguistic markers of anger, depression and anxiety, and (d) temporal patterns and number of drug words. Using machine learning, we also evaluated if expressions of loneliness can be predicted in users’ timelines, measured by area under curve (AUC).

Results: Twitter timelines of users (n=6202) with posts including the words lonely or alone were found to include themes about difficult interpersonal relationships, psychosomatic symptoms, substance use, wanting change, unhealthy eating and having troubles with sleep. Their posts were also associated with linguistic markers of anger, depression and anxiety. A random forest model predicted expressions of loneliness online with an AUC of 0.86.

Conclusions: Users’ Twitter timelines with the words lonely or alone often include psychosocial features and can potentially have associations with how individuals express and experience loneliness. This can inform low-resource online assessment for high-risk individuals experiencing loneliness and interventions focused on addressing morbidities in this condition.



“…The number of topics retrieved for tweets about each drug was varied using an optimum topic number test as suggested by a previous method [ 59 ]. We applied the LDA topic model to the documents (tweets) with a randomly specified number of topics and observed the per-document topic distributions results.…”

Section: Methods (mentioning)

confidence: 99%

Enhancing Seasonal Influenza Surveillance: Topic Analysis of Widely Used Medicinal Drugs Using Twitter Data

Kagashe, Yan, Suheryani

2017

J Med Internet Res


Background: Uptake of medicinal drugs (preventive or treatment) is among the approaches used to control disease outbreaks, and therefore, it is of vital importance to be aware of the counts or frequencies of most commonly used drugs and trending topics about these drugs from consumers for successful implementation of control measures. Traditional survey methods would have accomplished this study, but they are too costly in terms of resources needed, and they are subject to social desirability bias for topics discovery. Hence, there is a need to use alternative efficient means such as Twitter data and machine learning (ML) techniques.

Objective: Using Twitter data, the aim of the study was to (1) provide a methodological extension for efficiently extracting widely consumed drugs during seasonal influenza and (2) extract topics from the tweets of these drugs and to infer how the insights provided by these topics can enhance seasonal influenza surveillance.

Methods: From tweets collected during the 2012-13 flu season, we first identified tweets with mentions of drugs and then constructed an ML classifier using dependency words as features. The classifier was used to extract tweets that evidenced consumption of drugs, out of which we identified the mostly consumed drugs. Finally, we extracted trending topics from each of these widely used drugs’ tweets using latent Dirichlet allocation (LDA).

Results: Our proposed classifier obtained an F1 score of 0.82, which significantly outperformed the two benchmark classifiers (ie, P<.001 with the lexicon-based and P=.048 with the 1-gram term frequency [TF]). The classifier extracted 40,428 tweets that evidenced consumption of drugs out of 50,828 tweets with mentions of drugs. The most widely consumed drugs were influenza virus vaccines that had around 76.95% (31,111/40,428) share of the total; other notable drugs were Theraflu, DayQuil, NyQuil, vitamins, acetaminophen, and oseltamivir. The topics of each of these drugs exhibited common themes or experiences from people who have consumed these drugs. Among these were the enabling and deterrent factors to influenza drugs uptake, which are keys to mitigating the severity of seasonal influenza outbreaks.

Conclusions: The study results showed the feasibility of using tweets of widely consumed drugs to enhance seasonal influenza surveillance in lieu of the traditional or conventional surveillance approaches. Public health officials and other stakeholders can benefit from the findings of this study, especially in enhancing strategies for mitigating the severity of seasonal influenza outbreaks. The proposed methods can be extended to the outbreaks of other diseases.


“…The same word may appear in multiple topics, and in some cases the topics may be more about the genre or style of the discourse than actual content‐bearing words that might more usually be viewed as a topic. Underwood () and Murakami, Thompson, Hunston, and Vajn () describe the LDA method and provide example topic lists, and Graham, Weingart, and Milligan () present a tutorial on how to implement LDA in the MALLET software. Linguists are suspicious of LDA for at least three reasons.…”

Section: Assessment Against Core Principles In Computational Linguistics (mentioning)

confidence: 99%

In search of meaning: Lessons, resources and next steps for computational analysis of financial discourse

El-Haj, Rayson, Walker, et al. 2019

Business Fin & Account

105


We critically assess mainstream accounting and finance research applying methods from computational linguistics (CL) to study financial discourse. We also review common themes and innovations in the literature and assess the incremental contributions of studies applying CL methods over manual content analysis. Key conclusions emerging from our analysis are: (a) accounting and finance research is behind the curve in terms of CL methods generally and word sense disambiguation in particular; (b) implementation issues mean the proposed benefits of CL are often less pronounced than proponents suggest; (c) structural issues limit practical relevance; and (d) CL methods and high quality manual analysis represent complementary approaches to analyzing financial discourse. We describe four CL tools that have yet to gain traction in mainstream AF research but which we believe offer promising ways to enhance the study of meaning in financial discourse. The four tools are named entity recognition (NER), summarization, semantics and corpus linguistics.

Keywords: 10-K, annual reports, computational linguistics, conference calls, corpus linguistics, earnings announcements, machine learning, NLP, semantics

Information is the lifeblood of financial markets and the amount of data available to decision-makers is increasing exponentially. Bank of England (2015) estimates that 90% of global information has been created during the last decade, (MD&A), whereas practitioners, standard setters and regulators are often interested in more granular issues such as the format and content of specific disclosures, placement of content within the overall reporting package, limits on the use of jargon concerning particular topics, etc. Second, it is not immediately obvious how commonly employed empirical proxies for discourse quality such as readability (Fog index), tone (word-frequency counts) and text re-use (cosine similarity) map into the practical properties of effective communication identified by financial market regulators.

With these caveats in mind, we proceed to review common themes and innovations in the literature and assess the incremental contributions of work applying CL methods over manual content analysis. The median AF study examines 10-K filings using basic content analysis methods such as readability algorithms and keyword counts. The degree of clustering is consistent with the initial phase of the research lifecycle, with agendas shaped as much by ease of data access and implementation as by research priorities. Nevertheless, closer inspection reveals how relatively basic word-level methods have been used to provide richer insights into the properties and effects of financial discourse.

Refinements to standard readability metrics, development of domain-specific wordlists, and the use of weighting schemes and text filtering to improve word-sense disambiguation represent welcome advances on naïve unigram word counts. We also acknowledge a move towards the use of more NLP technology in the form of machine learning and topic...


Getting Started with Topic Modeling and MALLET

Abstract: In this lesson you will first learn what topic modeling is and why you might want to employ it in your research. You will then learn how to install and work with the MALLET natural language processing toolkit to do so.
