Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important obstacles to using topic models in qualitative studies: the lack of a quality metric that closely matches human judgement in understanding topics, and the need to indicate the specific subtopics that a given qualitative study is most interested in mining. For the first problem, we propose a new quality metric, tf-idf coherence, that reflects human judgement more accurately than regular coherence, and conduct an experiment to verify this claim. For the second problem, we propose an interval semi-supervised approach (ISLDA) in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. Our experiments show that ISLDA is better for topic extraction than LDA in terms of tf-idf coherence, the number of topics matched to the predefined keywords, and topic stability. We also present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
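The interval restriction described above can be illustrated with a minimal sketch. It assumes that, during sampling, each token has a vector of unnormalized topic probabilities, and that a keyword's probabilities are masked to its assigned interval and renormalized. The function name `restrict_to_interval`, the `keyword_intervals` mapping and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np

def restrict_to_interval(topic_probs, word, keyword_intervals):
    """Mask a token's topic probabilities to the interval assigned to its word.

    keyword_intervals maps a keyword to a half-open topic interval (lo, hi).
    Words that are not keywords keep their full distribution.
    """
    probs = np.asarray(topic_probs, dtype=float)
    if word not in keyword_intervals:
        return probs / probs.sum()
    lo, hi = keyword_intervals[word]
    masked = np.zeros_like(probs)
    masked[lo:hi] = probs[lo:hi]
    total = masked.sum()
    if total == 0.0:
        # Assumption: fall back to a uniform choice inside the interval.
        masked[lo:hi] = 1.0 / (hi - lo)
        return masked
    return masked / total

# Hypothetical example: the keyword may only be assigned to topics 0 and 1.
keyword_intervals = {"ethnicity": (0, 2)}
p_keyword = restrict_to_interval([0.1, 0.3, 0.4, 0.2], "ethnicity", keyword_intervals)
p_other = restrict_to_interval([0.1, 0.3, 0.4, 0.2], "weather", keyword_intervals)
```

In this sketch the keyword's probability mass outside topics 0 and 1 is discarded, while ordinary words are sampled over all topics as in plain LDA.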
This article describes agendas as "packages" of topics of varying salience, set by Russian Internet users on Russia's leading blog platform, LiveJournal. The research involved modeling LiveJournal's topic structure, viewed as an important component of what is termed here self-generated public opinion. Topic modeling was performed automatically with the LDA algorithm and complemented with hand labeling of topics. Data were collected by software created by the authors to generate a relational database storing all posts by the top 2,000 LiveJournal users from three one-month periods: two during the Russian parliamentary and presidential elections of 2011–2012, and one control period. We find that top LiveJournal users share their attention evenly between "social/political" and "private/recreational" issues, the proportion being very stable. However, both the automatic and the manual analysis found that, in the politicized periods, topics related to national street protests displaced the diverse public affairs issues seen in the control period. The group of topics centered around social issues demonstrates the greatest volatility in terms of its composition and may serve as the foundation for monitoring self-generated public opinion by further application of sentiment/opinion mining methods to these topics.
Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems, such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method that partially solves the problem of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which is able to account for just one of the parameters of interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical grounding.
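For orientation, the textbook two-parameter Sharma–Mittal entropy of a discrete distribution can be computed as in the sketch below. This is only the generic definition, S = ((sum_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r); how the paper maps topic-model quantities onto the distribution and the parameters q and r is not stated in the abstract and is not reproduced here. The function name is hypothetical.

```python
import numpy as np

def sharma_mittal_entropy(p, q, r):
    """Generic Sharma-Mittal entropy of a discrete distribution p.

    Assumes q != 1 and r != 1; the limiting cases (Renyi, Tsallis,
    Shannon) are not handled in this sketch.
    """
    if np.isclose(q, 1.0) or np.isclose(r, 1.0):
        raise ValueError("q and r must differ from 1 in this sketch")
    p = np.asarray(p, dtype=float)
    p = p[p > 0]          # zero-probability outcomes do not contribute
    p = p / p.sum()       # normalize defensively
    power_sum = np.sum(p ** q)
    return (power_sum ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

# With r == q the expression reduces to Tsallis entropy.
uniform_value = sharma_mittal_entropy([0.25, 0.25, 0.25, 0.25], q=2.0, r=2.0)
point_mass_value = sharma_mittal_entropy([1.0, 0.0, 0.0], q=2.0, r=0.5)
```

A deterministic distribution gives zero entropy for any admissible q and r, which is a quick sanity check on an implementation.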