Latent Dirichlet allocation (LDA) topic models are increasingly being used in communication research. Yet questions regarding the reliability and validity of the approach have received little attention thus far. In applying LDA to textual data, researchers need to tackle at least four major challenges that affect these criteria: (a) appropriate pre-processing of the text collection; (b) adequate selection of model parameters, including the number of topics to be generated; (c) evaluation of the model's reliability; and (d) the process of validly interpreting the resulting topics. We review the research literature dealing with these questions and propose a methodology that addresses these challenges. Our overall goal is to make LDA topic modeling more accessible to communication researchers and to ensure compliance with disciplinary standards. Consequently, we develop a brief hands-on user guide for applying LDA topic modeling. We demonstrate the value of our approach with empirical data from an ongoing research project.
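A minimal sketch of the four-step workflow using the gensim library may help make it concrete. The abstract does not prescribe an implementation; the toy corpus, the candidate topic numbers, and the c_v coherence criterion below are illustrative assumptions, not the paper's settings:

```python
# Sketch of the LDA workflow with gensim; the toy corpus and the
# candidate topic counts are illustrative assumptions.
from gensim import corpora, models

# (a) Pre-processing: assume docs are already tokenized, lower-cased,
# with stop words and rare terms removed (stand-in for a real corpus).
docs = [
    ["food", "safety", "recall", "contamination", "regulation"],
    ["food", "labeling", "gmo", "regulation", "policy"],
    ["climate", "policy", "energy", "emissions", "debate"],
    ["climate", "emissions", "carbon", "energy", "policy"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# (b) Parameter selection: fit models for several candidate topic
# numbers and compare topic coherence to choose K.
for k in (2, 3, 5):
    lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary,
                          random_state=42, passes=10)
    coherence = models.CoherenceModel(model=lda, texts=docs,
                                      dictionary=dictionary,
                                      coherence="c_v").get_coherence()
    print(k, coherence)

# (c) Reliability: re-fit with different random seeds and compare the
# resulting topic-word distributions across runs.
# (d) Validity: inspect the top words per topic for manual labeling,
# e.g. lda.show_topics(num_words=10).
```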
We propose a methodological approach to analyzing the content of hyperlink networks that represent networked public spheres on the Internet. Using the case of the food safety movement in the United States, we demonstrate how to generate a hyperlink network with the web crawling tool Issue Crawler and merge it with the results of a probabilistic topic model of the network's content. Combining hyperlink networks with content analysis allows us to interpret such a network in its entirety and with regard to the mobilizing potential of specific sub-issues of the movement. We focus on two sub-issues in the food safety network, genetically modified food and food control, in order to trace the websites involved and their interlinking structures.
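The merging step could look like the following networkx sketch. The edge list, the site names, and the per-site topic labels are hypothetical stand-ins for an Issue Crawler export and an LDA result, since neither format is specified here:

```python
# Sketch of merging a hyperlink network with topic-model output using
# networkx; edges and topic labels are illustrative stand-ins.
import networkx as nx

# Hyperlink network: directed edges from linking site to linked site.
edges = [("fooddemocracynow.org", "fda.gov"),
         ("centerforfoodsafety.org", "fda.gov"),
         ("centerforfoodsafety.org", "nongmoproject.org")]
G = nx.DiGraph(edges)

# Dominant topic per website, e.g. derived from a topic model of the
# sites' content (hypothetical labels).
dominant_topic = {"fooddemocracynow.org": "food control",
                  "centerforfoodsafety.org": "genetically modified food",
                  "fda.gov": "food control",
                  "nongmoproject.org": "genetically modified food"}
nx.set_node_attributes(G, dominant_topic, name="topic")

# Extract the sub-network for one sub-issue to trace its actors and links.
gm_nodes = [n for n, d in G.nodes(data=True)
            if d.get("topic") == "genetically modified food"]
print(G.subgraph(gm_nodes).edges())
```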
Previous work has shown that hyperlinks reflect actors' strategic choices; these dyadic relationships depend on the actors' exogenous attributes (e.g., homophily) and the network's endogenous features (e.g., the prestige distribution among actors). We combine these factors as explanatory variables in different exponential random graph models (ERGMs) to assess the relative strength of prestige and homophily in the actors' link formation. We analyze the climate change discourse in a hyperlink network of US civil society actors, collected in November 2014, and test how relevant the different factors are, including variables such as actor type, country, position, and topic. We find that both prestige and various aspects of homophily influence link formation online. Among the different factors, positional homophily stands out, followed by prestige and other homophily effects.
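For reference, the general ERGM form that such models instantiate is shown below; the notation is the standard one, not copied from the paper. Prestige and homophily enter as components of the statistics vector g(y, x):

```latex
% Probability of observing network y, given actor attributes x:
% g(y, x) collects network statistics (edge count, in-degree prestige
% terms, homophily terms over attributes such as actor type, country,
% position, and topic), weighted by parameters \theta and normalized
% over all possible networks \mathcal{Y}.
\[
  P_\theta(Y = y \mid x) =
    \frac{\exp\!\bigl(\theta^{\top} g(y, x)\bigr)}
         {\sum_{y' \in \mathcal{Y}} \exp\!\bigl(\theta^{\top} g(y', x)\bigr)}
\]
```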
In this article, we focus on noise, in the sense of irrelevant information in a data set, as a specific methodological challenge of web research in the era of big data. We empirically evaluate several methods for filtering hyperlink networks in order to reconstruct networks that contain only webpages dealing with a particular issue. The test corpus of webpages was collected from hyperlink networks on the issue of food safety in the United States and Germany. We applied three filtering strategies and evaluated how well each excluded irrelevant content from the networks: keyword filtering, automated document classification with a machine-learning algorithm, and extraction of core networks with network-analytical measures. Keyword filtering and automated classification of webpages were the most effective methods for reducing noise, whereas extracting a core network did not yield satisfying results for this case.
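A minimal sketch of the three strategies, assuming scikit-learn and networkx; the keyword list, the toy training labels, the logistic-regression classifier, and the choice of a k-core as the network-analytical core measure are illustrative assumptions rather than the paper's exact setup:

```python
# Sketch of the three noise-filtering strategies; keywords, labels,
# classifier choice, and the k-core criterion are assumptions.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pages = {"a.org": "food safety recall of contaminated beef",
         "b.org": "cheap flights and travel deals",
         "c.org": "gm food labeling and food control policy"}

# 1) Keyword filtering: keep pages containing at least one issue keyword.
keywords = {"food", "safety", "recall"}
kept_keyword = {url for url, text in pages.items()
                if keywords & set(text.split())}

# 2) Automated classification: train on hand-coded examples (here a
# toy training set), then predict relevance for the collected pages.
train_texts = ["salmonella outbreak food recall", "hotel booking travel"]
train_labels = [1, 0]  # 1 = on-issue, 0 = off-issue
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)
preds = clf.predict(vec.transform(pages.values()))
kept_classifier = {url for url, p in zip(pages, preds) if p == 1}

# 3) Core network extraction: keep only the k-core of the hyperlink graph.
G = nx.Graph([("a.org", "c.org"), ("a.org", "b.org"), ("b.org", "c.org")])
core = nx.k_core(G, k=2)

print(kept_keyword, kept_classifier, list(core.nodes()))
```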