Word embeddings have demonstrated strong performance on NLP tasks. However, lack of interpretability and the unsupervised nature of word embeddings have limited their use within computational social science and digital humanities. We propose the use of informative priors to create interpretable and domain-informed dimensions for probabilistic word embeddings. Experimental results show that sensible priors can capture latent semantic concepts better than or on-par with the current state of the art, while retaining the simplicity and generalizability of using priors.
As large corpora of digitized text become increasingly available, researchers are rediscovering textual data’s potential fruitfulness for inquiries into social and cultural phenomena. Although textual corpora promise to enrich our knowledge of the social world, avoiding problems related to data quality remains a challenge to related empirical research. Hence, evaluating the quality of a corpus will be pivotal for future social scientific inquiries. The authors propose a conceptual framework for total corpus quality, incorporating three crucial dimensions: total corpus error, corpus comparability, and corpus reproducibility. These dimensions affect the validity and reliability of inferences drawn from textual data. In addition, the authors’ framework provides insights toward evaluating and improving studies on the basis of large-scale textual analyses. After outlining this framework, the authors then illustrate an application of the total corpus quality framework by an example case study using digitized newspaper articles to study topic salience over 75 years.
As large corpora of digitized text and novel methodologies become increasingly available, researchers are rediscovering textual data’s potential fruitfulness for inquiries into social and cultural phenomena. While textual corpora show great promise to enrich our knowledge of the social world, avoiding problems related to data quality remains a challenge to related empirical research. Hence, evaluating the quality of a corpus will be pivotal for future social science inquiries. We propose a conceptual framework for total corpus quality incorporating three important dimensions (total corpus error, corpus comparability, and corpus reproducibility) that affect the validity and reliability of inferences drawn from textual data. This framework provides insights toward evaluating and improving studies based on large-scale textual analyses. We employ a case study to exemplify and discuss how researchers can identify and measure the three proposed quality dimensions for any given corpus.
Sociologists increasingly discuss the need for more formal ways of measuring meaning from digital text archives. We bring to attention the seeded topic model, a semi-supervised and scalable extension to the standard topic model that allows the infusion of social science domain knowledge into the computational learning of meaning structures. Seed words help crystallize topics around known concepts, issues, or ideas, while allowing for topic models' basic functionality of finding associations in text data based on word co-occurrences. The method allows identification of discourses on predefined themes over time and the measuring of a theme's shared interpretation via its associations to other frequently co-occurring topics. Illustrating this theoretically informed method, we extract longitudinal measures of the Swedish understanding of immigration in a vast newspaper corpus containing millions of news articles from 1945 to 2019.
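The idea of seed words anchoring topics can be illustrated with a minimal sketch. It assumes one common way of realizing seeding, namely an asymmetric Dirichlet prior over each topic's word distribution in which seed words receive extra pseudo-counts; the abstract does not state that the authors implement it this way. The function name `build_seeded_prior`, the parameters `base` and `boost`, and the toy vocabulary are all hypothetical.

```python
def build_seeded_prior(vocab, seed_words, n_topics, base=0.01, boost=1.0):
    """Return an n_topics x len(vocab) matrix of Dirichlet pseudo-counts.

    Assumption for illustration: every word starts with a small symmetric
    pseudo-count `base`, and each seed word gets an extra `boost` in the
    topic it is meant to anchor. Topics without seeds stay symmetric and
    remain free to pick up unanticipated themes.
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = [[base] * len(vocab) for _ in range(n_topics)]
    for topic, words in seed_words.items():
        for word in words:
            if word in index:  # silently skip seeds missing from the vocabulary
                prior[topic][index[word]] += boost
    return prior


# Hypothetical toy vocabulary and seed lists.
vocab = ["immigration", "refugee", "border", "budget", "tax", "football"]
seed_words = {0: ["immigration", "refugee"], 1: ["budget", "tax"]}

prior = build_seeded_prior(vocab, seed_words, n_topics=3)
for topic, row in enumerate(prior):
    print(topic, row)
```

A matrix of this shape could then be supplied as the topic-word prior of a topic model implementation that accepts one; topic 2 here is left unseeded, which mirrors the abstract's point that seeding coexists with the model's ordinary co-occurrence-based discovery.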