2022
DOI: 10.31235/osf.io/3fkzc
Preprint

The Augmented Social Scientist. Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy

Abstract: The last decade witnessed a spectacular rise in the volume of available textual data. With this new abundance came the question of how to analyze it. In the social sciences, scholars mostly resorted to two well-established approaches, human annotation on sampled data on the one hand (either performed by the researcher, or outsourced to microworkers), and quantitative methods on the other. Each approach has its own merits - a potentially very fine-grained analysis for the former, a very scalable one for the lat…

Cited by 7 publications (13 citation statements)
References 16 publications
“…We find that advanced supervised machine learning classification methods using transformer language models can approach the performance of human analysis when it comes to inference on various internal states from short texts. Our results, thus, echo recent suggestions about the potential of deep learning methods in social science applications (van Atteveldt et al 2021; Bonikowski, Luo, and Stuhler 2022; Do, Ollion, and Shen 2022; Widmann and Wich 2022). Yet, we also suggest that increased method complexity does not always warrant a large improvement in performance: simple supervised machine learning methods such as logistic regression can sometimes perform almost as well as more complex algorithms.…”
Section: Introduction (supporting)
confidence: 89%
“…Existing evidence suggests that crowdsourcing works rather well for simpler coding tasks (Benoit et al 2016), while more complex [footnote 4] Corpus construction also calls for important decisions which are beyond our scope here (yet, see Bonikowski and Nelson 2022). Depending on the application, researchers might choose to work with whole texts, sentences, or text segments as the unit of analysis (for the latter, see approaches in Barberá et al 2021 and Do, Ollion, and Shen 2022).…”
Section: Step 1: Manual Data Coding (mentioning)
confidence: 99%
“…For many in the social sciences, computational text analysis comes in two variants, as either supervised and unsupervised methods. Supervised methods rest on the researcher's access to labels for meaning structures in text data, such as categories and a coding scheme, and extrapolate these labels on unseen text (Nelson et al 2021, Chen et al 2018, Lichtenstein & Rucks-Ahidiana 2021, Do et al 2022). Unsupervised methods, by contrast, infer information about language patterns, such as co-occurrences of words in documents, without drawing on predefined categories or coding schemes.…”
Section: Methods (mentioning)
confidence: 99%
“…They offer a way to draw on large numbers of people to gather information or complete a large task, such as developing a large dataset to train an ML model. A growing number of studies within the social sciences have used crowdsourcing to develop training data for ML processes, but they have primarily been used to analyze large-scale text corpuses (e.g., Benoit et al 2016; Benoit, Munger, and Spirling 2019; Budak, Goel, and Rao 2016; Do, Ollion, and Shen 2022; Nelson et al 2021; Wilkerson and Casas 2017; Ying, Montgomery, and Stewart 2022). Scholars in computer science have taken this approach with visual data to examine perceptions of streetscapes (Naik et al 2014) and the make and model of cars in images (Gebru et al 2017).…”
Section: Introduction (mentioning)
confidence: 99%