2016
DOI: 10.31219/osf.io/ghxj8
Preprint

The Nuts and Bolts of Automated Text Analysis. Comparing Different Document Pre-Processing Techniques in Four Countries

Abstract: Automated text analytic techniques have taken on an increasingly important role in the study of parties and political speech. Researchers have studied manifestos, speeches in parliament, and debates at party national meetings. These methods have demonstrated substantial promise for measuring latent characteristics of texts. In application, however, scaling models require a large number of decisions on the part of the researcher that likely hold substantive implications for the analysis. Past researchers propos…

Cited by 12 publications (9 citation statements)
References 22 publications
“…stemming) has been performed given that in the Italian case preprocessing tends to produce estimates that are highly correlated (Greene et al, 2015). Indeed, in the present article, the correlation between our estimates and the estimates obtained after stemming words is above 0.9.…”
Section: Estimating Policy Positions From Social Media
Citation type: mentioning
confidence: 48%
“…Following Grimmer and Stewart (2013, p. 272-273), we pre-processed the documents to make them suitable for computational text analysis by removing numbers, symbols, and words drawn from language-specific lists of stopwords. In our analyses, pre-processing by removing the 20 most frequent words instead of the stopwords (Ruedin, 2013a) produced near identical results, but we acknowledge that different pre-processing choices are likely to affect the substantive conclusions in multivariate models (Denny & Spirling, 2018; Greene, Ceron, Schumacher, & Fazekas, 2016). We do not use stemming as this decreases the effectiveness of the method (Ruedin, 2013b) and because it is not beneficial for all languages. This is especially the case for languages in which compound words are common, such as in German or Finnish, where stemming may lead to a reduction of information.…”
Section: Pre-processing and Estimation
Citation type: mentioning
confidence: 94%
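The pre-processing choices named in the statement above (dropping numbers and symbols, then removing either stopwords or the most frequent words) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the cited authors' code: the function names, the tiny stopword set, and the default of 20 are hypothetical stand-ins, and a real analysis would use a full language-specific stopword list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only;
# real analyses use language-specific lists.
STOPWORDS = {"the", "and", "of", "to", "in", "a", "we"}


def tokenize(text):
    """Lowercase the text and keep runs of letters only.

    Digits, punctuation, and other symbols are dropped because the
    pattern matches Unicode letters and nothing else.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def remove_most_frequent(docs_tokens, n=20):
    """Alternative to a stopword list: drop the n most frequent
    words, counted across the whole corpus."""
    counts = Counter(t for doc in docs_tokens for t in doc)
    frequent = {word for word, _ in counts.most_common(n)}
    return [[t for t in doc if t not in frequent] for doc in docs_tokens]
```

Running the same scaling model on the output of `remove_stopwords` and of `remove_most_frequent`, and then correlating the two sets of estimates, is one way to check how sensitive results are to this choice.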
“…Preprocessing has tremendous consequences for the quality of automated text analysis. Recent studies demonstrate how preprocessing decisions impact on sentiment analysis or dimensional scaling (Greene, Ceron, Schumacher, & Fazekas, 2016) results. Yet, the amount of necessary preprocessing also depends on the quality of the raw data.…”
Section: Preprocessing
Citation type: mentioning
confidence: 99%