The Nuts and Bolts of Automated Text Analysis. Comparing Different Document Pre-Processing Techniques in Four Countries

Greene, Zac; Cerón, Andrea; Schumacher, Gijs; Fazekas, Zoltán

doi:10.31219/osf.io/ghxj8

Cited by 12 publications

(9 citation statements)

References 22 publications

“…stemming) has been performed given that in the Italian case preprocessing tends to produce estimates that are highly correlated (Greene et al, 2015). Indeed, in the present article, the correlation between our estimates and the estimates obtained after stemming words is above 0.9.…”

Section: Estimating Policy Positions From Social Media (mentioning)

confidence: 48%

Intra-party politics in 140 characters

Cerón

2016

Party Politics

Self Cite


Scholars have emphasized the need to deepen investigation of intraparty politics. Recent studies look at social media as a source of information on the ideological preferences of politicians and political actors. In this regard, the present article tests whether social media messages published by politicians are a suitable source of data. It applies quantitative text analysis to the public statements released by politicians on social media in order to measure intraparty heterogeneity and assess its effects. Three different applications to the Italian case are discussed. Indeed, the content of messages posted online is informative on the ideological preferences of politicians and proved to be useful to understand intraparty dynamics. Intraparty divergences measured through social media analysis explain: (a) a politician's choice to endorse one or another party leader, (b) a politician's likelihood to switch off from his or her parliamentary party group; and (c) a politician's probability to be appointed as a minister.


“…Following Grimmer and Stewart (2013, p. 272-273), we pre-processed the documents to make them suitable for computational text analysis by removing numbers, symbols, and words drawn from language-specific lists of stopwords. In our analyses, pre-processing by removing the 20 most frequent words instead of the stopwords (Ruedin, 2013a) produced near identical results, but we acknowledge that different pre-processing choices are likely to affect the substantive conclusions in multivariate models (Denny & Spirling, 2018; Greene, Ceron, Schumacher, & Fazekas, 2016). We do not use stemming as this decreases the effectiveness of the method (Ruedin, 2013b) and because it is not beneficial for all languages. This is especially the case for languages in which compound words are common, such as in German or Finnish, where stemming may lead to a reduction of information.…”

Section: Pre-processing and Estimation (mentioning)

confidence: 94%

Validating Wordscores: The Promises and Pitfalls of Computational Text Scaling

Bruinsma

Gemenis

2019

Communication Methods and Measures


Wordscores is a popular computational text analysis method with numerous applications in communication research. Wordscores claims to scale documents on specified dimensions without requiring researchers to read or even understand the language of the input text. We investigate whether Wordscores delivers this claim by scaling the Euromanifestos of 117 political parties across 23 countries on 4 salient dimensions of political conflict. We assess validity by comparing the Wordscores estimates to expert surveys and other judgmental measures, and by examining the Wordscores's estimates ability to predict party membership in the European Parliament groups. We find that the Wordscores estimates correlate poorly with expert and judgmental measures of party positions, while the latter outperform Wordscores in the predictive validity test. We conclude that Wordscores does not live up to its original claim of a "quick and easy" language blind method, and urge researchers to demonstrate the validity of the method in their domain of interest before any empirical analysis. Computational text analysis is a rapidly growing research field with many applications in political communication research. From using Twitter data to identify the political preferences of citizens (
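The two pre-processing variants named in the citation statement from Bruinsma and Gemenis (2019) above (removing numbers, symbols, and stopwords, versus removing the most frequent words instead of the stopwords) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy stopword list, and the tokenization rule are all assumptions made for illustration, and a real analysis would use a full language-specific stopword list.

```python
import re
from collections import Counter

# Toy stopword list (an assumption for illustration only; real analyses
# use full language-specific lists).
STOPWORDS_EN = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep runs of letters only.

    Matching letters only drops numbers and symbols in the same step.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stopwords(docs, stopwords):
    """Variant 1: drop numbers, symbols, and words found in a stopword list."""
    return [[tok for tok in tokenize(doc) if tok not in stopwords] for doc in docs]


def remove_most_frequent(docs, k=20):
    """Variant 2: drop numbers, symbols, and the k most frequent words
    counted across the whole corpus, instead of using a stopword list."""
    tokenized = [tokenize(doc) for doc in docs]
    counts = Counter(tok for doc in tokenized for tok in doc)
    top_k = {word for word, _ in counts.most_common(k)}
    return [[tok for tok in doc if tok not in top_k] for doc in tokenized]
```

Running both variants on the same corpus and comparing the downstream estimates is one way to check how sensitive results are to this choice; the sketch deliberately omits stemming, which the quoted passage also leaves out.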


“…Preprocessing has tremendous consequences for the quality of automated text analysis. Recent studies demonstrate how preprocessing decisions impact on sentiment analysis or dimensional scaling (Greene, Ceron, Schumacher, & Fazekas, 2016) results. Yet, the amount of necessary preprocessing also depends on the quality of the raw data.…”

Section: Preprocessing (mentioning)

confidence: 99%

More than Bags of Words: Sentiment Analysis with Word Embeddings

Rudkowsky

Haselmayer

Wastian

et al. 2018

Communication Methods and Measures


Moving beyond the dominant bag-of-words approach to sentiment analysis we introduce an alternative procedure based on distributed word embeddings. The strength of word embeddings is the ability to capture similarities in word meaning. We use word embeddings as part of a supervised machine learning procedure which estimates levels of negativity in parliamentary speeches. The procedure's accuracy is evaluated with crowdcoded training sentences; its external validity through a study of patterns of negativity in Austrian parliamentary speeches. The results show the potential of the word embeddings approach for sentiment analysis in the social sciences.
