Automated text analysis allows researchers to analyze large quantities of text. Yet comparative researchers face a major challenge: people in different countries speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard human-translated text and machine-translated text. We evaluate results at both the document and the corpus level. We first find TDMs for both text corpora to be highly similar, with minor differences across languages. What is more, we find considerable overlap in the set of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, again with only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
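To illustrate the kind of document-level comparison described above, the following is a minimal sketch, not the authors' actual pipeline. It assumes a naive whitespace tokenizer, cosine similarity as the measure of TDM similarity, and Jaccard overlap as the measure of feature overlap; the function names and the two example sentences are hypothetical.

```python
from collections import Counter
import math


def term_counts(text):
    """Bag-of-words counts: lowercase, split on whitespace, strip edge punctuation."""
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return Counter(tok for tok in tokens if tok)


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (one TDM row each)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_overlap(a, b):
    """Jaccard overlap between the vocabularies of two documents."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


# Hypothetical example: one human translation and one machine translation
# of the same source sentence.
human = "The committee approved the budget proposal."
machine = "The committee has approved the budget proposal."

h, m = term_counts(human), term_counts(machine)
print(cosine_similarity(h, m), feature_overlap(h, m))
```

A real analysis would apply the same preprocessing (stemming, stop-word removal) to both corpora before comparing them, since differences in preprocessing would otherwise be confounded with differences in translation.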
News diversity is an important concern of journalism scholars, as its presence or absence can have a profound effect on democratic debate and the information available to citizens. Many have speculated that news diversity decreases over time due to changing economic circumstances. This expectation applies especially to newspapers. Using nearly two decades of newspaper data from four European countries (Denmark, the Netherlands, Norway, and the UK), we do not find this expected decrease in news diversity. When conducting pairwise, automated comparisons between articles published on the same day in the same country, we instead find a modest increase over time in diversity between newspapers. This result suggests that newspapers differentiate rather than converge in the content they offer, shedding a more positive light on the evolution of the press in our current high-choice media environments.
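The pairwise comparison described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the study's actual measure: the abstract does not name the similarity metric, so the sketch assumes cosine similarity over raw word counts and defines diversity as one minus the mean similarity across pairs of articles from different newspapers. All function names and inputs are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def bag_of_words(text):
    """Word counts from a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def daily_diversity(articles):
    """Diversity for one day in one country.

    `articles` is a list of (newspaper, text) pairs. Returns 1 minus the mean
    cosine similarity over all pairs of articles from different newspapers,
    or None when no such pair exists.
    """
    vectors = [(paper, bag_of_words(text)) for paper, text in articles]
    sims = [
        cosine(u, v)
        for (p, u), (q, v) in combinations(vectors, 2)
        if p != q  # assumption: only cross-newspaper pairs are compared
    ]
    return 1 - sum(sims) / len(sims) if sims else None
```

Computing this score for every day and tracking it over the years would give a time series of between-newspaper diversity, under the assumptions stated above.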