The paper presents the current version of the Slovene corpus of netspeak Janes which contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present automatic and manual procedures for enriching the corpus with metadata, such as user type, gender and region, and text sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow which includes tokenization, sentence segmentation, rediacritisation, normalization, morphosyntactic tagging and lemmatization.
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing on some sort of corpus is increasingly resorted to for empirically–grounded social–scientific analysis (sometimes dubbed ‘corpus–assisted discourse analysis’ or ‘corpus–based critical discourse analysis’, cf. Hardt–Mautner 1995; Baker 2016). In the post–Yugoslav space, recent corpus developments have brought table–turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist – partly due to the fast–changing background of these issues, but also due to the fact that there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond the anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties, by presenting one step–by–step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South–Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally
We propose an approach for comparing communities in social networks based on used hashtags, shared news sources, and discussed topics and their sentiment, both derived through transformer-based language modeling. We apply this multifaceted comparison to study the reactions of the Ex-Yugoslavian retweet communities to the Russian invasion of Ukraine. Our results show that even though Ex-Yugoslavian countries share a common macro-language, their retweet communities respond very heterogeneously to the invasion. We demonstrate the limitations of observing communities using topology only and illustrate the benefits and complementarity of including content when analyzing social networks. As an example, we have identified a community linked to the Serbian governing power that deliberately avoids the invasion topic, which would not be possible by analyzing hashtags only. We conclude that a multifaceted approach provides a more nuanced understanding of social network communities that can be applied to gain insights into complex social events beyond the case study presented in this paper.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.