This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as the tagging toolkit, we show that a tagger trained on a balanced set of the four source languages outperforms single-language taggers by about 9%, and that additional automatically induced morphosyntactic lexicons lead to further improvements. The best observed accuracies for Rusyn are 82.4% for part-of-speech tagging and 75.5% for full morphological tagging.
The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It requires no annotated Rusyn training data, no parallel data, and no bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, although genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 compared to previous research.
This paper reports on challenges and results in developing NLP resources for spoken Rusyn. As a Slavic minority language, Rusyn has no existing resources to draw on. We propose to build a morphosyntactic dictionary for Rusyn by combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact-match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which increases tagging recall by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the effectiveness of using NLP resources from neighboring languages for low-resourced languages.
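The abstract above does not specify how the vowel-sensitive Levenshtein distance is weighted. As a minimal sketch of the general idea, the following Python function computes an edit distance in which replacing one vowel with another is cheaper than any other substitution. The function name, the vowel inventory, and the cost of 0.5 are assumptions made for illustration only and are not taken from the paper.

```python
# Illustrative sketch only: the vowel set and the 0.5 cost are assumptions,
# not the weighting used by the authors.
VOWELS = set("аеєиіїоуюяыёэ")  # assumed Cyrillic vowel inventory


def vowel_sensitive_distance(source, target, vowel_cost=0.5):
    """Edit distance where vowel-for-vowel substitutions cost `vowel_cost`.

    Insertions, deletions, and all other substitutions cost 1.0.
    """
    # previous[j] holds the distance between the empty prefix of `source`
    # and the first j characters of `target`.
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]  # deleting the first i characters of `source`
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                substitution = 0.0
            elif s_char in VOWELS and t_char in VOWELS:
                substitution = vowel_cost
            else:
                substitution = 1.0
            current.append(min(
                previous[j] + 1.0,               # delete s_char
                current[j - 1] + 1.0,            # insert t_char
                previous[j - 1] + substitution,  # match or substitute
            ))
        previous = current
    return previous[-1]
```

Under these assumed costs, a pair of cognates that differ only in one vowel, such as Ukrainian "хліб" and Russian "хлеб", is scored as closer than a pair that differs in one consonant, which is the property a vowel-sensitive measure is meant to capture when matching source-language entries to Rusyn word forms.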
In this pilot study, we analyse the Rusyn minority language (or po-našomu 'our way of speaking', as its speakers usually call it) from the perspective of non-expert vernacular speakers in the Zakarpattia region of Western Ukraine. As a complement to traditional dialectological studies, the paper investigates attitudes and folk beliefs towards the mother tongue. Using different methods from perceptual dialectology, we compare individual representations of speech areas in Zakarpattia. To explore what ordinary people believe about the geographical distribution of linguistic varieties, we use draw-a-map tasks, among other methods. Additionally, we conducted interviews and applied methods such as correctness and pleasantness ratings in order to measure the speakers' regard for the previously identified regional varieties. The results show a mainly negative concept of language perception and of Rusyn self-identification within it. A gradual language shift towards Standard Ukrainian is occurring, since Rusyn speakers are afraid of being stigmatised as uneducated and rurally conservative. Nonetheless, there is no clear division between language use in official and non-official domains, i.e., speakers continue to use po-našomu even in semi-official domains. Rusyn can thus be classified as still vital in Zakarpattia. One of the crucial variables for the perception and production of, as well as regard for, the native variety is the educational background of the participants.