In this paper, we present an approach to developing resources for a low-resource language that takes advantage of its close relation to better-resourced languages. In particular, we test our approach on Macedonian, which lacks both natural language processing tools and the data needed to build them. We improve the Macedonian training set for supervised part-of-speech tagging by transferring available manual annotations from a number of similar languages. Our approach is based on multilingual parallel corpora, automatic word alignment, and a set of rules (majority vote). A tagger trained on the improved data set reaches 88% accuracy, significantly better than the baseline of 76%. It can serve as a stepping stone for further improvement of resources for Macedonian. The proposed approach is entirely automatic and can be easily adapted to other languages in similar circumstances.
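The majority-vote projection step can be illustrated with a minimal sketch. The abstract does not give the actual rule set, so everything here is an assumption made for illustration: the function name `project_tags`, the input layout (one dictionary per source language, mapping a target token index to the tag of its aligned source word), and the choice to leave a token untagged when the vote is tied or no source word is aligned.

```python
from collections import Counter


def project_tags(target_tokens, aligned_tags):
    """Project POS tags onto target tokens by majority vote.

    aligned_tags holds one dict per source language, mapping a target
    token index to the tag of the source word aligned to it
    (hypothetical layout, not taken from the paper).
    """
    projected = []
    for i in range(len(target_tokens)):
        # Collect one vote from every source language aligned to token i.
        votes = Counter(tags[i] for tags in aligned_tags if i in tags)
        if not votes:
            projected.append(None)  # no aligned source word
            continue
        (best, count), *rest = votes.most_common(2)
        # Assumed tie rule: leave the token untagged when the top two tags tie.
        if rest and rest[0][1] == count:
            projected.append(None)
        else:
            projected.append(best)
    return projected
```

Called with three source languages, where the third disagrees on the second token, the vote still resolves to the tag that two of the three languages support. Tokens left as `None` could be dropped from the training set or filled in by another rule.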
We analyze the dynamics of dialect loss in a cluster of villages in rural northern Russia based on a corpus of transcribed interviews, the Ustja River Basin Corpus. Eleven phonological and morphological variables are analyzed across 33 speakers born between 1922 and 1996 in a series of logistic regression models. We propose three characteristics for a comparison of the rate of loss of different variables: initial level, steepness, and turning point. We show that the dynamics of loss differs significantly across variables and discuss possible reasons for such differences, including perceptual salience, initial variation in the dialect, and convergence with regionally or socially defined varieties of Russian. In conclusion, we discuss the pros and cons of logistic regression as an approach to quantitative modeling of dialect loss. Our paper contributes to the study and documentation of Russian dialects, most of which are on the verge of extinction.
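One possible reading of the three characteristics, sketched under stated assumptions: for a single variable, the probability of the dialectal variant is modeled as logit(p) = b0 + b1 * (birth year - 1922). The initial level is then the predicted probability for the oldest speakers, the steepness is the magnitude of the slope, and the turning point is the birth year at which the predicted probability crosses 0.5. These definitions, the function name, and the coefficient values in the usage note are illustrative assumptions, not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def loss_characteristics(b0, b1, first_year=1922):
    """Summarize a fitted curve logit(p) = b0 + b1 * (year - first_year).

    Returns (initial_level, steepness, turning_point) under the
    assumed definitions described in the lead-in.
    """
    initial_level = sigmoid(b0)  # predicted probability at first_year
    steepness = abs(b1)          # change in log-odds per birth year
    # Birth year at which the predicted probability equals 0.5.
    turning_point = first_year - b0 / b1 if b1 != 0 else None
    return initial_level, steepness, turning_point
```

With hypothetical coefficients `b0 = 2.0` and `b1 = -0.05`, the curve starts at a probability of about 0.88 for speakers born in 1922 and crosses 0.5 for speakers born around 1962.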
A Spoken Corpus of Inhabitants of Polish Spisz
The article describes a dialect corpus project that documents the dialect of Polish Spisz. In contrast to the majority of dialectological research in Poland, our corpus also includes the speech of the youngest and middle generations, as its aim is also to document the sociolinguistic situation of the dialect of the region. Recordings have been transcribed into standard Polish orthography, not phonetically, which makes it possible not only to easily search the corpus but also to use existing tools to lemmatize and add morphosyntactic annotation to the texts. Users interested in the phonetic layer can access the recordings on a per-utterance basis. The article describes the stages of compiling the corpus and discusses its potential applications. The authors argue that a large corpus which covers a small, homogeneous area is a more valuable resource for dialectologists than a series of small corpora documenting a larger region.