Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing 2017
DOI: 10.18653/v1/w17-1405

Lexicon Induction for Spoken Rusyn – Challenges and Results

Abstract: This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. C…

Cited by 2 publications (3 citation statements)
References 8 publications

“…In Rabus and Scherrer (2017), we describe the automatic induction of morphosyntactic lexicons for Rusyn. In a nutshell, we match Rusyn words extracted from RUE1 and RUE2 with source language words extracted from the Polish, Slovak, Ukrainian and Russian MULTEXT-East lexicons as well as the morphological dictionary of UGtag 9 (Kotsyba et al, 2011), using vowel-sensitive Levenshtein distance, hand-written rules, and a combination of both.…”
Section: Adding Automatically Induced Lexicons
Citation type: mentioning
confidence: 99%
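
The matching step described in the statement above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vowel inventory, the reduced cost of 0.5 for vowel-to-vowel substitutions, and the function names `vowel_sensitive_levenshtein` and `best_match` are all assumptions, and the word forms are illustrative only.

```python
# Assumed Cyrillic vowel inventory; the inventory used in the paper is not given here.
VOWELS = set("аеиіоуыюяєїэё")


def vowel_sensitive_levenshtein(a, b, vowel_sub_cost=0.5):
    """Edit distance in which replacing one vowel by another costs
    `vowel_sub_cost` instead of 1. All cost values are assumptions."""
    a, b = a.lower(), b.lower()
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif ca in VOWELS and cb in VOWELS:
                sub = vowel_sub_cost  # cheaper vowel-for-vowel substitution
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub))   # substitution or match
        prev = curr
    return prev[-1]


def best_match(word, candidates):
    """Return the candidate source-language form closest to `word`."""
    return min(candidates, key=lambda c: vowel_sensitive_levenshtein(word, c))
```

With these assumed costs, two forms that differ only in one vowel are closer than two forms that differ in one consonant, which is the property the term "vowel-sensitive" suggests.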
“…In contrast, tagging performance on the Rusyn test set decreases substantially despite the lower OOV rate. In Rabus and Scherrer (2017), we have found that a large proportion of Rusyn word forms can be matched with source language word forms using hand-crafted correspondence rules or vowel-sensitive Levenshtein distance. Here, we use the latter approach to artificially create Rusyn word embeddings for the out-of-vocabulary words: each Rusyn word form of the test set that does not occur in the Panslav5 file is associated with the most similar forms in Panslav5, and its embedding vector is computed by averaging the vectors of the found similar forms.…”
Section: Word Embeddings
Citation type: mentioning
confidence: 99%
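
The embedding construction for out-of-vocabulary words described in the statement above can be sketched roughly as follows. This is a hedged illustration under stated assumptions: the statement does not say how many similar forms are averaged, so the sketch averages over all forms tied at the minimum distance; the distance costs, the vowel inventory, and the names `distance` and `oov_embedding` are hypothetical; the toy dictionary merely stands in for the Panslav5 embedding file.

```python
from typing import Dict, List

# Assumed Cyrillic vowel inventory (not taken from the cited work).
VOWELS = set("аеиіоуыюяєїэё")


def distance(a: str, b: str, vowel_cost: float = 0.5) -> float:
    """Levenshtein distance with a reduced (assumed) vowel-for-vowel cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif ca in VOWELS and cb in VOWELS:
                sub = vowel_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0, curr[j - 1] + 1.0, prev[j - 1] + sub))
        prev = curr
    return prev[-1]


def oov_embedding(word: str, embeddings: Dict[str, List[float]]) -> List[float]:
    """Return the stored vector for `word`, or, if it is missing, the average
    of the vectors of the closest known forms (ties at minimum distance)."""
    if word in embeddings:
        return embeddings[word]
    dists = {form: distance(word, form) for form in embeddings}
    best = min(dists.values())
    nearest = [embeddings[form] for form, d in dists.items() if d == best]
    dim = len(nearest[0])
    # Component-wise mean of the nearest forms' vectors.
    return [sum(vec[k] for vec in nearest) / len(nearest) for k in range(dim)]


# Toy vectors for illustration only; not real embedding values.
toy = {"хлеб": [1.0, 0.0], "хлиб": [0.0, 1.0], "вода": [4.0, 4.0]}
vector = oov_embedding("хліб", toy)
```

A fixed top-k selection would be an equally plausible reading of "the most similar forms"; the tie-based choice here simply avoids introducing an extra parameter.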
“…In our previous work on Rusyn morphosyntactic tagging, we followed a twofold approach. First, we built a morphosyntactic dictionary of Rusyn, applying bilingual lexicon induction techniques on corpora from the East Slavic languages Russian and Ukrainian and on cyrillicized corpora from the West Slavic languages Slovak and Polish (Rabus and Scherrer 2017). Word forms were matched by vowel-sensitive Levenshtein distance, manually written transformation rules and combinations of both.…”
Section: Previous Work
Citation type: mentioning
confidence: 99%