MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Erjavec, Tomaž

doi:10.1007/s10579-011-9174-8

Cited by 60 publications

(44 citation statements)

References 16 publications


“…The first was the Slovene Dependency Treebank (Džeroski et al, 2006), based on the Prague Dependency Treebank (PDT) annotation scheme (Hajičová et al, 1999) and consisting of approximately 30,000 tokens taken from the Slovenian component of the parallel MULTEXT-East corpus (Erjavec, 2012), i.e., the Slovenian translation of the novel "1984" by George Orwell.…”

Section: Dependency Treebanks For Slovenian (mentioning)

confidence: 99%

“…Within this scheme, the syntactic annotation layer consists of only 10 dependency relations, following the general assumption that specific syntactic constructions can be retrieved by combining these labels with the underlying word-level morphosyntactic descriptions (MSDs), wherein the JOS MSD tagset is identical to the tagset defined in the MULTEXT-East Version 4 morphosyntactic specifications for Slovene (Erjavec, 2012).…”

Section: Dependency Treebanks For Slovenian (mentioning)

confidence: 99%


The Universal Dependencies Treebank for Slovenian

Dobrovoljc, Erjavec, Krek

2017

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Self Cite


This paper introduces the Universal Dependencies Treebank for Slovenian. We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2. We explain the mapping of part-of-speech categories, morphosyntactic features, and the dependency relations, focusing on the more problematic language-specific issues. We conclude with a quantitative overview of the treebank and directions for further work.


“…The tagset used is defined in the (draft) MULTEXT-East morphosyntactic specification Version 5 for Slovene, which are identical to the Version 4 specifications (Erjavec, 2012), except that four new tags have been added for CMC specific phenomena, such as hashtags and mentions. Version 5 tagset for Slovene defines all together 1900 different tags (morphosyntactic descriptions, MSDs), i.e., it is a fine-grained tagset covering all the inflectional properties of Slovene words.…”

Section: CMC Dataset (mentioning)

confidence: 99%

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

Ljubešić, Erjavec, Fišer

2017

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Self Cite


In this paper we present the adaptations of a state-of-the-art tagger for South Slavic languages to non-standard texts on the example of the Slovene language. We investigate the impact of introducing in-domain training data as well as additional supervision through external resources or tools like word clusters and word normalization. We remove more than half of the error of the standard tagger when applied to nonstandard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes additional 11 percent of the error. The final configuration achieves tagging accuracy of 87.41% on the full morphosyntactic description, which is, nevertheless, still quite far from the accuracy of 94.27% achieved on standard text.
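The morphosyntactic descriptions (MSDs) referred to in the citation statement above are positional strings: the first character gives the word category and each following character encodes one attribute value. The sketch below illustrates that idea for Slovene nouns only. The function name `decode_noun_msd` and the attribute table are illustrative assumptions and cover a small subset of the values; the MULTEXT-East specification itself is the authoritative source.

```python
# Illustrative subset of the positional attributes for Slovene nouns.
# This table is an assumption for demonstration purposes; the full
# MULTEXT-East specification defines more attributes and values.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as 'Ncmsn' into a feature dictionary."""
    if not msd or msd[0] != "N":
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"Category": "Noun"}
    # Each position after the category letter encodes one attribute.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code == "-":
            # A hyphen marks an attribute that does not apply.
            continue
        # Unknown codes are passed through unchanged rather than guessed.
        features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncmsn"))
```

Because every attribute has a fixed position, a tagger that predicts the full MSD is in effect predicting all inflectional features at once, which is why accuracy on the full description is lower than on part of speech alone.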


“…All of them were taken from the MULTEXT-East repository (Erjavec et al, 2010a; Erjavec et al, 2010b; Erjavec, 2012). As Rusyn is written in Cyrillic script, we converted the Slovak and Polish dictionaries into Cyrillic script first.…”

Section: Data (mentioning)

confidence: 99%

Lexicon Induction for Spoken Rusyn – Challenges and Results

Rabus, Scherrer

2017

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing


This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which results in a tagging recall increased by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the efficiency of using NLP resources from neighboring languages for low-resourced languages.
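The abstract above mentions a vowel-sensitive Levenshtein distance without giving its definition. The sketch below shows one plausible reading, in which substituting one vowel for another costs less than any other edit. The vowel set, the cost of 0.5, and the function names are assumptions made for illustration and are not taken from the paper.

```python
# Cyrillic vowels used for the reduced substitution cost (assumed set).
VOWELS = set("аеёиоуыэюя")


def substitution_cost(a, b, vowel_cost=0.5):
    """Cost of replacing a with b; vowel-for-vowel swaps are cheaper."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return vowel_cost
    return 1.0


def vowel_sensitive_levenshtein(source, target, vowel_cost=0.5):
    """Edit distance with unit insert/delete cost and cheap vowel swaps."""
    # prev[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        curr = [float(i)]
        for j, b in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                       # delete a
                curr[j - 1] + 1.0,                   # insert b
                prev[j - 1] + substitution_cost(a, b, vowel_cost),
            ))
        prev = curr
    return prev[-1]


print(vowel_sensitive_levenshtein("дым", "дом"))  # vowel swap only
print(vowel_sensitive_levenshtein("дом", "том"))  # consonant swap
```

Under these assumptions, word pairs that differ only in a vowel rank as closer matches than pairs that differ in a consonant, which suits lexicon induction across closely related languages where vowel correspondences vary most.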

