2011
DOI: 10.1007/s10579-011-9174-8

MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Abstract: The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel "1984" by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guideline…

Cited by 60 publications (44 citation statements)
References 16 publications
“…The first was the Slovene Dependency Treebank (Džeroski et al, 2006), based on the Prague Dependency Treebank (PDT) annotation scheme (Hajičová et al, 1999) and consisting of approximately 30,000 tokens taken from the Slovenian component of the parallel MULTEXTEast corpus (Erjavec, 2012), i.e., the Slovenian translation of the novel "1984" by George Orwell.…”
Section: Dependency Treebanks For Slovenian (mentioning)
confidence: 99%
“…Within this scheme, the syntactic annotation layer consists of only 10 dependency relations, following the general assumption that specific syntactic constructions can be retrieved by combining these labels with the underlying word-level morphosyntactic descriptions (MSDs), wherein the JOS MSD tagset is identical to the tagset defined in the MULTEXT-East Version 4 morphosyntactic specifications for Slovene (Erjavec, 2012).…”
Section: Dependency Treebanks For Slovenian (mentioning)
confidence: 99%
“…The tagset used is defined in the (draft) MULTEXT-East morphosyntactic specification Version 5 for Slovene, which are identical to the Version 4 specifications (Erjavec, 2012), except that four new tags have been added for CMC specific phenomena, such as hashtags and mentions. Version 5 tagset for Slovene defines all together 1900 different tags (morphosyntactic descriptions, MSDs), i.e., it is a fine-grained tagset covering all the inflectional properties of Slovene words.…”
Section: CMC Dataset (mentioning)
confidence: 99%
“…All of them were taken from the MULTEXT-East repository (Erjavec et al, 2010a; Erjavec et al, 2010b; Erjavec, 2012). As Rusyn is written in Cyrillic script, we converted the Slovak and Polish dictionaries into Cyrillic script first.…”
Section: Data (mentioning)
confidence: 99%