Proceedings of the Workshop on Balto-Slavonic Natural Language Processing Information Extraction and Enabling Technologies - AC 2007
DOI: 10.3115/1567545.1567563

Morphological annotation of the Lithuanian corpus

Abstract: As information technologies develop, large morphologically annotated corpora become a necessity, since they are required for moving on to higher levels of language computerisation (e.g. automatic syntactic and semantic analysis, information extraction, machine translation). The article presents research on morphological disambiguation and the morphological annotation of the 100-million-word Lithuanian corpus. Statistical methods have made it possible to develop the automatic tool of morpholo…

Cited by 9 publications (5 citation statements)
References 2 publications
“…Two techniques were chosen for pre-processing the document title data set: tokenisation and lemmatisation (Table 1). The text data was lemmatised using the Lithuanian morphological analyser-lemmatiser 'Lemuoklis' (Daudaravičius et al., 2007; Zinkevičius, 2000). Stop-word removal was not applied because no stop-word list has yet been defined for Lithuanian, and such words do not cause a big problem in Lithuanian anyhow, given the inflective nature of the language (Kapočiūtė-Dzikienė et al., 2012).…”
Section: Problem Formulation
mentioning
confidence: 99%
“…Europos (Europe, in genitive) would be replaced with Europa (Europe, in nominative); modifikuotas (modified) would be replaced with modifikuoti (to modify). For lemmatizing the text we used the Lithuanian part-of-speech tagger and lemmatizer "Lemuoklis" [35,36]. It should be emphasized that this feature type is strongly recommended for highly inflective languages, because lemmatization significantly decreases the sparseness of the data.…”
Section: Explored Feature Types
mentioning
confidence: 99%
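The effect described in the statement above, lemmatisation collapsing inflected forms and so reducing data sparseness, can be illustrated with a minimal Python sketch. It does not use or reproduce 'Lemuoklis'; the lookup table `TOY_LEMMAS` and the helper names are hypothetical, and only the two word pairs quoted above are taken from the source.

```python
from collections import Counter

# Hypothetical toy lemma table. Only these two pairs come from the quoted
# citation statement; a real lemmatiser would rely on morphological analysis.
TOY_LEMMAS = {
    "europos": "europa",            # genitive -> nominative
    "modifikuotas": "modifikuoti",  # participle -> infinitive
}


def lemmatise(tokens, lemma_table=TOY_LEMMAS):
    """Lowercase each token and map it to its lemma; unknown tokens pass through."""
    return [lemma_table.get(t.lower(), t.lower()) for t in tokens]


def vocabulary_size(tokens):
    """Number of distinct token types, a rough proxy for feature sparseness."""
    return len(set(tokens))


tokens = ["Europos", "Europa", "modifikuotas", "modifikuoti"]
surface_forms = [t.lower() for t in tokens]
lemmas = lemmatise(tokens)
lemma_counts = Counter(lemmas)

print(vocabulary_size(surface_forms), "surface types ->", vocabulary_size(lemmas), "lemma types")
```

With these four tokens, the distinct surface forms collapse to two lemma types, so each remaining feature is observed more often.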
“…Hungarian-Lithuanian: As for Hungarian and Lithuanian, the Lithuanian Centre of Computational Linguistics, Vytautas Magnus University, provided us with sentence-segmented and morphologically disambiguated texts. We selected the Lithuanian texts from the Lithuanian National Corpus (Rimkutė et al., 2007) and from the Lithuanian-English parallel corpus (Rimkutė et al., 2008) for which Hungarian counterparts were available. The annotated texts were manually checked to detect missing parts and insertions.…”
Section: Proof-of-concept Experiments: One-token Units
mentioning
confidence: 99%
“…12: Hungarian case suffixes. As for Lithuanian, we have considered three types of information: part-of-speech category, gender and case, based on Rimkutė et al. (2007).…”
mentioning
confidence: 99%