2020
DOI: 10.1017/s1351324920000224

Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks

Abstract: In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context representation suffers from sparseness, while in the majority of cases the morphosyntactic features of a word br…
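
The input representation described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the function names `build_lemmatizer_input` and `decode_lemma`, the `UPOS=` prefix, and the Finnish example word are hypothetical, and the exact symbol format used by the authors is not given in the text above. The sketch shows the general idea of feeding the surface characters together with the tagger's morphosyntactic features as one flat sequence, and of reading the lemma back as a sequence of characters.

```python
from typing import List


def build_lemmatizer_input(word: str, upos: str, feats: str) -> List[str]:
    """Turn a word and its tags into one flat sequence of input symbols.

    `feats` is assumed to follow the CoNLL-U convention: features joined
    by "|", or "_" when the word has no features.
    """
    chars = list(word)                      # one symbol per surface character
    tags = [f"UPOS={upos}"]                 # part-of-speech tag as a single symbol
    if feats and feats != "_":
        tags.extend(feats.split("|"))       # one symbol per morphological feature
    return chars + tags


def decode_lemma(output_symbols: List[str]) -> str:
    """Join the characters emitted by a (hypothetical) decoder into a lemma."""
    return "".join(output_symbols)


# Hypothetical example: Finnish "koirien" (genitive plural of "koira").
example = build_lemmatizer_input("koirien", "NOUN", "Case=Gen|Number=Plur")
print(example)
print(decode_lemma(["k", "o", "i", "r", "a"]))
```

Because the context is carried by a handful of tag symbols rather than by neighbouring word forms, the same word with different tags yields different input sequences, which is what makes the lemmatizer context-sensitive in this sketch.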

Cited by 32 publications (44 citation statements)
References 27 publications

“…We report precision, recall and F1-score for in-domain senses and out-of-domain senses, except for Lithuanian, where not enough examples are available. Precision and recall are computed as follows (footnote 9):

$$\text{Precision} = \frac{\#\,\text{examples with correct target words}}{\#\,\text{examples with either correct or incorrect target words}}$$

(Footnote 8) We used the Turku neural lemmatizer with pretrained models (Kanerva et al, 2019). For Lithuanian, as no pretrained model was available, we trained one using the respective available data from the Universal Dependencies project.…”
Section: WMT 2019 Test Suite Results (mentioning)
confidence: 99%
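
The precision definition quoted above can be written as a small helper. This is a minimal sketch: the function and parameter names are hypothetical, and because the quoted statement is cut off before the recall formula, recall is deliberately not reconstructed here. The F1 helper uses the standard harmonic mean of precision and recall and takes both values as inputs.

```python
def precision(correct: int, incorrect: int) -> float:
    """Share of examples with a correct target word among all examples
    that contain either a correct or an incorrect target word."""
    total = correct + incorrect
    return correct / total if total else 0.0


def f1_score(p: float, r: float) -> float:
    """Standard F1: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
print(precision(correct=8, incorrect=2))
```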
“…Lemmatization is a process in text preprocessing that determines the shape of a word and changes it into a root word, or finds the root of each word based on the context of the sentence [6]. The purpose of lemmatization is to optimize the text mining process.…”
Section: Lemmatization (mentioning)
confidence: 99%
“…Lemmatisation has been of interest in NLP for the last few decades [Hann, 1974]. Since then, tools for lemmatisation have been divided into universal lemmatisers [Straka et al, 2017] [Bergmanis and Goldwater, 2018] [Kanerva et al, 2020] and specific lemmatisers designed to execute a particular task, for instance, for a particular language [Džeroski and Erjavec, 2001] [Groenewald, 2007] [Tamburini, 2013] or for a particular POS [Prinsloo, 2012] [Gouws and Prinsloo, 2012] [Nthambeleni and Musehane, 2014], or a group of words within a POS [Fernández, 2020], or a class of words with a very specific behaviour, such as words within fixed expressions [Farkas et al, 2008] [Mulhall, 2008] [Kosch, 2016]. One approach unites both lemmatiser and tagger in a single model [Spyns, 1996] [Aduriz et al, 1998].…”
Section: Previous Work (mentioning)
confidence: 99%
“…Thus, automatic lemmatisation with this approach may be defined as a learning task of determining a lemmatising rule on the basis of a given word, and using it to acquire the lemma of the given word. The second definition relies on the newer approach that appeared during the last decade, in which lemmatisation is done in one step [Kanerva et al, 2020]. A tool takes a given word and auxiliary information, such as the POS tag, or morphological data, or left context, and produces a lemma.…”
Section: Introduction (mentioning)
confidence: 99%
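
The first definition in the statement above, lemmatisation as choosing a rule and then applying it, can be illustrated with a suffix-edit rule. This is a sketch under assumptions: encoding a rule as "strip n trailing characters, then append a suffix" is one common choice and is not necessarily the encoding used by the citing paper, and the names `derive_rule` and `apply_rule` are hypothetical. In a rule-based system a classifier would predict the rule for an unseen word; a one-step system would instead generate the lemma directly from the word and its auxiliary information.

```python
from typing import Tuple

# A rule is (number of trailing characters to strip, suffix to append).
Rule = Tuple[int, str]


def derive_rule(word: str, lemma: str) -> Rule:
    """Derive a suffix-edit rule from a (word, lemma) training pair."""
    prefix_len = 0
    for w, l in zip(word, lemma):
        if w != l:
            break
        prefix_len += 1                     # length of the shared prefix
    return (len(word) - prefix_len, lemma[prefix_len:])


def apply_rule(word: str, rule: Rule) -> str:
    """Apply a suffix-edit rule to a word to obtain its lemma."""
    strip, suffix = rule
    keep = max(0, len(word) - strip)        # guard against over-stripping
    return word[:keep] + suffix


rule = derive_rule("studies", "study")      # learned from one pair
print(rule, apply_rule("studies", rule))
print(apply_rule("talking", derive_rule("walking", "walk")))
```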