Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen 2019
DOI: 10.18653/v1/d19-1090

Don’t Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction

Abstract: Human translators routinely have to translate rare inflections of words, due to the Zipfian distribution of words in a language. When translating from Spanish, a good translator would have no problem identifying the proper translation of a statistically rare inflection such as habláramos. Note the lexeme itself, hablar, is relatively common. In this work, we investigate whether state-of-the-art bilingual lexicon inducers are capable of learning this kind of generalization. We introduce 40 morphologically comple…

citations
Cited by 32 publications
(46 citation statements)
references
References 11 publications
“…Glavaš et al (2019), for instance, highlighted Anglocentricity as an issue, creating and evaluating on 28 dictionaries between 8 languages (Croatian, English, Finnish, French, German, Italian, Russian, Turkish) based on Google Translate. In addition, Czarnowska et al (2019) focused on the morphology dimension, creating morphologically complete dictionaries for 2 sets of 5 genetically related languages (Romance: French, Spanish, Italian, Portuguese, Catalan; and Slavic: Polish, Czech, Slovak, Russian, Ukrainian). In contrast to these two (very valuable!)…”
Section: New LI Evaluation Dictionaries (mentioning)
confidence: 99%
“…This over-emphasis on frequent types, when carried forward into downstream generation tasks, may lead to the failure mode described by Holtzman et al (2020) in which generated text is "dull and repetitive." This phenomenon is not limited to words alone; morphologically-rich languages (MRLs) exhibit a similar Zipfian distributional pattern in terms of the occurrence of different morphological phenomena, which in turn affects the performance of systems designed to process such features of language (Czarnowska et al, 2019;Tsarfaty et al, 2020). We believe that this behavior can be explained through the lens of the bias-variance tradeoff common to all statistical learning problems.…”
Section: Lexical Frequency and Diversity (mentioning)
confidence: 98%
“…Training and test dictionaries Standard BLI test dictionaries over-emphasise frequent words (Czarnowska et al, 2019; logical parameters of a language. 6 We ran experiments and observed similar results with word2vec algorithms (Mikolov et al, 2013b), GloVe (Pennington et al, 2014) and fastText CBOW (Grave et al, 2018).…”
Section: Mapping Algorithm (mentioning)
confidence: 99%