The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation

Sälevä, Jonne; Lignos, Constantine

doi:10.18653/v1/2021.eacl-srw.22

Cited by 7 publications

(4 citation statements)

References 15 publications


“…Unsupervised morphological segmentation has not shown consistent improvements (Zhou, 2018; Saleva and Lignos, 2021; Domingo et al, 2023) Koehn and Knight, 2003; Riedl and Biemann, 2016; Tuggener, 2016) since most prior datasets do not contain any (Henrich and Hinrichs, 2011; van Zaanen et al, 2014). It is not trivial to obtain negative examples from Wiktionary since a large amount of compound words are not categorized as such, leading to many false negatives.…”

Section: Related Work (mentioning)

confidence: 99%

CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

Minixhofer, Pfeiffer, Vulić

2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing


While many languages possess processes of joining two or more words to create compound words, previous studies have been typically limited only to languages with excessively productive compound formation (e.g., German, Dutch) and there is no public dataset containing compound and non-compound words across a large number of languages. In this work, we systematically study decompounding, the task of splitting compound words into their constituents, at a wide scale. We first address the data gap by introducing a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We then use this dataset to evaluate an array of Large Language Models (LLMs) on the decompounding task. We find that LLMs perform poorly, especially on words which are tokenized unfavorably by subword tokenization. We thus introduce a novel methodology to train dedicated models for decompounding. The proposed two-stage procedure relies on a fully self-supervised objective in the first stage, while the second, supervised learning stage optionally fine-tunes the model on the annotated Wiktionary data. Our self-supervised models outperform the prior best unsupervised decompounding models by 13.9% accuracy on average. Our fine-tuned models outperform all prior (language-specific) decompounding tools. Furthermore, we use our models to leverage decompounding during the creation of a subword tokenizer, which we refer to as CompoundPiece. CompoundPiece tokenizes compound words more favorably on average, leading to improved performance on decompounding over an otherwise equivalent model using SentencePiece tokenization.



“…Several researchers have investigated the effect of applying different morphological and agnostic segmentation approaches on the MT performance for monolingual languages. Roest et al (2020); Saleva and Lignos (2021) show that unsupervised morphology-based segmentation like Linguistically Motivated Vocabulary Reduction (LMVR) (Ataman et al, 2017), Morfessor, and FlatCat for Nepali-English, Sinhala-English, Kazakh-English, and Inuktitut-English language pairs show either no improvement or no significant improvement over the agnostic BPE segmentation (Sennrich et al, 2016) in translation tasks. Meanwhile, Mager et al (2022) and Ataman et al (2017) show that for polysynthetic and highly agglutinative languages, unsupervised morphology-based segmentation outperforms BPEs (Sennrich et al, 2016) in MT tasks in both directions.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Gaser, Mager, Hamed et al.

2023

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics


Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.


“…Using unsupervisedly obtained "morphological" subwords on the other hand, only Ataman and Federico (2018b) find that a model based on Morfessor FlatCat can outperform BPE; Zhou (2018), , Macháček et al (2018), and Saleva and Lignos (2021) find no reliable improvement over BPE for translation. Banerjee and Bhattacharyya (2018) analyze translations obtained segmenting with Morfessor and BPE, and conclude that a possible improvement depends on the similarity of the languages.…”

Section: Comparing Morphological Segmentation to BPE and Friends (mentioning)

confidence: 99%

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Mielke, Alyafeai, Salesky et al.

2021

Preprint


What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level model or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated. We conclude that there is and likely will never be a silver bullet singular solution for all applications and that thinking seriously about tokenization remains important for many applications.

