Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.60
How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Abstract: Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained on subword- and character-level can translate these morphological phenomena. We …
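The core problem the abstract describes can be sketched in a few lines. The toy below is not from the paper: the greedy longest-match segmenter is a stand-in for BPE-style tokenizers, and the subword vocabulary and transliterated Arabic forms of the root k-t-b are illustrative assumptions. It shows how a concatenative subword inventory splits templatic (non-concatenative) word forms inconsistently, so the shared root never surfaces as a unit, whereas a character-level view at least exposes all root consonants.

```python
# Toy illustration (not the paper's method): concatenative subword
# segmentation vs. non-concatenative (templatic) morphology.

def greedy_subword(word, vocab):
    """Greedy longest-match segmentation, a stand-in for BPE-style tokenizers."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Fall back to a single character when no subword matches.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# Assumed toy subword inventory; real inventories are learned from data.
vocab = {"kat", "ab", "kit", "aab", "mak", "tab"}

# Three forms sharing the Arabic root k-t-b (write): kataba "he wrote",
# kitaab "book", maktab "office" -- the root is interleaved with templates.
for w in ["kataba", "kitaab", "maktab"]:
    print(w, "->", greedy_subword(w, vocab), "| chars:", list(w))
# kataba -> kat|ab|a, kitaab -> kit|aab, maktab -> mak|tab:
# no segmentation isolates the shared root k-t-b.
```

A character-level model sidesteps this particular mismatch at the cost of longer sequences, which is the trade-off the paper's test suite probes.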

Cited by 6 publications (5 citation statements)
References 39 publications
“…Our study also relates to computational work on derivational morphology (Cotterell and Schütze, 2018; Deutsch et al., 2018; Hofmann et al., 2020a,b,c) and word segmentation (Kann et al., 2016; Ruzsics and Samardžić, 2017; Mager et al., 2019, 2020; Seker and Tsarfaty, 2020; Amrhein and Sennrich, 2021). We are the first to systematically evaluate the segmentations of PLM tokenizers on human-annotated gold data.…”
Section: Related Work
confidence: 92%
“…There is a growing body of work showing how statistical word segmentation methods adversely affect the performance of pretrained language models when dealing with morphologically rich languages such as Arabic, Hebrew, and Turkish (Amrhein and Sennrich, 2021; Keren et al., 2022). Instead of being completely data-driven, these studies advocate for subword tokenization techniques to be linguistically motivated such that the subwords adhere to morpheme boundaries.…”
Section: Related Work
confidence: 99%
“…Reducing the number of OOV terms is particularly important for the latter case, as downstream tasks such as classification would incur too high a loss of information if they were simply removed. However, it has been argued that languages such as these, as well as agglutinative languages, may be better served by character-level models or small subword inventories [27,28], even though subword segmentation has reasonable motivation [8]. Cases like these reinforce the notion that there is no single best solution for language segmentation.…”
Section: Preprocessing
confidence: 99%