The University of Helsinki Submissions to the WMT19 Similar Language Translation Task

Scherrer, Yves; Vázquez, Raúl; Virpioja, Sámi

doi:10.18653/v1/w19-5432

Cited by 3 publications

(3 citation statements)

References 34 publications

“…• We voluntarily restrict our dataset to "clean" corpora, i.e., interviews transcribed and normalized by trained experts. This contrasts with other data collections specifically aimed at extracting dialectal content from social media (e.g., Ueberwasser and Stark, 2017; Mubarak, 2018; Barnes et al., 2021; Kuparinen, 2023). Such datasets compound the features and challenges of both dialect-to-standard normalization and UGC normalization.…”

Section: Limitations (mentioning)

confidence: 96%

Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation

Kuparinen, Miletić, Scherrer

2023

Findings of the Association for Computational Linguistics: EMNLP 2023

Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization, i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety, as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained ByT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.

“…This system, however, makes many suboptimal design choices and ended up as the last one in the manual evaluation. Scherrer et al (2019) experimented with character-level systems for similar language translation and observed that characters outperform other segmentations for Spanish-Portuguese translation, but not for Czech-Polish. Knowles et al (2020) experimented with different subword vocabulary sizes for English-Inuktitut translation and reached the best results using a subword vocabulary of size 1k, which makes it close to the character level.…”

Section: WMT Submissions (mentioning)

confidence: 99%

Why don’t people use character-level machine translation?

Libovický, Schmid, Fraser

2022

Findings of the Association for Computational Linguistics: ACL 2022

We present a literature and empirical survey that critically assesses the state of the art in character-level modeling for machine translation (MT). Despite evidence in the literature that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in WMT competitions. We empirically show that even with recent modeling innovations in character-level natural language processing, character-level MT systems still struggle to match their subword-based counterparts. Character-level MT systems show neither better domain robustness, nor better morphological generalization, despite being often so motivated. However, we are able to show robustness towards source side noise and that translation quality does not degrade with increasing beam size at decoding time.

“…This system however makes many unusual and suboptimal design choices and ended up as the last one in the manual evaluation. Scherrer et al (2019) experimented with character-level systems for similar language translation and observed that characters outperform other segmentations for Spanish-Portuguese translation, but not Czech-Polish. Knowles et al (2020) experimented with different subword vocabulary sizes for English-Inuktitut translation and reached the best results using a subword vocabulary of size 1k, which makes it close to the character level.…”

Section: WMT Submissions (mentioning)

confidence: 99%

Why don't people use character-level machine translation?

Libovický, Schmid, Fraser

2021

Preprint

We present a literature and empirical survey that critically assesses the state of the art in character-level modeling for machine translation (MT). Despite evidence in the literature that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in WMT competitions. We empirically show that even with recent modeling innovations in character-level natural language processing, character-level MT systems still struggle to match their subword-based counterparts both in terms of translation quality and training and inference speed. Character-level MT systems show neither better domain robustness, nor better morphological generalization, despite being often so motivated. On the other hand, they tend to be more robust towards source side noise and the translation quality does not degrade with increasing beam size at decoding time.
