Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2) 2019
DOI: 10.18653/v1/w19-5432

The University of Helsinki Submissions to the WMT19 Similar Language Translation Task

Abstract: This paper describes the University of Helsinki Language Technology group's participation in the WMT 2019 similar language translation task. We trained neural machine translation models for the language pairs Czech ↔ Polish and Spanish ↔ Portuguese. Our experiments focused on different subword segmentation methods, and in particular on the comparison of a cognate-aware segmentation method, Cognate Morfessor, with character segmentation and unsupervised segmentation methods for which the data from different lan…

Cited by 3 publications
(3 citation statements)
References 34 publications
“…• We voluntarily restrict our dataset to "clean" corpora, i.e., interviews transcribed and normalized by trained experts. This contrasts with other data collections specifically aimed at extracting dialectal content from social media (e.g., Ueberwasser and Stark, 2017;Mubarak, 2018;Barnes et al, 2021;Kuparinen, 2023). Such datasets compound the features and challenges of both dialect-to-standard normalization and UGC normalization.…”
Section: Limitations (mentioning)
confidence: 96%
“…This system, however, makes many suboptimal design choices and ended up as the last one in the manual evaluation. Scherrer et al (2019) experimented with character-level systems for similar language translation and observed that characters outperform other segmentations for Spanish-Portuguese translation, but not for Czech-Polish. Knowles et al (2020) experimented with different subword vocabulary sizes for English-Inuktikut translation and reached the best results using a subword vocabulary of size 1k, which makes it close to the character level.…”
Section: WMT Submissions (mentioning)
confidence: 99%
“…This system however makes many unusual and suboptimal design choices and ended up as the last one in the manual evaluation. Scherrer et al (2019) experimented with character-level systems for similar language translation and observed that characters outperform other segmentations for Spanish-Portuguese translation, but not Czech-Polish. Knowles et al (2020) experimented with different subword vocabulary sizes for English-Inuktikut translation and reached the best results used a subword vocabulary of size 1k, which makes it close to the character level.…”
Section: WMT Submissions (mentioning)
confidence: 99%