Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6425

The University of Helsinki submissions to the WMT18 news task

Abstract: This paper describes the University of Helsinki's submissions to the WMT18 shared news translation task for English-Finnish and English-Estonian, in both directions. This year, our main submissions employ a novel neural architecture, the Transformer, using the open-source OpenNMT framework. Our experiments couple domain labeling and fine-tuned multilingual models with shared vocabularies between the source and target language, using the provided parallel data of the shared task and additional back-translations…
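The domain labeling mentioned in the abstract is commonly implemented by prepending a pseudo-token to each source sentence. A minimal Python sketch of that idea, assuming plain-text corpus files; the file names and tag strings below are placeholders, not taken from the paper:

```python
# Prepend a domain pseudo-token (e.g. "<europarl>") to every source
# sentence so the model can condition on the domain during training
# and at inference time.

def label_domain(in_path: str, out_path: str, domain_tag: str) -> None:
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(f"{domain_tag} {line}")

if __name__ == "__main__":
    # Hypothetical corpora, one tag per training domain.
    label_domain("europarl.en", "europarl.labeled.en", "<europarl>")
    label_domain("paracrawl.en", "paracrawl.labeled.en", "<paracrawl>")
```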

Cited by 3 publications (8 citation statements). References 11 publications.

“…HY-AH (Raganato et al., 2018; Hurskainen and Tiedemann, 2017) is a rule-based machine translation system, relying on a rule-based dependency parser for English, a hand-crafted translation lexicon (based on dictionary data extracted from parallel corpora by word alignment), various types of transfer rules, and a morphological generator for Finnish.…”
Section: Gtcom
Citation type: mentioning (confidence: 99%)
“…HY-NMT (Raganato et al., 2018) submissions are based on the Transformer "base" model, trained with all the parallel data provided by the shared task plus back-translations, with a shared vocabulary between source and target language and a domain label for each source sentence. For the multilingual sub-track, synthetic data for English→Estonian and Estonian→English was also used.…”
Section: Gtcom
Citation type: mentioning (confidence: 99%)
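The back-translation setup described in this statement pairs authentic target-language text with machine-translated synthetic source sentences. A rough Python sketch of the data-assembly step; translate_to_source is a hypothetical stand-in for the actual reverse-direction decoder, and the file paths are placeholders:

```python
# Build a synthetic parallel corpus from target-language monolingual data:
# the source side is machine-translated (back-translated), the target side
# is the original human-written text.

from typing import Callable

def build_backtranslated_corpus(mono_tgt_path: str,
                                synth_src_path: str,
                                tgt_out_path: str,
                                translate_to_source: Callable[[str], str]) -> None:
    with open(mono_tgt_path, encoding="utf-8") as mono, \
         open(synth_src_path, "w", encoding="utf-8") as src_out, \
         open(tgt_out_path, "w", encoding="utf-8") as tgt_out:
        for sentence in mono:
            sentence = sentence.strip()
            src_out.write(translate_to_source(sentence) + "\n")  # synthetic source
            tgt_out.write(sentence + "\n")                       # authentic target
```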
“…For English and Finnish respectively, we trained truecaser models on all English and Finnish monolingual data as well as the English and Finnish sides of all parallel English-Finnish datasets. As we had found to be optimal in our previous year's submission (Raganato et al., 2018), we trained a BPE model with a vocabulary of 37,000 symbols, trained jointly on the parallel data only. Furthermore, for some experiments, we also used domain labeling.…”
Section: Pre-processing
Citation type: mentioning (confidence: 99%)
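A joint BPE model of this kind can be reproduced with the subword-nmt package by learning merges on the concatenated source and target sides of the parallel data. A minimal sketch, assuming truecased plain-text training files (the file names are placeholders; the 37,000-symbol figure follows the statement above):

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Concatenate both sides of the (truecased) parallel corpus so that
# source and target share a single merge table, i.e. a shared vocabulary.
with open("joint.tmp", "w", encoding="utf-8") as joint:
    for path in ("train.tc.en", "train.tc.fi"):
        with open(path, encoding="utf-8") as side:
            joint.writelines(side)

# Learn 37,000 BPE merge operations on the joint parallel data only.
with open("joint.tmp", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, num_symbols=37000)

# Apply the learned segmentation to a sentence.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("the transformer outperforms the baseline"))
```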
“…We trained the segmentation models only on the available parallel datasets for each language pair, following the findings of our submission to the WMT18 translation task (Raganato et al., 2018). We specified a vocabulary size of 5,000 tokens for each language and took advantage of the tokenizer integrated in the SentencePiece implementation (Kudo and Richardson, 2018) by training the models on non-tokenized data.…”
Section: SentencePiece Unigram Models
Citation type: mentioning (confidence: 99%)
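Training such per-language unigram models is straightforward with the SentencePiece Python API. A minimal sketch, assuming raw, one-sentence-per-line training files (the file names are placeholders; the 5,000-token vocabulary follows the statement above):

```python
import sentencepiece as spm

# Train one 5,000-token unigram model per language directly on
# non-tokenized text; SentencePiece handles whitespace internally.
for lang in ("en", "fi"):
    spm.SentencePieceTrainer.train(
        input=f"train.raw.{lang}",
        model_prefix=f"unigram.{lang}",
        vocab_size=5000,
        model_type="unigram",
    )

# Segment a sentence with the trained English model.
sp = spm.SentencePieceProcessor(model_file="unigram.en.model")
print(sp.encode("Helsinki is lovely in June.", out_type=str))
```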