Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6484
The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task

Abstract: This paper describes the submission of the Institute for Language and Speech Processing/Athena Research and Innovation Center (ILSP/ARC) to the WMT 2018 Parallel Corpus Filtering shared task. We explore several properties of sentences and sentence pairs that our system examined in the context of the task, with the purpose of clustering sentence pairs according to their appropriateness for training MT systems. We also discuss alternative methods for ranking the sentence pairs of the most appropriate clusters wit…

Cited by 6 publications (5 citation statements) | References 5 publications
“…The ParaCrawl project successfully created a large-scale parallel corpus between English and other European languages by extensively crawling the web (Bañón et al., 2020). Typical bitext-mining projects, including ParaCrawl, took the following steps to identify parallel sentences from the web (Resnik and Smith, 2003): (1) find multilingual websites, which may contain parallel sentences, from the web (Papavassiliou et al., 2018; Bañón et al., 2020); (2) find parallel documents from websites (Thompson and Koehn, 2020; El-Kishky and Guzmán, 2020); (3) extract parallel sentences from parallel web URLs (Thompson and Koehn, 2019; Chousa et al., 2020). Our work focuses on the first step: finding bilingual target-domain web URLs.…”
Section: Collecting Parallel Sentences From the Web (mentioning; confidence: 99%)
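As a rough illustration of step (1), the sketch below flags a website as a multilingual candidate when its internal links pair up across language-coded URL paths (e.g. /en/... vs /de/...). This is a minimal heuristic assumed for illustration, not the method of Papavassiliou et al. (2018) or Bañón et al. (2020); the function name, language list, and threshold are invented.

```python
import re
from urllib.parse import urlparse

# Hypothetical heuristic: a site is a multilingual-website candidate if its
# internal links contain paths that differ only in a language-code segment,
# e.g. /en/about vs /de/about.
LANG_SEGMENT = re.compile(r"/(en|de|fr|el|es|it)(/|$)", re.IGNORECASE)

def is_multilingual_candidate(links: list[str]) -> bool:
    """Return True if the link set pairs up across two or more languages."""
    skeletons = {}  # path with language segment wildcarded -> languages seen
    for link in links:
        path = urlparse(link).path
        m = LANG_SEGMENT.search(path)
        if not m:
            continue
        lang = m.group(1).lower()
        skeleton = LANG_SEGMENT.sub("/*/", path, count=1)
        skeletons.setdefault(skeleton, set()).add(lang)
    # Candidate if any page skeleton is served in at least two languages.
    return any(len(langs) >= 2 for langs in skeletons.values())

links = [
    "https://example.org/en/products",
    "https://example.org/de/products",
    "https://example.org/en/contact",
]
print(is_multilingual_candidate(links))  # True: /en/products pairs with /de/products
```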
“…Bitext Quality Improvement The most standardized approach to improving bitext either discards an example or treats it as a perfect training instance. Past submissions to the Parallel Corpus Filtering WMT shared task employ a diverse set of approaches covering simple pre-filtering rules based on language identifiers and sentence features (Rossenbach et al., 2018; Lu et al., 2018; Ash et al., 2018), learning to weight scoring functions based on language models, extracting features from neural translation models and lexical translation probabilities (Sánchez-Cartagena et al., 2018), combining pre-trained embeddings (Papavassiliou et al., 2018), and dual cross-entropy. In contrast to prior work, and similar to ours, Briakou and Carpuat (2022) propose to revise imperfect translations in bitext by selectively replacing them with synthetic translations generated by NMT of sufficient quality.…”
Section: Issues In Bitext Quality (mentioning; confidence: 99%)
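A minimal sketch of the "simple pre-filtering rules" family mentioned above, assuming the off-the-shelf langid package for language identification; the keep_pair helper and its thresholds are illustrative choices, not taken from any cited submission.

```python
import langid  # pip install langid; assumed here for language identification

def keep_pair(src: str, tgt: str,
              src_lang: str = "en", tgt_lang: str = "de",
              max_len: int = 100, max_ratio: float = 2.0) -> bool:
    """Toy pre-filtering rules combining sentence features and language IDs."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Sentence-feature rules: drop empty, overlong, or length-mismatched pairs.
    if not src_toks or not tgt_toks:
        return False
    if len(src_toks) > max_len or len(tgt_toks) > max_len:
        return False
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > max_ratio or ratio < 1.0 / max_ratio:
        return False
    # Language-identifier rule: both sides must match the expected languages.
    if langid.classify(src)[0] != src_lang or langid.classify(tgt)[0] != tgt_lang:
        return False
    return True

print(keep_pair("A short English sentence .", "Ein kurzer deutscher Satz ."))
```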
“…The most standardized approach to improving bitext either discards an example or treats it as a perfect training instance. Past submissions to the "Parallel Corpus Filtering" WMT shared task employ a diverse set of approaches covering simple pre-filtering rules based on language identifiers and sentence features (Rossenbach et al., 2018; Lu et al., 2018; Ash et al., 2018), learning to weight scoring functions based on language models, extracting features from neural translation models and lexical translation probabilities (Sánchez-Cartagena et al., 2018), combining pre-trained embeddings (Papavassiliou et al., 2018), and dual cross-entropy. Our work builds on prior work: instead of filtering out all the imperfect translation candidates, we selectively edit them and keep them in the pool of training data for NMT.…”
Section: Bitext Filtering (mentioning; confidence: 99%)
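For reference, the "dual cross-entropy" mentioned in both quotes (whose citation was stripped in extraction) is commonly associated with dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018), a reference supplied here rather than taken from the quoted text. A sketch of the scoring rule under that reading:

```python
import math

def dual_xent_score(h_fwd: float, h_bwd: float) -> float:
    """Dual conditional cross-entropy score for a sentence pair, following the
    formulation usually attributed to Junczys-Dowmunt (2018). h_fwd and h_bwd
    are word-normalized cross-entropies of the pair under a forward (src->tgt)
    and a backward (tgt->src) translation model; computing them is left to the
    caller. Higher returned scores indicate cleaner pairs."""
    # Penalize disagreement between the two models plus their average entropy.
    penalty = abs(h_fwd - h_bwd) + 0.5 * (h_fwd + h_bwd)
    return math.exp(-penalty)  # maps to (0, 1], where 1.0 is best

print(dual_xent_score(2.1, 2.3))  # exp(-2.4) ~ 0.09
```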