Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6488

Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task

Abstract: This paper describes Prompsit Language Engineering's submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws was applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-insp…

Cited by 29 publications
(11 citation statements)
References 11 publications
“…In the case of EN-FR, we observed that removing alignments from a web-scraped corpus did not result in an increase in NMT performance, on the contrary, the decrease in data size outweighed the increased alignment quality. This result differs from previous work on misalignment detection and data cleaning in an NMT context, e.g., [4,11,14,20]. However, we noted that the web-scraped corpus EN-FR used for extrinsic evaluation was much cleaner in terms of misalignments and other noise than the corpora used in previous work, such as the OpenSubtitles and ParaCrawl corpus: the amount of misalignments and degree of misalignment (in terms of MAD score) present in the web-scraped corpus was probably too low to harm NMT performance, as is clear from the amount of data labeled as aligned by MAD (76% of the sentence pairs).…”
Section: Discussion
Type: contrasting
confidence: 93%
“…Finally, sentences are scored based on fluency and diversity. More details are provided in Reference [20].…”
Section: Bicleaner
Type: mentioning
confidence: 99%
“…One way to perform filtering is to only keep sentences with a better per-word cross-entropy than a certain threshold. Another way is to use Bicleaner, an off-the-shelf tool which scores sentence similarity at sentence pair level (Sánchez-Cartagena et al, 2018). Filtering is optional for post-expansion pruning.…”
Section: Filtering
Type: mentioning
confidence: 99%
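The threshold-based filtering described in the statement above can be sketched as follows. This is a minimal illustration, not the method of the cited paper or of Bicleaner: the add-one smoothed unigram model is a stand-in for whatever language model is actually used, and the function names (`train_unigram`, `per_word_cross_entropy`, `filter_by_cross_entropy`) and the threshold value are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Build a log2-probability function from an add-one smoothed unigram model.

    Assumption: a unigram model stands in for a real language model.
    """
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log2((counts[token] + 1) / (total + vocab))

    return logprob


def per_word_cross_entropy(sentence, logprob):
    """Average negative log2-probability per whitespace token (lower is better)."""
    tokens = sentence.split()
    if not tokens:
        return float("inf")  # empty sentences are never kept
    return -sum(logprob(t) for t in tokens) / len(tokens)


def filter_by_cross_entropy(sentences, logprob, threshold):
    """Keep only sentences whose per-word cross-entropy is below the threshold."""
    return [s for s in sentences
            if per_word_cross_entropy(s, logprob) < threshold]


# Hypothetical usage: the in-domain corpus and threshold are made up.
lp = train_unigram(["the cat sat", "the dog sat"])
kept = filter_by_cross_entropy(["the sat", "zzz qqq", ""], lp, threshold=2.5)
```

In practice the threshold would be tuned on held-out data, and "better" here is taken to mean a lower cross-entropy under the scoring model.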
“…When processing bilingual corpora, any meaning mismatches between the two languages are primarily viewed as noise for the downstream task. In shared tasks for filtering web-crawled parallel corpora (Koehn et al, 2018), the best performing systems rely on translation models, or cross-lingual sentence embeddings to place bilingual sentences on a clean to noisy scale (Junczys-Dowmunt, 2018; Sánchez-Cartagena et al, 2018; Lu et al, 2018). When mining parallel segments in Wikipedia for the Wiki-Matrix corpus, examples are ranked using the LASER score (Artetxe and Schwenk, 2019), which computes cross-lingual similarity in a language-agnostic sentence embedding space.…”
Section: Introduction
Type: mentioning
confidence: 99%