Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1319
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora

Abstract: We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to classify good data and synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score imp…
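The abstract describes training a logistic regression classifier to separate good parallel data from synthetic noise, then scoring a data pool with the trained model. A minimal sketch of that workflow, assuming invented two-dimensional features (the paper's actual bag-of-words translation features are not reproduced here):

```python
# Hedged sketch of Zipporah-style selection: fit logistic regression to
# separate clean sentence pairs from synthetic noise, then score a pool.
# All feature values below are illustrative toy numbers, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy per-pair features, e.g. [adequacy_score, fluency_score] (hypothetical).
clean = np.array([[0.90, 0.80], [0.80, 0.90], [0.85, 0.95]])
noise = np.array([[0.10, 0.70], [0.20, 0.10], [0.15, 0.30]])  # e.g. shuffled targets

X = np.vstack([clean, noise])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = clean, 0 = synthetic noise

model = LogisticRegression().fit(X, y)

# Score unseen pairs: higher probability of the "clean" class -> select.
pool = np.array([[0.88, 0.90], [0.12, 0.20]])
scores = model.predict_proba(pool)[:, 1]
keep = pool[scores > 0.5]
```

The key idea is that the classifier never needs hand-labeled noise: negative examples are synthesized (e.g. by mismatching source and target sides), so the approach scales to arbitrarily large crawled pools.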

Cited by 61 publications (64 citation statements). References 2 publications.
“…Zipporah (Xu and Koehn, 2017), which is often used as a baseline comparison, uses language model and word translation scores, with weights optimized to separate clean and synthetic noise data. In our setup, we trained Zipporah models for both language pairs Sinhala-English and Nepali-English.…”
Section: Other Similarity Methods
confidence: 99%
“…Data For additional unlabeled-domain data, we use web-crawled bitext from the Paracrawl project. We filter the data using the Zipporah cleaning tool (Xu and Koehn, 2017), with a threshold score of 1. After filtering, we have around 13.6 million Paracrawl sentences available for German-English and 3.7 million Paracrawl sentences available for Russian-English.…”
Section: Unlabeled-domain
confidence: 99%
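The citing work above applies Zipporah as a simple threshold filter: pairs scoring above 1 are kept. A minimal sketch of that final selection step, with invented sentence pairs and scores standing in for real Zipporah output:

```python
# Threshold filtering as described above: keep sentence pairs whose
# Zipporah-style score exceeds a cutoff. Pairs and scores are illustrative.
scored_pairs = [
    ("guten Tag",   "good day",    2.3),   # fluent, adequate -> high score
    ("xyz abc qq",  "good day",   -0.5),   # misaligned noise -> low score
    ("wie geht's",  "how are you", 1.4),
]

THRESHOLD = 1.0
kept = [(src, tgt) for src, tgt, score in scored_pairs if score > THRESHOLD]
```

The threshold trades corpus size against cleanliness: raising it shrinks the selected corpus but raises its average quality, which is why the citing papers report the post-filtering sentence counts alongside the cutoff.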
“…• Source and target language model scores • Hunalign scores (Varga et al., 2007) • Zipporah adequacy scores (Xu and Koehn, 2017), using a probabilistic bilingual dictionary computed on Europarl.…”
Section: Data Selection
confidence: 99%
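This last citing work combines several per-pair signals (source/target language model scores, Hunalign alignment scores, Zipporah adequacy scores) for data selection. One common way to fold such features into a single selection score is a weighted sum; the helper below is a hypothetical sketch with invented weights, not the cited papers' actual combination:

```python
# Hypothetical combination of the per-pair features listed above into one
# selection score via a weighted sum. Weights are invented for illustration.
def selection_score(src_lm, tgt_lm, align, adequacy,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Return a scalar score for one sentence pair from its feature scores."""
    features = (src_lm, tgt_lm, align, adequacy)
    return sum(w * f for w, f in zip(weights, features))

score = selection_score(0.80, 0.70, 0.90, 0.85)
```

In practice such weights would be tuned (as Zipporah does, by optimizing them to separate clean from synthetic-noise data) rather than fixed by hand.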