Proceedings of the Third Conference on Machine Translation: Research Papers 2018
DOI: 10.18653/v1/w18-6317

Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

Abstract: This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but that have some degree of semantic similarity. The quality of the resulting embeddings is evaluated on parallel corpus reconstructi…
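The abstract describes training with hard negatives: non-translation sentences with some semantic similarity, scored alongside the true translation. The sketch below is a minimal illustration of that idea, not the paper's actual model or loss: it computes a softmax cross-entropy over in-batch targets plus per-source hard negatives, with the true translation as the target class. The function name, the optional margin, and all shapes are assumptions for illustration.

```python
import numpy as np

def translation_ranking_loss(src, tgt, hard_neg, margin=0.0):
    """Illustrative ranking loss with hard negatives (not the paper's exact loss).

    src:      (B, D) source sentence embeddings
    tgt:      (B, D) embeddings of their true translations
    hard_neg: (B, K, D) K hard negatives per source sentence
    """
    B = src.shape[0]
    # Scores of each source against every in-batch target: (B, B).
    batch_scores = src @ tgt.T
    # Scores of each source against its own hard negatives: (B, K).
    hard_scores = np.einsum('bd,bkd->bk', src, hard_neg)
    logits = np.concatenate([batch_scores, hard_scores], axis=1)
    # Optionally subtract a margin from the positive (diagonal) logit.
    logits[np.arange(B), np.arange(B)] -= margin
    # Softmax cross-entropy with the true translation as target class.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(B), np.arange(B)].mean()
```

When source and target embeddings of true pairs align well, the diagonal logits dominate and the loss is small; the hard negatives add extra, harder-to-separate classes beyond the random in-batch ones.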

Cited by 90 publications (131 citation statements)
References 22 publications
“…On a test set mined from the web, all models achieve strong retrieval performance, the best being 91.4% P@1 for en-fr and 81.8% for en-es from the hierarchical document models. On the United Nations (UN) document mining task (Ziemski et al, 2016), our best model achieves 96.7% P@1 for en-fr and 97.3% P@1 for en-es, a 3%+ absolute improvement over the prior state-of-the-art (Guo et al, 2018;Uszkoreit et al, 2010). We also evaluate on a noisier version of the UN task where we do not have the ground truth sentence alignments from the original corpus.…”
Section: Introduction
confidence: 94%
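The P@1 numbers quoted above measure how often the nearest retrieved target is the true translation. A minimal sketch of that evaluation, assuming paired source/target embedding matrices and cosine-similarity nearest-neighbor retrieval (the function name and shapes are illustrative, not from the cited work):

```python
import numpy as np

def precision_at_1(src_emb, tgt_emb):
    """P@1 for mining: src_emb[i] and tgt_emb[i] embed a known
    translation pair; retrieval picks the most cosine-similar
    target for each source."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                     # (N, N) cosine scores
    nearest = sim.argmax(axis=1)          # best target per source
    return (nearest == np.arange(len(src))).mean()
```

A score of 96.7% P@1 then means the true translation is the single top-ranked candidate for 96.7% of source sentences.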
“…Previous work on parallel document mining using large distributed systems has proven effective (Uszkoreit et al, 2010; Antonova and Misyurev, 2011), but these systems are often heavily engineered and computationally intensive. Recent work on parallel data mining has focused on sentence-level embeddings (Guo et al, 2018; Artetxe and Schwenk, 2018; Yang et al, 2019). However, these sentence embedding methods have had limited success when applied to document-level mining tasks (Guo et al, 2018).…”
Section: Introduction
confidence: 99%
“…Baselines. We compare to several models from prior work (Guo et al, 2018;Chidambaram et al, 2018). A fair comparison to other models is difficult due to different training setups.…”
Section: Monolingual and Cross-lingual Similarity
confidence: 99%
“…Most previous work for cross-lingual representations has focused on models based on encoders from neural machine translation (Espana-Bonet et al, 2017;Schwenk and Douze, 2017;Schwenk, 2018) or deep architectures using a contrastive loss (Grégoire and Langlais, 2018;Guo et al, 2018;Chidambaram et al, 2018). However, the paraphrastic sentence embedding literature has observed that simple models such as pooling word embeddings generalize significantly better than complex architectures (Wieting et al, 2016b).…”
Section: Introduction
confidence: 99%
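The last citation contrasts deep encoder architectures with simple pooling of word embeddings, which the paraphrastic sentence embedding literature found to generalize surprisingly well. A minimal sketch of such a mean-pooled sentence embedding, using a hypothetical toy vocabulary (the function name and the unknown-token handling are illustrative assumptions):

```python
import numpy as np

def mean_pooled_embedding(tokens, word_vectors):
    """Sentence embedding as the average of known word vectors.

    tokens:       list of token strings
    word_vectors: dict mapping token -> np.ndarray of fixed dim
    Unknown tokens are skipped; returns a zero vector if none match.
    """
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```

Despite having no learned composition at all, this kind of pooled representation is the simple baseline the quoted passage says can outperform complex contrastive architectures on similarity tasks.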