Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

Guo, Mandy; Shen, Qinlan; Yang, Yinfei; Ge, Heming; Cer, Daniel; Ábrego, Gustavo Hernández; Stevens, Keith; Constant, Noah; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray

doi:10.18653/v1/w18-6317

Cited by 90 publications

(131 citation statements)

References 22 publications

Supporting

Mentioning

129

Contrasting


“…On a test set mined from the web, all models achieve strong retrieval performance, the best being 91.4% P@1 for en-fr and 81.8% for en-es from the hierarchical document models. On the United Nations (UN) document mining task (Ziemski et al, 2016), our best model achieves 96.7% P@1 for en-fr and 97.3% P@1 for en-es, a 3%+ absolute improvement over the prior state-of-the-art (Guo et al, 2018; Uszkoreit et al, 2010). We also evaluate on a noisier version of the UN task where we do not have the ground truth sentence alignments from the original corpus.…”

Section: Introduction (mentioning)

confidence: 94%

“…Previous work on parallel document mining using large distributed systems has proven effective (Uszkoreit et al, 2010; Antonova and Misyurev, 2011), but these systems are often heavily engineered and computationally intensive. Recent work on parallel data mining has focused on sentence-level embeddings (Guo et al, 2018; Artetxe and Schwenk, 2018; Yang et al, 2019). However, these sentence embedding methods have had limited success when applied to document-level mining tasks (Guo et al, 2018).…”

Section: Introduction (mentioning)

confidence: 99%

“…Recent work on parallel data mining has focused on sentence-level embeddings (Guo et al, 2018; Artetxe and Schwenk, 2018; Yang et al, 2019). However, these sentence embedding methods have had limited success when applied to document-level mining tasks (Guo et al, 2018). A recent study from Yang et al (2019) shows that document embeddings obtained from averaging sentence embeddings can achieve state-of-the-art performance in document retrieval on the United Nation (UN) corpus.…”

Section: Introduction (mentioning)

confidence: 99%


Hierarchical Document Encoder for Parallel Corpus Mining

Guo, Yang, Stevens et al. 2019

Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

Self Cite


We explore using multilingual document embeddings for nearest neighbor mining of parallel data. Three document-level representations are investigated: (i) document embeddings generated by simply averaging multilingual sentence embeddings; (ii) a neural bag-of-words (BoW) document encoding model; (iii) a hierarchical multilingual document encoder (HiDE) that builds on our sentence-level model. The results show document embeddings derived from sentence-level averaging are surprisingly effective for clean datasets, but suggest models trained hierarchically at the document-level are more effective on noisy data. Analysis experiments demonstrate our hierarchical models are very robust to variations in the underlying sentence embedding quality. Using document embeddings trained with HiDE achieves state-of-the-art performance on United Nations (UN) parallel document mining, 94.9% P@1 for en-fr and 97.3% P@1 for en-es.
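The first representation in the abstract above (averaging sentence embeddings) and the P@1 metric it reports can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product scoring over L2-normalized vectors, and the toy three-dimensional vectors are all assumptions made for illustration, and a real system would use embeddings from a trained multilingual sentence encoder.

```python
import math

def document_embedding(sentence_embeddings):
    """Average a document's sentence vectors, then L2-normalize the result."""
    dim = len(sentence_embeddings[0])
    count = len(sentence_embeddings)
    mean = [sum(vec[i] for vec in sentence_embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean

def nearest_neighbor(query, candidates):
    """Return the index of the candidate with the highest dot product."""
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

def precision_at_1(source_docs, target_docs):
    """Fraction of source documents whose top-ranked target is the true pair.

    Assumes source_docs[i] and target_docs[i] are translations of each other.
    """
    src = [document_embedding(doc) for doc in source_docs]
    tgt = [document_embedding(doc) for doc in target_docs]
    hits = sum(nearest_neighbor(s, tgt) == i for i, s in enumerate(src))
    return hits / len(src)

# Hypothetical toy data: each document is a list of sentence vectors.
source_docs = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
]
target_docs = [
    [[0.8, 0.2, 0.0]],
    [[0.1, 0.9, 0.0], [0.0, 1.0, 0.1]],
    [[0.0, 0.1, 0.9]],
]
print(precision_at_1(source_docs, target_docs))
```

Normalizing the averaged vector makes the dot product equal to cosine similarity, so documents with different sentence counts are compared on the same scale.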


“…Baselines. We compare to several models from prior work (Guo et al, 2018; Chidambaram et al, 2018). A fair comparison to other models is difficult due to different training setups.…”

Section: Monolingual and Cross-lingual Similarity (mentioning)

confidence: 99%

“…Most previous work for cross-lingual representations has focused on models based on encoders from neural machine translation (Espana-Bonet et al, 2017; Schwenk and Douze, 2017; Schwenk, 2018) or deep architectures using a contrastive loss (Grégoire and Langlais, 2018; Guo et al, 2018; Chidambaram et al, 2018). However, the paraphrastic sentence embedding literature has observed that simple models such as pooling word embeddings generalize significantly better than complex architectures (Wieting et al, 2016b).…”

Section: Introduction (mentioning)

confidence: 99%