Learning bilingual word embeddings with (almost) no bilingual data

Artetxe, Mikel; Labaka, Gorka; Agirre, Eneko

doi:10.18653/v1/p17-1042

Cited by 407 publications

(535 citation statements)

References 20 publications

Supporting

Mentioning

533

Contrasting


“…With these considerations in mind, one should wonder how stable the representation of names can be in an embedding space. This question has previously been raised by Artetxe et al (2017). We address it empirically below.…”

Section: Analysis of POS Composition (mentioning)

confidence: 90%

“…what is the ratio of True Positives to the sum of True Positives and False Positives. [Footnote 1: Available at https://github.com/coastalcph/MUSE_dicos] Data: All systems listed above report results on one or both of two test sets: the MUSE test sets (Conneau et al, 2018) and/or the Dinu test sets (Dinu et al, 2015; Artetxe et al, 2017). Similarly to MUSE, the Dinu dataset was compiled automatically (from Europarl word-alignments), but it only covers four languages.…”

Section: Introduction (mentioning)

confidence: 99%
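
The precision definition quoted in the statement above (true positives over the sum of true positives and false positives) can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the function name `precision_at_1`, the dictionary layout, and the toy word pairs are all assumptions made for the example.

```python
def precision_at_1(predictions, gold):
    """Precision of a predicted bilingual dictionary.

    predictions: dict mapping a source word to ONE predicted translation
                 (hypothetical layout, chosen for this sketch).
    gold:        dict mapping a source word to a set of accepted translations.
    """
    if not predictions:
        return 0.0
    # A prediction is a true positive if it appears among the gold targets.
    true_pos = sum(
        1 for src, tgt in predictions.items() if tgt in gold.get(src, set())
    )
    # Every other prediction counts as a false positive.
    false_pos = len(predictions) - true_pos
    return true_pos / (true_pos + false_pos)


# Toy data, invented for illustration only.
toy_gold = {"dog": {"perro"}, "cat": {"gato", "gata"}, "house": {"casa"}}
toy_predictions = {"dog": "perro", "cat": "gata", "house": "caso"}
toy_score = precision_at_1(toy_predictions, toy_gold)
```

Note that a correct prediction missing from `gold` is scored as a false positive here, which is the kind of gold-standard gap the citing paper warns about.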


Lost in Evaluation: Misleading Benchmarks for Bilingual Dictionary Induction

Kementchedjhieva

Hartmann

Søgaard

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen


The task of bilingual dictionary induction (BDI) is commonly used for intrinsic evaluation of cross-lingual word embeddings. The largest dataset for BDI was generated automatically, so its quality is dubious. We study the composition and quality of the test sets for five diverse languages from this dataset, with concerning findings: (1) a quarter of the data consists of proper nouns, which can be hardly indicative of BDI performance, and (2) there are pervasive gaps in the gold-standard targets. These issues appear to affect the ranking between cross-lingual embedding systems on individual languages, and the overall degree to which the systems differ in performance. With proper nouns removed from the data, the margin between the top two systems included in the study grows from 3.4% to 17.2%. Manual verification of the predictions, on the other hand, reveals that gaps in the gold standard targets artificially inflate the margin between the two systems on English to Bulgarian BDI from 0.1% to 6.7%. We thus suggest that future research either avoids drawing conclusions from quantitative results on this BDI dataset, or accompanies such evaluation with rigorous error analysis.


“…A common method of improving BLI is iteratively expanding the dictionary and refining the mapping matrix as a post-processing step (Artetxe et al, 2017; Lample et al, 2018). Given a learnt mapping matrix, Procrustes refinement first finds * W X denotes the set {W x :…”

Section: Iterative Procrustes Refinement and Hubness Mitigation (mentioning)

confidence: 99%
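
The loop described in the statement above (alternately refining the mapping matrix and expanding the dictionary) can be sketched roughly as below. This is a sketch under stated assumptions, not the implementation of Artetxe et al (2017) or of the citing paper: it assumes row-wise, length-normalized embeddings, solves the mapping step with orthogonal Procrustes via SVD, and re-induces the dictionary with plain nearest-neighbour retrieval. The function names `procrustes`, `induce_dictionary`, and `self_learning` are hypothetical.

```python
import numpy as np


def procrustes(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F (rows are word vectors)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def induce_dictionary(X, Y, W):
    """For every source row, pick the target row with the highest dot product."""
    sims = (X @ W) @ Y.T
    return np.arange(X.shape[0]), sims.argmax(axis=1)


def self_learning(X, Y, seed_src, seed_tgt, n_iter=5):
    """Alternate between fitting W on the current dictionary and
    re-inducing a (larger) dictionary from the mapped embeddings."""
    src, tgt = np.asarray(seed_src), np.asarray(seed_tgt)
    for _ in range(n_iter):
        W = procrustes(X[src], Y[tgt])           # refine the mapping
        src, tgt = induce_dictionary(X, Y, W)    # expand the dictionary
    return W, src, tgt


# Synthetic check: the target space is an exact rotation of the source
# space with shuffled rows, so a small seed should recover everything.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random orthogonal matrix
perm = rng.permutation(n)
Y = (X @ Q)[perm]
true_tgt = np.argsort(perm)                      # true_tgt[j] = row of Y matching X[j]
W, src, tgt = self_learning(X, Y, np.arange(10), true_tgt[:10])
```

Real embedding spaces are only approximately isometric, so the exact recovery seen on this synthetic data should not be expected in practice; hubness-aware retrieval would replace the plain `argmax` in that setting.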

Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces

Patra

Moniz

Garg

et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics


Recent work on bilingual lexicon induction (BLI) has frequently depended either on aligned bilingual lexicons or on distribution matching, often with an assumption about the isometry of the two spaces. We propose a technique to quantitatively estimate this assumption of the isometry between two embedding spaces and empirically show that this assumption weakens as the languages in question become increasingly etymologically distant. We then propose Bilingual Lexicon Induction with Semi-Supervision (BLISS), a semi-supervised approach that relaxes the isometric assumption while leveraging both limited aligned bilingual lexicons and a larger set of unaligned word embeddings, as well as a novel hubness filtering technique. Our proposed method obtains state-of-the-art results on 15 of 18 language pairs on the MUSE dataset, and does particularly well when the embedding spaces don't appear to be isometric. In addition, we also show that adding supervision stabilizes the learning procedure, and is effective even with minimal supervision.


“…The bilingual dictionaries used in the word embedding alignment contained several thousands of word pairs, and the recent study by Artetxe et al (2017) shows that the dictionary size we operate with should be large enough to reach a high level of alignment accuracy.…”

Section: Cross-lingual Dependency Parsing Model (mentioning)

confidence: 99%

Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations

Partanen

Lim

Rießler

et al. 2018

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages


This paper describes the test of a dependency parsing method which is based on bidirectional LSTM feature representations and multilingual word embedding, and evaluates the results on mono- and multilingual data. The results are similar in all cases, with slightly better results achieved using multilingual data. The languages under investigation are Komi-Zyrian and Russian. Examination of the results by relation type shows that some language-specific constructions are correctly recognized even when they appear in naturally occurring code-switching data. Tiivistelmä (Finnish abstract, translated): The study evaluates a dependency parsing method based on a bidirectional LSTM feature representation and a multilingual word embedding model, and assesses the results on mono- and multilingual data. The results are similar, but slightly higher on multilingual than on monolingual data. The languages studied are Komi-Zyrian and Russian. A more detailed analysis of the results by dependency relation shows that certain language-specific relations are recognized correctly even when they occur in natural sentences containing code-switching. This work is licensed under a Creative Commons Attribution 4.0 International Licence.

