Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015
DOI: 10.3115/v1/n15-1157
Simple task-specific bilingual word embeddings

Abstract: We introduce a simple wrapper method that uses off-the-shelf word embedding algorithms to learn task-specific bilingual word embeddings. We use a small dictionary of easily obtainable, task-specific word equivalence classes to produce mixed context-target pairs that we use to train off-the-shelf embedding models. Our model has the advantage that it (a) is independent of the choice of embedding algorithm, (b) does not require parallel data, and (c) can be adapted to specific tasks by re-defining the equivalence …
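A minimal sketch of the mixing idea the abstract describes, assuming gensim's Word2Vec as the off-the-shelf embedding algorithm; the toy dictionary, corpus, and the p_swap substitution rate are illustrative assumptions, not taken from the paper.

```python
import random
from gensim.models import Word2Vec

# Hypothetical task-specific dictionary of cross-lingual word
# equivalence classes (toy English-German pairs, not from the paper).
equivalence = {
    "house": ["haus"], "haus": ["house"],
    "cat": ["katze"], "katze": ["cat"],
}

def mix_corpus(sentences, equivalence, p_swap=0.5, seed=0):
    """Randomly substitute words with members of their equivalence
    class, yielding mixed context-target pairs across languages."""
    rng = random.Random(seed)
    return [
        [rng.choice(equivalence[w]) if w in equivalence and rng.random() < p_swap else w
         for w in sent]
        for sent in sentences
    ]

# Concatenated monolingual corpora from both languages; no parallel
# data is needed, only the small equivalence dictionary above.
corpus = [["the", "cat", "sat", "in", "the", "house"],
          ["die", "katze", "schlief", "im", "haus"]]

# Any off-the-shelf embedding model can consume the mixed corpus;
# here, gensim's skip-gram word2vec stands in for that choice.
model = Word2Vec(mix_corpus(corpus, equivalence),
                 vector_size=50, window=2, min_count=1, sg=1)
```

Because words from both languages now appear in shared contexts, their vectors land in a single space, and swapping in a different dictionary of equivalence classes adapts the embeddings to a different task.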

Cited by 102 publications (116 citation statements)
References 10 publications
Citation types: 5 supporting, 111 mentioning, 0 contrasting
“…Since SWTC is a less difficult task which requires coarse-grained representations, even limited amounts of training data may be sufficient to learn word embeddings which are useful for the specific task. This finding is in line with the recent work of Gouws and Søgaard (2015).…”
Section: BWESG vs. Baseline Representations (supporting)
confidence: 83%
“…Most methods rely on supervision encoded in parallel data, at the document level (Vulić and Moens, 2015), the sentence level (Zou et al., 2013; Chandar A P et al., 2014; Hermann and Blunsom, 2014; Kočiský et al., 2014; Luong et al., 2015; Coulmance et al., 2015; Oshikiri et al., 2016), or the word level (i.e. in the form of a seed lexicon) (Gouws and Søgaard, 2015; Wick et al., 2016; Duong et al., 2016; Shi et al., 2015; Mikolov et al., 2013a; Faruqui and Dyer, 2014; Lu et al., 2015; Ammar et al., 2016; Zhang et al., 2016a, 2017; Smith et al., 2017).…”
Section: Bilingual Lexicon Induction (mentioning)
confidence: 99%
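As an illustration of the word-level (seed lexicon) supervision cited in the statement above, a sketch of the linear-mapping approach of Mikolov et al. (2013a), assuming pre-trained monolingual embedding matrices are already available; the function names, array shapes, and least-squares formulation are this sketch's assumptions, not a definitive implementation.

```python
import numpy as np

def learn_mapping(src_seed, tgt_seed):
    """Least-squares fit of a linear map W so that src_seed @ W
    approximates tgt_seed, with rows aligned by a seed lexicon
    (the translation-matrix idea of Mikolov et al., 2013a)."""
    W, *_ = np.linalg.lstsq(src_seed, tgt_seed, rcond=None)
    return W

def induce_lexicon(W, src_vecs, tgt_vecs, tgt_words, k=1):
    """Map source vectors into the target space and retrieve the
    k nearest target words by cosine similarity."""
    mapped = src_vecs @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = mapped @ tgt.T                   # (n_src, n_tgt) cosine scores
    top = np.argsort(-sims, axis=1)[:, :k]  # indices of best candidates
    return [[tgt_words[j] for j in row] for row in top]
```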
“…Trying to find such representations for a large multilingual vocabulary can thus become computationally prohibitive. Some attempts have recently been made in this direction by leveraging multilingual external resources such as Wikipedia articles (Al-Rfou', Perozzi, & Skiena, 2013), bilingual dictionaries (Gouws & Søgaard, 2015), word-aligned parallel corpora (Klementiev, Titov, & Bhattarai, 2012), sentence-aligned parallel corpora (Zou, Socher, Cer, & Manning, 2013; Hermann & Blunsom, 2014; Lauly, Boulanger, & Larochelle, 2014; Chandar, Lauly, Larochelle, Khapra, Ravindran, Raykar, & Saha, 2014), or document-aligned parallel corpora (Vulić & Moens, 2015). However, such external resources may not always be available for all language combinations and, when they are available (e.g., Wikipedia articles), they may be of uneven quality and quantity for languages other than English.…”
Section: Distributional Representations (mentioning)
confidence: 99%