Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1152

Learning Translations via Matrix Completion

Abstract: Bilingual Lexicon Induction is the task of learning word translations without bilingual parallel corpora. We model this task as a matrix completion problem, and present an effective and extendable framework for completing the matrix. This method harnesses diverse bilingual and monolingual signals, each of which may be incomplete or noisy. Our model achieves state-of-the-art performance for both high- and low-resource languages.
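The matrix-completion framing of the abstract can be illustrated with a small sketch. This is a toy example with a generic low-rank completion, not the paper's actual model: seed-dictionary entries fill a few cells of a source-by-target word matrix, and a rank-k factorization fit only on the observed cells scores the missing ones.

```python
import numpy as np

# Toy translation matrix: rows = source words, columns = target words.
# 1.0 = observed translation pair (e.g., from a seed dictionary); NaN = unknown.
src = ["gato", "perro", "casa"]
tgt = ["cat", "dog", "house"]
M = np.array([
    [1.0, np.nan, np.nan],
    [np.nan, 1.0, np.nan],
    [np.nan, np.nan, 1.0],
])

# Rank-k completion by alternating least squares over observed cells only.
rng = np.random.default_rng(0)
k = 2                       # assumed factor rank (illustrative choice)
lam = 0.1                   # assumed L2 regularization strength
U = rng.normal(scale=0.1, size=(len(src), k))   # source-word factors
V = rng.normal(scale=0.1, size=(len(tgt), k))   # target-word factors
obs = ~np.isnan(M)          # mask of observed entries

for _ in range(50):
    # Fix V, solve the regularized least-squares problem for each row of U.
    for i in range(len(src)):
        cols = obs[i]
        if cols.any():
            Vc = V[cols]
            U[i] = np.linalg.solve(Vc.T @ Vc + lam * np.eye(k), Vc.T @ M[i, cols])
    # Fix U, solve for each row of V symmetrically.
    for j in range(len(tgt)):
        rows = obs[:, j]
        if rows.any():
            Ur = U[rows]
            V[j] = np.linalg.solve(Ur.T @ Ur + lam * np.eye(k), Ur.T @ M[rows, j])

scores = U @ V.T  # higher score = more plausible translation pair
```

The completed `scores` matrix assigns a value to every (source, target) cell, including the ones that were unobserved, which is exactly what lets the framework rank translation candidates for words outside the seed dictionary.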

Cited by 13 publications (5 citation statements). References 32 publications.
“…Upadhyay et al. (2016) obtain evaluation sets for the task across 26 languages from the Open Multilingual WordNet (Bond & Foster, 2013), while Levy et al. (2017) obtain bilingual dictionaries from Wiktionary for Arabic, Finnish, Hebrew, Hungarian, and Turkish. More recently, Wijaya, Callahan, Hewitt, Gao, Ling, Apidianaki, and Callison-Burch (2017) build evaluation data for 28 language pairs (where English is always the target language) by semi-automatically translating all Wikipedia words with frequency above 100. Most previous work (Vulić & Moens, 2013a; Mikolov et al., 2013b) filters source and target words based on part-of-speech, though this simplifies the task and introduces bias in the evaluation.…”
Section: Extrinsic Tasks
confidence: 99%
“…To combat the issue of data starvation, many researchers aim to utilize monolingual data to train NMT systems (Lample et al., 2018a; Artetxe et al., 2018; Conneau and Lample, 2019) and to generate more training data, either comparable or synthetic. Comparable data are extracted using various bitext retrieval methods (Zhao and Vogel, 2002; Fan et al., 2021; Kocyigit et al., 2022), multimodal signals (Hewitt et al., 2018; Rasooli et al., 2021), or dictionary- or knowledge-based approaches (Wijaya and Mitchell, 2016; Wijaya et al., 2017; Tang and Wijaya, 2022), while synthetic data are created through training-data augmentation (Kuwanto et al., 2021), automatic backtranslation (Sennrich et al., 2016a; Wang et al., 2019), or outright generation with generative models (Lu et al., 2023), an approach that has lately gained increasing attention from the community due to the advancement of large language models (LLMs).…”
Section: Augmenting Training for NMT
confidence: 99%
“…So, a word in the target language is a translation candidate of a word in the source language if it tends to co-occur with the pairs of words from the seed words. A slightly different strategy is reported in Wijaya et al. (2017), where the learning task is modeled as a matrix completion problem with source words in the columns and target words in the rows. More precisely, starting from some observed translations (e.g., from existing bilingual dictionaries), the method infers missing translations in the matrix via matrix factorization with a Bayesian Personalized Ranking objective.…”
Section: Cross-lingual Word Similarity from Monolingual Corpora
confidence: 99%
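The matrix-factorization-with-BPR step described in the statement above can be sketched as follows. This is a toy re-implementation under assumed hyperparameters (factor rank, learning rate, regularization), not the authors' code: for each observed (source, target) pair, a non-translation target is sampled, and the factor matrices are updated by SGD so that the observed translation outranks the sampled negative under the logistic (BPR) ranking loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_tgt, k = 4, 4, 2           # toy sizes; k is an assumed factor rank
# Observed (source, target) translation pairs, e.g. from a seed dictionary.
observed = {(0, 0), (1, 1), (2, 2), (3, 3)}
pairs = sorted(observed)

U = rng.normal(scale=0.1, size=(n_src, k))   # source-word factors
V = rng.normal(scale=0.1, size=(n_tgt, k))   # target-word factors
lr, reg = 0.05, 0.01                 # assumed SGD step size and L2 penalty

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    i, j = pairs[rng.integers(len(pairs))]
    # Sample a negative target j_neg that is not a known translation of i.
    j_neg = rng.integers(n_tgt)
    while (i, j_neg) in observed:
        j_neg = rng.integers(n_tgt)
    # BPR: maximize log sigmoid(score(i, j) - score(i, j_neg)).
    x = U[i] @ (V[j] - V[j_neg])
    g = sigmoid(-x)                  # gradient scale of the ranking loss
    U[i] += lr * (g * (V[j] - V[j_neg]) - reg * U[i])
    V[j] += lr * (g * U[i] - reg * V[j])
    V[j_neg] += lr * (-g * U[i] - reg * V[j_neg])

scores = U @ V.T  # each row ranks target words for one source word
```

Because the objective is pairwise, the model only needs positive (observed) translations plus sampled negatives; no explicit "non-translation" labels are required, which suits the sparse seed dictionaries the statement mentions.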