Finding translations for low-frequency words in comparable corpora

Pekar, Viktor; Mitkov, Ruslan; Blagoev, Dimitar; Mulloni, Andrea

doi:10.1007/s10590-007-9029-7

Cited by 18 publications

(11 citation statements)

References 23 publications


“…Some of these works include dictionary learning and identifying word translations (Rapp 1995; Fung and Yee 1998; Sadat et al 2003; Pekar et al 2006; Xabier et al 2008), finding translation equivalents (Bennison and Bowker 2000; Chiao and Zweigenbaum 2002; Sharoff et al 2006), named entity translation/transliteration (Huang et al 2005; Alegria et al 2006; Sproat and Zhai 2006), extracting phrasal alignments (Kumano et al 2007), mining name translations (Ji 2009), word sense disambiguation (Kaji 2003), parallel fragment extraction (Quirk et al 2007), cross language IR (CLIR) (Talvensaari 2008), extracting lay paraphrases of specialized expressions (Deléger and Zweigenbaum 2009), language and translation model adaptation (Hildebrand et al 2005; Snover et al 2008) and improving SMT performance using extracted parallel sentences (Munteanu and Marcu 2005; Schwenk 2009a, 2009b; Lu et al 2010). This article describes a method for exploiting comparable news corpora to produce more parallel texts and eventually improve SMT system performance.…”

Section: Introduction (mentioning)

confidence: 99%

Parallel sentence generation from comparable corpora for improved SMT

Rauf, Schwenk

2011

Machine Translation


A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus and the translations are used as queries to conduct information retrieval from the target-language side of the comparable corpus. Simple filters are then used to score the SMT output and the IR-returned sentence with the filter score defining the degree of similarity between the two. Using SMT system output gives the benefit of trying to correct one of the common errors by sentence tail removal. The approach was applied to Arabic-English and French-English systems using comparable news corpora and considerable improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used to make the queries, strengthening the claim of applicability of the approach for languages and domains with limited parallel corpora available to start with. We compare our approach with one of the earlier approaches and show that our approach is easier to implement and gives equally good improvements.


“…Most of the work in this line (Rapp 1999; Fung and McKeown 1997; Bouamor et al 2012), including our own work (Pekar et al 2006), covers single words and not multiword expressions. According to the distributional similarity premise, translation equivalents share common words in their contexts and this applies also to multiword expressions.…”

Section: Translation Of Multiword Expressions: Methodology and Evaluation (mentioning)

confidence: 92%

Computational Phraseology light: automatic translation of multiword expressions without translation resources

Mitkov

2016

Yearbook of Phraseology

Self Cite


This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents.

This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire, being expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho.

The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to automatically compile the comparable corpora, and only documents above a specific threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted. [Footnote 1: However, see section 6, which discusses experiments with different comparability scores.] [Author contact: Ruslan Mitkov, Research Institute in Information and Language Processing, University of Wolverhampton, R.Mitkov@wlv.ac.uk]

Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun above a specific threshold would be a (candidate for) multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience.

This project follows the distributional similarity premise which stipulates that translation equivalents share common words in their contexts and this applies also to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In thi...
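The vector-space representation mentioned in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy corpus, the symmetric two-token window, the use of raw co-occurrence counts, and the helper names `context_vector` and `cosine`. The sketch stays within one language; it does not show how contexts are matched across languages.

```python
import math
from collections import Counter

def context_vector(target, sentences, window=2):
    """Count the words that co-occur with `target` within `window` tokens.

    Hypothetical helper: window size and raw counts are illustrative choices.
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

# Toy corpus (invented for this example).
corpus = [
    "we pay attention to the news".split(),
    "they pay heed to the news".split(),
    "cats chase mice in the garden".split(),
]

attention = context_vector("attention", corpus)
heed = context_vector("heed", corpus)
mice = context_vector("mice", corpus)

print(cosine(attention, heed))  # words with shared contexts score higher
print(cosine(attention, mice))  # words with few shared contexts score lower
```

Note that the 0.45 threshold quoted in the abstract applies to document-level comparability when compiling the corpus, not to word-level similarity scores of the kind computed in this sketch.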


“…Figure 6 shows that words that appear with higher frequency in our monolingual corpora tend to be translated better. Pekar et al (2006) also investigated the effects of frequency on finding translations from comparable corpora. This makes sense since we have more robust statistics when constructing their vector representations.…”

Section: Learning Translations Of Unseen Words (mentioning)

confidence: 99%

End-to-end statistical machine translation with zero or small parallel texts

Irvine

Callison-Burch

2016

Nat. Lang. Eng.


We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
