This paper describes the University of Edinburgh's (UEDIN) phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT). We participated in all language pairs. We have improved upon our 2013 system by i) using generalized representations, specifically automatic word clusters, for translation out of English, ii) using unsupervised character-based models to translate unknown words in the Russian-English and Hindi-English pairs, iii) synthesizing Hindi data from closely related Urdu data, and iv) building huge language models on the Common Crawl corpus.
Translation Task

Our baseline systems are based on the setup described by Durrani et al. (2013b) that we used for the Eighth Workshop on Statistical Machine Translation (Bojar et al., 2013). The notable features of these systems are described in the baseline section below; the experiments that we carried out for this year's translation task are described in the sections that follow.
Baseline

We trained our systems with the following settings: a maximum sentence length of 80, grow-diag-final-and symmetrization of GIZA++ alignments, an interpolated Kneser-Ney smoothed 5-gram language model built with KenLM (Heafield, 2011), cube pruning (Huang and Chiang, 2007) with a stack size of 1000 during tuning and 5000 during testing, and the no-reordering-over-punctuation heuristic (Koehn and Haddow, 2009). We used POS and morphological tags as additional factors in the phrase translation models for the German-English language pairs. We also trained target sequence models on the in-domain subset of the parallel corpus using Kneser-Ney smoothed 7-gram models. We used syntactic pre-ordering (Collins et al., 2005) and compound splitting (Koehn and Knight, 2003) for the German-to-English systems. We used a trivial tokenizer for tokenizing Hindi.

The systems were tuned on a very large tuning set consisting of the test sets from 2008-2012, with a total of 13,071 sentences, and newstest2013 was used for development experiments. For the Russian-English pairs, newstest2012 was used for tuning, and for the Hindi-English pairs we divided newsdev2014 into two halves, using the first half for tuning and the second for development experiments.
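To make the language-model setup above concrete, the following is a minimal Python sketch of how an interpolated modified Kneser-Ney 5-gram model can be estimated and queried with the KenLM toolkit. The file names (corpus.en, lm.arpa) are placeholders, and the sketch assumes that the lmplz binary is on the PATH and that the kenlm Python module is installed; these details are not specified in the paper.

    import subprocess
    import kenlm

    # Estimate a 5-gram model; lmplz applies interpolated modified
    # Kneser-Ney smoothing by default.
    with open("corpus.en") as text, open("lm.arpa", "w") as arpa:
        subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)

    # Query the model: score() returns the log10 probability of the
    # tokenized sentence, including begin/end-of-sentence markers.
    model = kenlm.Model("lm.arpa")
    print(model.score("this is a test", bos=True, eos=True))
    print(model.perplexity("this is a test"))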
Using Generalized Word Representations

We explored the use of automatic word clusters in phrase-based models (Durrani et al., 2014a). We computed the clusters with GIZA++'s mkcls (Och, 1999) on the source and target sides of the parallel training corpus. Clusters are word classes that are optimized to reduce n-gram perplexity. By generating a cluster identifier for each output word, we are able to add an n-gram model over these identifiers as an additional scoring function.
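As an illustration of the idea (not the Moses implementation itself), the Python sketch below maps target words to their cluster identifiers and scores the resulting identifier sequence with a separate n-gram model. It assumes mkcls output with one word and its class per line, and a class-level language model already estimated with lmplz over cluster-ID sequences; both file names are hypothetical.

    import kenlm

    def load_clusters(path):
        """Parse mkcls output, assumed to list one 'word<TAB>class' pair per line."""
        clusters = {}
        with open(path) as f:
            for line in f:
                word, cls = line.split()
                clusters[word] = cls
        return clusters

    clusters = load_clusters("corpus.en.classes")  # hypothetical mkcls output file
    class_lm = kenlm.Model("clusters.arpa")        # n-gram model over cluster IDs

    def cluster_score(sentence):
        """Replace each output word by its cluster ID and score the ID sequence."""
        ids = " ".join(clusters.get(w, "UNK") for w in sentence.split())
        return class_lm.score(ids, bos=True, eos=True)

    print(cluster_score("the house is small"))

In the actual systems this score would be one feature among many in the phrase-based model's log-linear combination, tuned alongside the standard features.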