2005
DOI: 10.1162/089120105775299168

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Abstract: We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. W…
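
As a rough illustration of the classifier described in the abstract, here is a minimal sketch. For two classes, a maximum entropy model is equivalent to logistic regression, so the sketch trains a small logistic regression by gradient descent on a few features of a sentence pair. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the toy French-English lexicon (`LEXICON`), the three features, the training pairs, and the hyperparameters.

```python
import math

# Hypothetical toy lexicon (source word -> possible English translations).
# A real system would use a dictionary learned from a seed parallel corpus.
LEXICON = {
    "le": {"the"}, "la": {"the"}, "chat": {"cat"}, "noir": {"black"},
    "maison": {"house", "home"}, "est": {"is"}, "mange": {"eats"},
    "grande": {"big", "large"},
}

def features(src, tgt):
    """Return [bias, length ratio, source coverage, target coverage].

    Assumes both sentences are non-empty and whitespace-tokenized.
    """
    s, t = src.lower().split(), tgt.lower().split()
    t_set = set(t)
    translations = set().union(*(LEXICON.get(w, set()) for w in s))
    covered_src = sum(1 for w in s if LEXICON.get(w, set()) & t_set)
    covered_tgt = sum(1 for w in t if w in translations)
    length_ratio = min(len(s), len(t)) / max(len(s), len(t))
    return [1.0, length_ratio, covered_src / len(s), covered_tgt / len(t)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights by batch gradient descent."""
    xs = [features(s, t) for s, t in pairs]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, labels):
            err = score(w, x) - y  # gradient of the log loss w.r.t. the logit
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w

def prob_parallel(w, src, tgt):
    """Estimated probability that tgt is a translation of src."""
    return score(w, features(src, tgt))

# Hypothetical training data: 1 = mutual translations, 0 = unrelated.
PAIRS = [
    ("le chat noir", "the black cat"),
    ("la maison est grande", "the house is big"),
    ("le chat mange", "the cat eats"),
    ("le chat noir", "stock markets fell sharply today"),
    ("le chat noir", "he plays football"),
    ("le chat mange", "the government announced new elections yesterday"),
]
LABELS = [1, 1, 1, 0, 0, 0]

weights = train(PAIRS, LABELS)
```

A candidate pair could then be kept when `prob_parallel(weights, src, tgt)` exceeds a chosen threshold such as 0.5. The threshold is likewise an assumption of this sketch.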

Cited by 217 publications (108 citation statements)
References 19 publications
“…Translations of in-domain terms can be mined from comparable corpora, i.e. texts that are not strictly parallel but deal with the same topic [29,30]. Bertoldi and Federico [31] exploit large amounts of in-domain monolingual data to create synthetic parallel training corpora.…”
Section: Domain Adaptation (mentioning)
confidence: 99%
“…In their later seminal paper, Munteanu and Marcu (2005) improved their method by training a maximum entropy classifier which for a given pair of sentences can reliably determine whether or not they are translations of each other. They also showed empirically that a statistical MT system can be built from scratch by starting with a small parallel corpus of only 100,000 words and by expanding it using parallel segments as extracted from pairs of the very large Gigaword-Corpora (Arabic-English and Chinese-English).…”
Section: Mining Parallel Segments From Comparable Corpora (mentioning)
confidence: 99%
“…Statistical machine translation, uses manually translated data in the forms of parallel sentences to learn translation patterns by statistical means. There has been extensive work focusing in finding parallel documents [14] and aligning sentences in fairly parallel corpora [8] and even non-parallel corpora [9]. [10] presents an approach to find sub-sentential segments from comparable corpora.…”
Section: Related Work (mentioning)
confidence: 99%