2014
DOI: 10.3233/fi-2014-987

An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora

Abstract: Automatically building a large bilingual corpus containing millions of words is a challenging task. For low-resource languages in particular, it is difficult to find an existing parallel corpus large enough to build a practical statistical machine translation system. However, comparable non-parallel corpora are abundantly available on the Internet, for example in Wikipedia, and valuable parallel texts can be extracted from them. This work presents a framework for effectively extracting par…

Cited by 1 publication (1 citation statement)
References 24 publications (31 reference statements)
“…Tillmann and Xu (2009) and Smith et al (2010) detect parallel sentences by training IBM Model 1 and maximum entropy classifiers, respectively. In later work on detecting sentence and phrase translation pairs, Cettolo et al (2010) and Hoang et al (2014) use SMT systems to translate candidate documents; Quirk et al (2007) use parallel data to train a translation equivalence model; and Ture and Lin (2012) use a translation lexicon to build a scoring function for parallel documents. More recently, Ling et al (2013) trained IBM Model 1 on bitext to detect translationally equivalent phrase pairs within single microblog posts.…”
Section: Prior Work On Comparable Corpora
Type: mentioning
confidence: 99%