Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
DOI: 10.18653/v1/n16-1132
Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora

Abstract: Most work on extracting parallel text from comparable corpora depends on linguistic resources such as seed parallel documents or translation dictionaries. This paper presents a simple baseline approach for bootstrapping a parallel collection. It starts by observing documents published on similar dates and the co-occurrence of a small number of identical tokens across languages. It then uses fast, online inference for a latent variable model to represent multilingual documents in a shared topic space where it ca…
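The core matching step the abstract describes, representing documents from different languages in a shared topic space and finding likely translation pairs by proximity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the similarity measure (Hellinger distance) and all variable names are assumptions for the example.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def rank_candidates(source_topics, target_topics):
    """Rank target-language documents by topic-space proximity to a source doc.

    source_topics: 1-D array, topic distribution of one source-language document.
    target_topics: 2-D array, one topic distribution per target-language document.
    Returns indices of target documents, nearest first.
    """
    dists = np.array([hellinger(source_topics, t) for t in target_topics])
    return np.argsort(dists)

# Toy 3-topic example (hypothetical distributions):
src = np.array([0.7, 0.2, 0.1])
tgts = np.array([[0.1, 0.1, 0.8],
                 [0.65, 0.25, 0.1]])
order = rank_candidates(src, tgts)
# order[0] == 1: the second target document is closest in topic space
```

In practice the topic distributions would come from inference in a polylingual topic model over the comparable corpus, and candidate pairs could first be restricted to documents published on similar dates, as the abstract suggests.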

Cited by 5 publications (2 citation statements)
References 14 publications
“…Topic assignments are denoted as z, and w denotes observed tokens. extensively used and adapted in various ways for different crosslingual tasks (Krstovski and Smith 2011;Moens and Vulic 2013;Vulić and Moens 2014;Liu, Duh, and Matsumoto 2015;Krstovski and Smith 2016).…”
Section: Document (mentioning; confidence: 99%)
“…Models that transfer knowledge on the document level have many variants, including SOFTLINK, comparable bilingual LDA (C-BILDA; Heyman, Vulic, and Moens 2016), the partially connected multilingual topic model (PCMLTM; Liu, Duh, and Matsumoto 2015), and the multi-level hyperprior polylingual topic model (MLHPLTM; Krstovski, Smith, and Kurtz 2016). SOFTLINK generalizes DOCLINK by using a dictionary, so that documents can be linked based on overlap in their vocabulary, even if the corpus is not parallel or comparable.…”
Section: Document (mentioning; confidence: 99%)
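The dictionary-based linking that the citation statement attributes to SOFTLINK, scoring a candidate document pair by how many source tokens have a known translation appearing in the target document, can be sketched as below. The function, dictionary, and documents are all hypothetical illustrations, not SOFTLINK's actual formulation.

```python
def link_documents(source_tokens, target_tokens, dictionary):
    """Score a source/target document pair via translation-dictionary overlap.

    dictionary: maps source-language words to sets of target-language words.
    Returns the fraction of translatable source tokens whose translation
    appears in the target document (0.0 if none are translatable).
    """
    target_vocab = set(target_tokens)
    translatable = [w for w in source_tokens if w in dictionary]
    if not translatable:
        return 0.0
    hits = sum(1 for w in translatable if dictionary[w] & target_vocab)
    return hits / len(translatable)

# Hypothetical toy dictionary (English -> German) and documents:
d = {"house": {"haus"}, "cat": {"katze"}}
score = link_documents(["house", "cat", "runs"], ["die", "katze"], d)
# score == 0.5: one of the two translatable source tokens is matched
```

A soft link of this kind needs no parallel or comparable corpus, only the dictionary, which is exactly the generalization over DOCLINK that the quoted passage describes.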