2020
DOI: 10.1609/aaai.v34i05.6500
Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Abstract: Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for …
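The abstract outlines the core architecture: a character-based embedder composes word vectors that a word-based classifier consumes, so subword knowledge can transfer between related languages that share characters. A minimal PyTorch sketch of that idea follows; all module names, dimensions, and pooling choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the joint character-embedder + word-classifier idea.
# Because related languages share characters, the character embeddings
# (and the LSTM over them) can transfer across languages.
# All names, dimensions, and pooling choices are illustrative assumptions.
import torch
import torch.nn as nn


class CharWordEmbedder(nn.Module):
    """Composes a word vector from the word's character sequence."""

    def __init__(self, n_chars: int, char_dim: int = 50, word_dim: int = 100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (n_words, max_word_len) of character indices
        hidden, _ = self.lstm(self.char_emb(char_ids))
        return hidden.mean(dim=1)  # (n_words, word_dim)


class CacoStyleClassifier(nn.Module):
    """Averages character-derived word vectors, then classifies the document."""

    def __init__(self, n_chars: int, n_labels: int, word_dim: int = 100):
        super().__init__()
        self.embedder = CharWordEmbedder(n_chars, word_dim=word_dim)
        self.output = nn.Linear(word_dim, n_labels)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids encodes one document: (n_words, max_word_len)
        word_vecs = self.embedder(char_ids)  # (n_words, word_dim)
        doc_vec = word_vecs.mean(dim=0)      # (word_dim,)
        return self.output(doc_vec)          # (n_labels,) logits
```

Under this setup, both modules can be trained jointly on the labeled related-language data and then applied to the target language, relying only on the shared character inventory.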

Cited by 13 publications (8 citation statements)
References 27 publications
“…Overall, the best method to generalize the generation model across languages is to use machine-translated data (Ammar et al., 2016; Cotterell and Heigold, 2017; Ahmad et al., 2019; Lin et al., 2019; Zhang et al., 2020a). We are also interested in training on both organic data and MT data, i.e., mixing the zero-shot and MT settings.…”
Section: Results On Other Languages
confidence: 99%
“…However, CLTC assumes that source labeled data are available and that the source and target tasks are identical (Upadhyay et al., 2016; Karamanolakis et al., 2020; Xu et al., 2016; Bel et al., 2003). Since the data requirement of CLTC can be restrictive, recent methods perform weakly supervised CLTC, where target labels are not required (Karamanolakis et al., 2020; Xu et al., 2016; Zhang et al., 2020a). Note that source labeled data are still required for such methods.…”
Section: Related Work
confidence: 99%
“…Cross-Lingual Document Classification. Prior approaches transfer knowledge with cross-lingual resources such as bilingual dictionaries (Wu et al., 2008; Shi et al., 2010), parallel text (Xu and Yang, 2017), labeled data from related languages (Zhang et al., 2020a), structural correspondences (Prettenhofer and Stein, 2010), multilingual topic models (Ni et al., 2011; Andrade et al., 2015), machine translation (Wan, 2009; Zhou et al., 2016), and CLWE (Klementiev et al., 2012). Our method instead brings a bilingual speaker into the loop to actively provide cross-lingual knowledge, which is more reliable in low-resource settings.…”
Section: Related Work
confidence: 99%
“…These CLWE help sentiment analysis (Zhang et al., 2020a). We develop CLassifying Interactively with Multilingual Embeddings (CLIME), which efficiently specializes CLWE with human interaction.…”
Section: Introduction
confidence: 99%
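The quoted passage says CLIME specializes cross-lingual word embeddings (CLWE) using human feedback. The sketch below shows one generic attract/repel update over human-labeled word pairs; it is not CLIME's published algorithm, and the function signature and learning rate are assumptions.

```python
# Generic attract/repel update of word embeddings from human feedback.
# This illustrates "specializing" CLWE with interaction in the abstract
# sense only; it is not CLIME's published algorithm.
import numpy as np


def specialize(emb, feedback, lr=0.1, n_epochs=10):
    """emb: dict word -> np.ndarray; feedback: list of (w1, w2, is_similar)."""
    for _ in range(n_epochs):
        for w1, w2, is_similar in feedback:
            step = lr * (emb[w1] - emb[w2])
            if is_similar:   # attract: move the two vectors closer
                emb[w1] -= step
                emb[w2] += step
            else:            # repel: push the two vectors apart
                emb[w1] += step
                emb[w2] -= step
    return emb
```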