Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1143

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data

Abstract: Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language. We present a computational technique for creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory. We show that when training examples are sampled appropriately from this synthetic data and presented in certain order (aka training curriculum) along with monolingual and real C…

Cited by 99 publications (113 citation statements)
Citation types: 3 supporting, 110 mentioning, 0 contrasting
References 31 publications
Citing publications span 2018 to 2024.
“…In recent work (Pratapa et al., 2018), we presented a methodology to generate linguistic-theory-based synthetic CM data (gCM) and showed its effectiveness in CM language modeling. Synthetic CM data was generated by employing Equivalence Constraint (EC) theory (Poplack, 1980; Sankoff, 1998).…”
Section: Bilingual Skip-gram Model (BiSkip) (citation type: mentioning)
confidence: 99%
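
A minimal sketch of how EC-based generation can be operationalized, assuming word-aligned parallel sentences: a switch point is kept only if no alignment link crosses it, so word order on both sides of the switch is preserved in both languages. The function names and the prefix-suffix construction are illustrative assumptions, not the exact procedure of Pratapa et al. (2018).

```python
# Illustrative sketch of Equivalence Constraint (EC) based switch-point
# generation over a word-aligned parallel sentence pair. This is an
# assumption-laden simplification, not the paper's exact algorithm.

def ec_switch_points(alignment, src_len):
    """Return source positions where a language switch keeps both
    word orders intact: no alignment link may cross the boundary."""
    points = []
    for i in range(1, src_len):
        # Target words aligned to the source prefix [0, i) ...
        left = {t for s, t in alignment if s < i}
        # ... and to the source suffix [i, src_len).
        right = {t for s, t in alignment if s >= i}
        # EC holds if every prefix-aligned target word precedes
        # every suffix-aligned target word (no crossing links).
        if not left or not right or max(left) < min(right):
            points.append(i)
    return points

def generate_cm(src_tokens, tgt_tokens, alignment):
    """Yield code-mixed sentences (source prefix + target suffix),
    one per EC-valid switch point."""
    for i in ec_switch_points(alignment, len(src_tokens)):
        left = {t for s, t in alignment if s < i}
        j = (max(left) + 1) if left else 0
        yield src_tokens[:i] + tgt_tokens[j:]

# Toy English-Spanish pair with a monotone alignment.
en = "the cat eats fish".split()
es = "el gato come pescado".split()
align = [(0, 0), (1, 1), (2, 2), (3, 3)]
for sent in generate_cm(en, es, align):
    print(" ".join(sent))
```

On this toy monotone pair, every word boundary is an EC-valid switch point; with reordering between the languages, fewer (or no) switch points survive the crossing-link test.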
“…In this paper, we compare three popular bilingual word embedding techniques (Sec 2): bilingual correlation-based embeddings (BiCCA) (Faruqui and Dyer, 2014), the bilingual compositional model (BiCVM) (Hermann and Blunsom, 2014), and Bilingual Skip-gram (BiSkip) (Luong et al., 2015) on two tasks for CM text: sentiment analysis, a semantic task, and POS tagging, a syntactic task. On the same tasks, we also compare word embeddings learnt from synthetic CM data, generated using linguistic models as proposed in recent work (Pratapa et al., 2018) (Sec 3). Note that Wick et al. (2016) use artificial code-mixed data to learn multilingual embeddings, but their aim is to generate bilingual embeddings for monolingual or cross-lingual tasks.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
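
As a rough illustration of this evaluation setup, the sketch below embeds CM sentences by averaging word vectors drawn from one shared bilingual space and trains a simple classifier for the semantic task (sentiment). The toy vectors, vocabulary, and labels are placeholders standing in for whatever BiCCA, BiCVM, or BiSkip would actually produce.

```python
# Hypothetical evaluation harness: a single classifier over sentence
# vectors from one shared English-Hindi embedding space, so code-mixed
# input needs no per-language handling. All values below are toys.
import numpy as np
from sklearn.linear_model import LogisticRegression

dim = 4
embeddings = {                                   # toy bilingual space
    "good": np.ones(dim), "accha": np.ones(dim),     # "accha" ~ good
    "bad": -np.ones(dim), "bura": -np.ones(dim),     # "bura" ~ bad
    "movie": np.zeros(dim), "film": np.zeros(dim),
}

def sent_vec(tokens):
    """Average the vectors of in-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

train = [("movie accha".split(), 1), ("film bura".split(), 0),
         ("good movie".split(), 1), ("bad film".split(), 0)]
X = np.stack([sent_vec(toks) for toks, _ in train])
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

# A code-mixed test sentence mixes both languages freely.
print(clf.predict([sent_vec("accha film".split())]))  # -> [1]
```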
“…Generating synthesized CS data using only monolingual data has also been explored in [11, 12, 13, 14]; however, those approaches only address the textual data scarcity problem. Figure 1 illustrates the baseline E2E-CS-ASR model, based on the hybrid CTC/Attention architecture [15], which combines the advantages of the Connectionist Temporal Classification (CTC) model [16] and the attention-based encoder-decoder model [17].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
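
For orientation, the hybrid CTC/Attention objective referenced here is an interpolation of the two losses computed over a shared encoder. The sketch below assumes PyTorch, batch-first decoder logits, and an illustrative interpolation weight of 0.3, and it ignores target padding for brevity.

```python
# Minimal sketch of the hybrid CTC/Attention multi-task loss:
# loss = lambda * L_CTC + (1 - lambda) * L_attention.
# Shapes, names, and the weight 0.3 are assumptions for illustration.
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_logits, input_lens, att_logits, targets, target_lens,
                lam=0.3, blank=0):
    # CTC branch: (B, T, V) encoder logits -> (T, B, V) log-probs.
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                     blank=blank)
    # Attention branch: (B, L, V) decoder logits vs. target tokens
    # (padding ignored here for brevity).
    att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                          targets.reshape(-1))
    return lam * ctc + (1.0 - lam) * att

# Toy usage with random tensors.
B, T, L, V = 2, 50, 7, 30
loss = hybrid_loss(torch.randn(B, T, V), torch.full((B,), T),
                   torch.randn(B, L, V),
                   torch.randint(1, V, (B, L)), torch.full((B,), L))
print(loss.item())
```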
“…Linguistic studies show that bilingual speakers switch languages by following various complex constraints [18, 17], which may even include the intensity of sentiment expressed in various segments of text [23]. [20] synthesized code-mixed sentences by leveraging linguistic constraints arising from Equivalence Constraint Theory. While this works well for language pairs with good structural correspondence (like English-Spanish), we observe that performance degrades with weaker correspondence (like English-Hindi).…”
Section: Introduction (citation type: mentioning)
confidence: 99%
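
One crude, hypothetical proxy for this notion of structural correspondence is the fraction of crossing word-alignment links in a parallel corpus: more crossings mean more word-order divergence between the two languages, and hence fewer EC-valid switch points. The metric below is our illustration, not taken from [20].

```python
# Illustrative proxy for structural correspondence between a language
# pair: the crossing rate of word-alignment links in a sentence pair.
# 0.0 = fully monotone word order; higher = more reordering.
from itertools import combinations

def crossing_rate(alignment):
    """alignment: list of (src_idx, tgt_idx) links for one sentence."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    crossings = sum(1 for (s1, t1), (s2, t2) in pairs
                    if (s1 - s2) * (t1 - t2) < 0)
    return crossings / len(pairs)

# Monotone (English-Spanish-like) vs. reordered (English-Hindi-like).
print(crossing_rate([(0, 0), (1, 1), (2, 2)]))          # 0.0
print(crossing_rate([(0, 0), (1, 3), (2, 2), (3, 1)]))  # 0.5
```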