Metrics for Modeling Code-Switching Across Corpora

Guzmán, Gualberto A.; Ricard, Joseph; Serigos, Jacqueline; Bullock, Barbara E.; Toribio, Almeida Jacqueline

doi:10.21437/interspeech.2017-1429

Cited by 45 publications

(46 citation statements)

References 22 publications

“…Recent studies have focused on empirical measurements of code-switching (Guzmán et al, 2017). The multilingual index(M-Index), Language Entropy and Integration index(I-index) measure the extent of mixing and switching frequency.…”

Section: Discussion (mentioning)

confidence: 99%

Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques

Chandu, Loginova, Gupta, et al. 2018

Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching


Code-Mixing (CM) is the phenomenon of alternating between two or more languages which is prevalent in bi- and multilingual communities. Most NLP applications today are still designed with the assumption of a single interaction language and are most likely to break given a CM utterance with multiple languages mixed at a morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research which supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages, Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English), which belong to two language families. We share the details of our data collection process, techniques which were used to avoid inducing lexical bias amongst the crowd workers and other CM specific linguistic properties of the dataset. Our final dataset, which is available freely for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We discuss the techniques used by the participants for the first edition of this ongoing challenge.
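The citation statement above names three corpus-level measures: the M-Index, Language Entropy and the I-index. The following is a minimal sketch of how they are commonly computed from a sequence of per-token language tags. It assumes the usual formulations: M-Index as (1 - Σp²) / ((k - 1) Σp²), entropy in bits, and I-index as the fraction of adjacent token pairs that switch language. The function names, the tag format, and the choice to take k as the number of distinct tags observed are assumptions made for illustration, not details taken from the cited papers.

```python
import math
from collections import Counter


def m_index(tags):
    """Multilingual index: 0 for monolingual text, 1 for perfectly balanced use."""
    counts = Counter(tags)
    k = len(counts)  # assumption: k = number of distinct language tags observed
    if k < 2:
        return 0.0
    total = sum(counts.values())
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def language_entropy(tags):
    """Shannon entropy (in bits) of the language-tag distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def i_index(tags):
    """Integration index: share of adjacent token pairs whose language differs."""
    if len(tags) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / (len(tags) - 1)
```

Note that the M-Index and Language Entropy depend only on how much of each language is present, while the I-index depends on token order, so two corpora with the same language proportions can still differ in switching frequency.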

“…Here we analyse the features like length distribution and diversity of code-switching of generated synthetic texts. We also measured one sentence level metric Code-Mixing Index (CMI) coined by [10], and three corpus level metrics Multilingual index (M-Index), Burstiness and Span Entropy that were introduced in [13] to demonstrate how different the generated texts are from the training corpus in terms of switching.…”

Section: Direct/intrinsic Evaluation (mentioning)

confidence: 99%

A Deep Generative Model for Code Switched Text

Samanta, Reddy, Jagirdar, et al. 2019

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence


Code-switching, the interleaving of two or more languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from continuous latent space, they cannot adequately address code-switched text, owing to their informal style and complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level, and language switching signals in the upper layer. Sampling representations from the prior and decoding them produced well-formed, diverse code-switched sentences. Extensive experiments show that using synthetic code-switched text with natural monolingual data results in significant (33.06%) drop in perplexity.
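The citation statement for this entry also refers to Burstiness and Span Entropy as corpus-level metrics. Both are defined over language spans, that is, maximal runs of consecutive tokens in the same language. The sketch below is a hedged illustration, assuming the common burstiness formulation (σ - μ) / (σ + μ) over span lengths and Shannon entropy (in bits) over the distribution of span lengths. The function names and the use of the population standard deviation are assumptions, not details confirmed by the cited papers.

```python
import math
from collections import Counter
from itertools import groupby
from statistics import mean, pstdev


def span_lengths(tags):
    """Lengths of maximal runs of identical language tags, in order."""
    return [len(list(group)) for _, group in groupby(tags)]


def burstiness(tags):
    """(sigma - mu) / (sigma + mu) over span lengths; ranges from -1 to 1."""
    spans = span_lengths(tags)
    if not spans:
        return 0.0
    mu = mean(spans)
    sigma = pstdev(spans)  # assumption: population standard deviation
    return (sigma - mu) / (sigma + mu)


def span_entropy(tags):
    """Shannon entropy (in bits) of the distribution of span lengths."""
    spans = span_lengths(tags)
    counts = Counter(spans)
    total = len(spans)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these definitions, perfectly regular switching (all spans the same length) gives a burstiness of -1 and a span entropy of 0, while irregular switching pushes both values upward.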


“…• Importantly, bilingual speech practices are complex and it is not clear that the traditional binary typology of insertional and alternational C-S, while useful as a heuristic, is adequate to characterize the nature of C-S (Auer and Muhamedova, 2005). There have been recent attempts to quantify mixing complexity with the aim of arriving at empirically reliable comparisons of C-S between corpora (Gambäck and Das, 2016;Das and Gambäck, 2014;Jamatia et al, 2015;Guzman et al, 2016;Guzmán et al, 2017a). Each aims to capture the fact that C-S may vary along multiple planes.…”

Section: Sentence (mentioning)

confidence: 99%

Predicting the presence of a Matrix Language in code-switching

Bullock, Guzmán, Serigos, et al. 2018

Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Self Cite


One language is often assumed to be dominant in code-switching (C-S), but this assumption has not been empirically tested. We operationalize the matrix language (ML) at the level of the sentence, using three common definitions. We test whether these converge and then model this convergence via a set of metrics that together quantify the nature of C-S. We conduct our experiment on four different Spanish-English corpora. Our results demonstrate that our model can separate some corpora according to whether they have a dominant ML or not but that the corpora span a range of mixing types that cannot be sorted neatly into an insertional vs. alternational dichotomy.
