Proceedings of the Second Workshop on Computational Approaches to Code Switching 2016
DOI: 10.18653/v1/w16-5806

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

Abstract: This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first place for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second place for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and MSA-Egyptian dialect. The system makes use of word and character level representatio…
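
The abstract describes a unified architecture built on word- and character-level representations, but the implementation itself is not reproduced here. As a rough orientation only, below is a minimal sketch, assuming a PyTorch setting and hypothetical names (CharWordTagger, the dimension sizes, the tag count), of a token-level tagger that concatenates word embeddings with a character-BiLSTM summary before a token-level BiLSTM; it is not the authors' code.

```python
# Minimal sketch (not the authors' system): a token-level language-ID tagger
# that combines word embeddings with a character-level BiLSTM summary.
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    def __init__(self, word_vocab, char_vocab, n_tags,
                 word_dim=100, char_dim=25, char_hidden=25, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # Character BiLSTM produces one vector per token from its characters.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        # Token BiLSTM runs over the concatenated word + character representations.
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, c, -1)
        _, (h, _) = self.char_lstm(chars)                 # final states, both directions
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        states, _ = self.word_lstm(tokens)
        return self.out(states)                           # per-token tag scores

# Example forward pass with dummy indices.
model = CharWordTagger(word_vocab=5000, char_vocab=100, n_tags=6)
scores = model(torch.randint(1, 5000, (2, 7)), torch.randint(1, 100, (2, 7, 12)))
print(scores.shape)  # torch.Size([2, 7, 6])
```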

Cited by 58 publications (44 citation statements); references 19 publications.
“…σ1σ2 (Eq. 9) and is bounded within the interval [-1,1]. Memory values close to -1 describe the tendency for consecutive language spans to be negatively autocorrelated, differing substantially in length; that is, long spans of discourse are followed by short spans of discourse, and short spans are followed by long spans.…”
Section: Memory (mentioning)
confidence: 99%
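
The excerpt refers to a memory coefficient (its Eq. 9) computed over consecutive language span lengths, but the full equation is not quoted above. The sketch below assumes the standard lag-1 autocorrelation form of such a coefficient and is purely illustrative; the function name is hypothetical.

```python
# Hedged sketch: a lag-1 "memory" coefficient over consecutive language span
# lengths, assuming the form M = mean[(t_i - m1)(t_{i+1} - m2)] / (s1 * s2).
# The exact Eq. (9) is not reproduced in the excerpt, so this is illustrative.
import numpy as np

def memory_coefficient(span_lengths):
    spans = np.asarray(span_lengths, dtype=float)
    first, second = spans[:-1], spans[1:]          # consecutive pairs (t_i, t_{i+1})
    s1, s2 = first.std(), second.std()
    if s1 == 0 or s2 == 0:
        return 0.0                                 # no variation: coefficient undefined
    return float(np.mean((first - first.mean()) * (second - second.mean())) / (s1 * s2))

# Long spans followed by short spans (and vice versa) push the value toward -1.
print(memory_coefficient([9, 1, 8, 2, 10, 1, 7, 2]))   # strongly negative
print(memory_coefficient([5, 5, 5, 6, 5, 5, 6, 5]))    # near 0
```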
“…The global rise of social media such as Facebook, Twitter, SMS, and Usenet newsgroups has afforded large quantities of user-generated data that incorporates C-S [5,6,7,8,9]. However, the occurrence of multiple languages within a single text presents significant complexity for automated processing.…”
Section: Introduction (mentioning)
confidence: 99%
“…The task of LID for CS has been frequently studied in recent years (Al-Badrashiny and Diab, 2016; Rijhwani et al., 2017; Zhang et al., 2018), including two shared tasks on the topic (Solorio et al., 2014; Molina et al., 2016). The best systems (Samih et al., 2016; Shirvani et al., 2016) achieved over 90% accuracy for all language pairs. However, intra-word CS was not handled explicitly, and often systems even failed to correctly assign the mixed label.…”
Section: Related Work (mentioning)
confidence: 99%
“…Second, a subword-level model segments words with composed language ID tags. For word-level tagging, we use a hierarchical bidirectional LSTM (BiLSTM) that incorporates both token- and character-level information (Plank et al., 2016), similar to the winning system (Samih et al., 2016) of the Second Code-Switching Shared Task (Molina et al., 2016). For the subword level, we use two supervised segmentation methods: a CRF segmenter proposed by Ruokolainen et al. (2013), which models segmentation as a labeling problem, and a sequence-to-sequence (Seq2Seq) model trained with an auxiliary task, as proposed by Kann et al. (2018).…”
Section: Baselines (mentioning)
confidence: 99%
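
The cited baseline casts subword segmentation as a labeling problem. As a minimal illustration of that framing, and not the cited CRF segmenter itself, the following sketch maps gold segments to per-character B/I labels and back; the helper names and the example word are hypothetical.

```python
# Illustrative sketch: word segmentation as a character-labeling problem,
# where each character receives B (begins a segment) or I (inside a segment).
def segments_to_labels(segments):
    """['geht', 's'] -> per-character labels B I I I B."""
    labels = []
    for seg in segments:
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels

def labels_to_segments(word, labels):
    """Invert the labeling: split the word at every B boundary."""
    segments, current = [], ""
    for ch, lab in zip(word, labels):
        if lab == "B" and current:
            segments.append(current)
            current = ""
        current += ch
    segments.append(current)
    return segments

labels = segments_to_labels(["geht", "s"])
print(labels)                                   # ['B', 'I', 'I', 'I', 'B']
print(labels_to_segments("gehts", labels))      # ['geht', 's']
```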
“…Training and decoding are performed by the Viterbi algorithm. Note that replacing the softmax with a CRF at the output layer in neural networks has proved to be very fruitful in many sequence labeling tasks (Ma and Hovy, 2016; Huang et al., 2015; Lample et al., 2016; Samih et al., 2016).…”
Section: Conditional Random Fields (CRF) (mentioning)
confidence: 99%
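
To make the CRF output layer concrete, the following is a minimal sketch of Viterbi decoding over per-token emission scores and a tag-transition matrix; the function name and toy dimensions are assumptions, not the implementation used in the cited systems.

```python
# Minimal sketch of Viterbi decoding for a linear-chain CRF output layer,
# given per-token emission scores from the network and a transition matrix.
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags) scores; transitions[i, j]: score of
    moving from tag i at position t-1 to tag j at position t."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # total[i, j] = best score ending in tag i at t-1, then tag j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return list(reversed(path))

# Toy example with 3 tags (e.g. lang1, lang2, other) and 4 tokens.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```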