Proceedings of the Second Workshop on Computational Approaches to Code Switching 2016
DOI: 10.18653/v1/w16-5806

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

Abstract: This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first place for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second place for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and MSA-Egyptian dialect. The system makes use of word and character level representatio…
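
The abstract describes a unified architecture built on word- and character-level representations, but the implementation itself is not reproduced here. As a rough orientation only, below is a minimal sketch, assuming a PyTorch setting and hypothetical names (CharWordTagger, the dimension sizes, the tag count), of a token-level tagger that concatenates word embeddings with a character-BiLSTM summary before a token-level BiLSTM; it is not the authors' code.

```python
# Minimal sketch (not the authors' system): a token-level language-ID tagger
# that combines word embeddings with a character-level BiLSTM summary.
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    def __init__(self, word_vocab, char_vocab, n_tags,
                 word_dim=100, char_dim=25, char_hidden=25, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # Character BiLSTM produces one vector per token from its characters.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        # Token BiLSTM runs over the concatenated word + character representations.
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, c, -1)
        _, (h, _) = self.char_lstm(chars)                 # final states, both directions
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        states, _ = self.word_lstm(tokens)
        return self.out(states)                           # per-token tag scores

# Example forward pass with dummy indices.
model = CharWordTagger(word_vocab=5000, char_vocab=100, n_tags=6)
scores = model(torch.randint(1, 5000, (2, 7)), torch.randint(1, 100, (2, 7, 12)))
print(scores.shape)  # torch.Size([2, 7, 6])
```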

Cited by 58 publications (44 citation statements); references 19 publications.
“…σ1σ2 (Eq. 9) and is bounded within the interval [-1,1]. Memory values close to -1 describe the tendency for consecutive language spans to be negatively autocorrelated, differing substantially in length; that is, long spans of discourse are followed by short spans of discourse, and short spans are followed by long spans.…”
Section: Memory (mentioning)
confidence: 99%
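
The excerpt refers to a memory coefficient (its Eq. 9) computed over consecutive language span lengths, but the full equation is not quoted above. The sketch below assumes the standard lag-1 autocorrelation form of such a coefficient and is purely illustrative; the function name is hypothetical.

```python
# Hedged sketch: a lag-1 "memory" coefficient over consecutive language span
# lengths, assuming the form M = mean[(t_i - m1)(t_{i+1} - m2)] / (s1 * s2).
# The exact Eq. (9) is not reproduced in the excerpt, so this is illustrative.
import numpy as np

def memory_coefficient(span_lengths):
    spans = np.asarray(span_lengths, dtype=float)
    first, second = spans[:-1], spans[1:]          # consecutive pairs (t_i, t_{i+1})
    s1, s2 = first.std(), second.std()
    if s1 == 0 or s2 == 0:
        return 0.0                                 # no variation: coefficient undefined
    return float(np.mean((first - first.mean()) * (second - second.mean())) / (s1 * s2))

# Long spans followed by short spans (and vice versa) push the value toward -1.
print(memory_coefficient([9, 1, 8, 2, 10, 1, 7, 2]))   # strongly negative
print(memory_coefficient([5, 5, 5, 6, 5, 5, 6, 5]))    # near 0
```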
“…The global rise of social media such as Facebook, Twitter, SMS, and Usenet newsgroups has afforded large quantities of user-generated data that incorporates C-S [5,6,7,8,9]. However, the occurrence of multiple languages within a single text presents significant complexity for automated processing.…”
Section: Introduction (mentioning)
confidence: 99%
“…The task of LID for CS has been frequently studied in recent years (Al-Badrashiny and Diab, 2016; Rijhwani et al., 2017; Zhang et al., 2018), including two shared tasks on the topic (Solorio et al., 2014; Molina et al., 2016). The best systems (Samih et al., 2016; Shirvani et al., 2016) achieved over 90% accuracy for all language pairs. However, intra-word CS was not handled explicitly, and often systems even failed to correctly assign the mixed label.…”
Section: Related Work (mentioning)
confidence: 99%
“…Second, a subword-level model segments words with composed language ID tags. For word-level tagging, we use a hierarchical bidirectional LSTM (BiLSTM) that incorporates both token- and character-level information (Plank et al., 2016), similar to the winning system (Samih et al., 2016) of the Second Code-Switching Shared Task (Molina et al., 2016). For the subword level, we use two supervised segmentation methods: a CRF segmenter proposed by Ruokolainen et al. (2013), which models segmentation as a labeling problem, and a sequence-to-sequence (Seq2Seq) model trained with an auxiliary task, as proposed by Kann et al. (2018).…”
Section: Baselines (mentioning)
confidence: 99%
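
The cited baseline casts subword segmentation as a labeling problem. As a minimal illustration of that framing, and not the cited CRF segmenter itself, the following sketch maps gold segments to per-character B/I labels and back; the helper names and the example word are hypothetical.

```python
# Illustrative sketch: word segmentation as a character-labeling problem,
# where each character receives B (begins a segment) or I (inside a segment).
def segments_to_labels(segments):
    """['geht', 's'] -> per-character labels B I I I B."""
    labels = []
    for seg in segments:
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels

def labels_to_segments(word, labels):
    """Invert the labeling: split the word at every B boundary."""
    segments, current = [], ""
    for ch, lab in zip(word, labels):
        if lab == "B" and current:
            segments.append(current)
            current = ""
        current += ch
    segments.append(current)
    return segments

labels = segments_to_labels(["geht", "s"])
print(labels)                                   # ['B', 'I', 'I', 'I', 'B']
print(labels_to_segments("gehts", labels))      # ['geht', 's']
```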
“…Training and decoding are performed by the Viterbi algorithm. Note that replacing the softmax with a CRF at the output layer in neural networks has proved to be very fruitful in many sequence labeling tasks (Ma and Hovy, 2016; Huang et al., 2015; Lample et al., 2016; Samih et al., 2016).…”
Section: Conditional Random Fields (CRF) (mentioning)
confidence: 99%
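
To make the CRF output layer concrete, the following is a minimal sketch of Viterbi decoding over per-token emission scores and a tag-transition matrix; the function name and toy dimensions are assumptions, not the implementation used in the cited systems.

```python
# Minimal sketch of Viterbi decoding for a linear-chain CRF output layer,
# given per-token emission scores from the network and a transition matrix.
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags) scores; transitions[i, j]: score of
    moving from tag i at position t-1 to tag j at position t."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # total[i, j] = best score ending in tag i at t-1, then tag j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return list(reversed(path))

# Toy example with 3 tags (e.g. lang1, lang2, other) and 4 tokens.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```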