Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture

Mandal, Soumil; Singh, Anil Kumar

doi:10.18653/v1/w18-6116

Cited by 18 publications

(17 citation statements)

References 15 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…An interesting research direction is token-level LID for code-mixed texts (Zhang et al 2018;Mager, Çetinoglu, and Kann 2019;Mandal and Singh 2018). However, fine-grained LID has marginal assistance for CLIR task, since the downstream modules, e.g.…”

Section: Query Language Identificationmentioning

confidence: 99%

Effective Approaches to Neural Query Language Identification

Ren

Yang

Liu

et al. 2022

Computational Linguistics

View full text Add to dashboard Cite

Query language identification (Q-LID) plays a crucial role in cross-lingual search engine. There exist two main challenges in Q-LID: 1) insufficient contextual information in queries for disambiguation; and 2) the lack of query-style training examples for low-resource languages. In this paper, we propose a neural Q-LID model by alleviating the above problems from both model architecture and data augmentation perspectives. Concretely, we build our model upon the advanced Transformer model. In order to enhance the discrimination of queries, a variety of external features, e.g. character, word as well as script, are fed into the model and fused by a multi-scale attention mechanism. Moreover, to remedy the low resource challenge in this task, a novel machine translation based strategy is proposed to automatically generate synthetic query-style data for low-resource languages. We contribute the first Q-LID test set called QID-21, which consists of search queries in 21 languages. Experimental results reveal that our model yields better classification accuracy than strong baselines and existing LID systems on both query and traditional LID tasks.

show abstract

Section: Query Language Identificationmentioning

confidence: 99%

Effective Approaches to Neural Query Language Identification

Ren

Yang

Liu

et al. 2022

Computational Linguistics

View full text Add to dashboard Cite

show abstract

“…Recent works [40] have proposed diverse paths for WLLI. One of them used a multichannel [four channels] neural network, which included three convolutional 1-D networks and a LSTM network; it was tested on Bengali-English and Hindi-English language pairs.…”

Section: Related Workmentioning

confidence: 99%

“…The model achieved impressive accuracy scores of 93.28% on the Bengali data set, and 93.32% on the Hindi data set. Another approach used character encoding and root phone encoding [40] to train LSTM models. Jamatia et al [41] addressed WLLI of codemixed language pairs -English-Hindi and Bengali-Englishfrom Facebook, Twitter, WhatsApp social media sites.…”

Section: Related Workmentioning

confidence: 99%

Transformer Based Language Identification for Malayalam-English Code-Mixed Text

Thara

Poornachandran

2021

IEEE Access

View full text Add to dashboard Cite

Social media users have the proclivity to write majority of the data for under resourced languages in code-mixed format. Code-mixing is defined as mixing of two or more languages in a single sentence. Research in code-mixed text helps apprehend security threats, prevalent on social media platforms. In such instances, language identification is an imperative task of code-mixed text. The focus of this paper is to carry out a word-level language identification (WLLI) of Malayalam-English code-mixed data, from social media platforms like YouTube. This study was centered around BERT, a transformer model, along with its variants -CamemBERT, DistilBERT -for intuitive perception of the language at the word-level. The propounded approach entails tagging Malayalam-English code-mixed data set with six labels: Malayalam (mal), English (eng), acronyms (acr), universal (univ), mixed (mix) and undefined (undef). Newly developed corpus of Malayalam-English was deployed for appraisal of the effectiveness of state-of-the-art models like BERT. Evaluation of the proffered approach, accomplished with other code-mixed language such as Hindi-English, notched a 9% increase in the F1-score.INDEX TERMS Natural language processing, language identification, bidirectional encoder representations from transformers (BERT), text mining, corpus preparation, deep learning, code-mixed

show abstract

“…Sequence classification using RNN has been used in Samih et al (2016). A multichannel convolutional neural network (CNN) combined with a bidirectional long short-term memory (BiLSTM)-CRF module has been used in Mandal and Singh (2018). Sequence to sequence models has also been used in Jurgens et al (2017).…”

Section: Related Studiesmentioning

confidence: 99%

SwitchNet: Learning to switch for word-level language identification in code-mixed social media text

2021

View full text Add to dashboard Cite

Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification show two important observations. First, the local context is an important indicator of the language of a word when a word is valid in multiple languages. Second, considering the word in isolation from its context leads to more effective language classification when a word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that makes use of a dynamic switching mechanism for effective language classification of both words that are borrowed or embedded from other languages as well as words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction either towards the prediction obtained by the contextual information or that obtained by the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extendible to newer languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, as well as an ensemble of the two classifiers.

show abstract

Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture

Cited by 18 publications

References 15 publications

Effective Approaches to Neural Query Language Identification

Effective Approaches to Neural Query Language Identification

Transformer Based Language Identification for Malayalam-English Code-Mixed Text

SwitchNet: Learning to switch for word-level language identification in code-mixed social media text

Contact Info

Product

Resources

About