Proceedings of the First Workshop on Computational Approaches to Code Switching 2014
DOI: 10.3115/v1/w14-3902

Code Mixing: A Challenge for Language Identification in the Language of Social Media

Abstract: In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present …

Cited by 202 publications (159 citation statements)
References: 20 publications

“…Barman, Das et al [25] uses social media data for language identification in mixed script and concluded in favor of supervised learning against the dictionary-based approaches. Nagesh and Ravi [26] gave a way to perform language identification using multi class regression classifiers and was able to get nearly 54% accuracy.…”
Section: Balamurali and Joshi (mentioning)
confidence: 99%
“…In other research works, some ambiguity is left with regard to the words that are present in both English and Bengali either by removing them (Das and Gambäck, 2013) or by classifying them as mixed (Depending on suffixes or word-level mixing) (Barman et al, 2014). However, such ambiguity needs to be removed, if we are required to utilize such type of data for further analysis or use them for building models of sentiment and/or predictive analysis, since people generally use mixed or ambiguous words in some single language context as well, which is why they code-mix in the first place.…”
Section: Related Work (mentioning)
confidence: 99%
“…In both of the other research works mentioned, the groups composed their own corpus from a Facebook group and the posts and comments by members (Das and Gambäck, 2013; Barman et al, 2014). Both of the groups also use N-gram pruning and dictionary checks.…”
Section: Related Work (mentioning)
confidence: 99%
“…(Sharma et al, 2016) addressed the problem of shallow parsing of Hindi-English code-mixed social media text and developed a system for Hindi-English code-mixed text that can identify the language of the words, normalize them to their standard forms, assign them their POS tag and segment into chunks. (Barman et al, 2014) addressed the problem of language identification on Bengali-Hindi-English Facebook comments. They annotated a corpus and achieved an accuracy of 95.76% using statistical models with monolingual dictionaries.…”
Section: Introduction (mentioning)
confidence: 99%