2019
DOI: 10.1609/aaai.v33i01.33019951
Mind Your Language: Abuse and Offense Detection for Code-Switched Languages

Abstract: In multilingual societies like the Indian subcontinent, the use of code-switched languages is popular and convenient. In this paper, we study offense and abuse detection in the code-switched pair of Hindi and English (i.e., Hinglish), the most widely spoken such pair. The task is made difficult by the non-fixed grammar, vocabulary, semantics, and spellings of Hinglish. We apply transfer learning and build an LSTM-based model for hate speech classification. This model surpasses the performance…
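The abstract's description stops at "transfer learning and an LSTM based model," so as a rough illustration of that kind of architecture, here is a minimal PyTorch sketch of an LSTM text classifier. The class name, hyperparameters (embedding size, hidden size, three output classes), and the idea of initializing the embedding layer from pretrained vectors are illustrative assumptions, not the authors' reported configuration.

```python
# Hypothetical sketch of an LSTM-based abuse classifier, in the spirit of the
# abstract's "transfer learning and an LSTM based model"; all sizes below are
# illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

class AbuseClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=3):
        super().__init__()
        # For transfer learning, these weights could be initialized from
        # pretrained (e.g., social-media-domain) word vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of word indices
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)  # final hidden state
        return self.fc(hidden[-1])            # (batch, num_classes) logits

# Toy usage: a batch of two padded 20-token sequences over a 5,000-word vocabulary.
model = AbuseClassifier(vocab_size=5000)
batch = torch.randint(1, 5000, (2, 20))
print(model(batch).shape)  # torch.Size([2, 3])
```

In a real transfer-learning setup, the embedding weights (and possibly the LSTM) would be loaded from a model pretrained on a related corpus and then fine-tuned on the Hinglish data.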

Cited by 17 publications (11 citation statements); References 3 publications
“…To understand the racial and dialectic bias in toxic language detection, we focus our analyses on two corpora of tweets (Davidson et al., 2017; Founta et al., 2018) that are widely used in hate speech detection (Park et al., 2018; van Aken et al., 2018; Kapoor et al., 2018; Alorainy et al., 2018). DWMW17 (Davidson et al., 2017) includes annotations of 25K tweets as hate speech, offensive (but not hate speech), or none. The authors collected data from Twitter, starting with 1,000 terms from HateBase (an online database of hate speech terms) as seeds, and crowdsourced at least three annotations per tweet.…”
Section: Biases in Toxic Language Datasets · Citation type: mentioning
Confidence: 99%
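The quoted statement notes that DWMW17 collected at least three crowdsourced annotations per tweet but does not say how those annotations were reduced to a single label; majority voting is the usual convention for such datasets. A minimal sketch, with hypothetical label strings:

```python
# Aggregate one tweet's crowdsourced annotations by majority vote.
# (Aggregation rule and label names are assumptions, not from the source.)
from collections import Counter

def majority_label(annotations):
    """Return the most frequent label among one tweet's annotations."""
    (label, _count), = Counter(annotations).most_common(1)
    return label

print(majority_label(["offensive", "hate", "offensive"]))  # -> offensive
```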
“…In a similar direction, there has been work on understanding the main intentions behind vulgar expressions in social media (Holgate et al., 2018). Various approaches have been taken to tackle both textual and multimodal data from Twitter and social media in general, in order to build deep learning classifiers for similar tasks (Baghel et al., 2018; Kapoor et al., 2018; Mahata et al., 2018a,b; Jangid et al., 2018; Meghawat et al., 2018; Shah and Zimmermann, 2017). The dataset provided for the tasks was collected through the Twitter API by searching for tweets containing certain selected keyword patterns popular in offensive posts.…”
Section: Related Work · Citation type: mentioning
Confidence: 99%
“…Yet, recent works exhibit efforts towards the diversification of the objects of study. Datasets are created for less-studied languages such as Hinglish [61,139], Bengali [147], and Arabic [62,123], revealing new challenges pertaining to the particular language structures (e.g., in Hinglish the grammar is not fixed and written words use Roman script for spoken words in Hindi [139]; a list of challenges for Arabic is proposed in Al-Hassan et al. [5]), and for less-common social media platforms (e.g., YouTube comments [62,147]).…”
Section: Data Retrieval · Citation type: mentioning
Confidence: 99%