2022
DOI: 10.1007/s10579-022-09583-7

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Abstract: This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators …

Cited by 58 publications (25 citation statements)
References 53 publications
“…For classifying CMCS data, Machine Learning approaches such as Logistic Regression, Support Vector Machines, Multinomial Naive Bayes, K-Nearest Neighbors, Decision Trees, and Random Forest have been used by early research (Chakravarthi et al, 2022). Later, Deep Learning techniques such as CNN, LSTM, and BiLSTM (Kamble and Joshi, 2018) have been widely used.…”
Section: CMCS Text Classification (mentioning)
confidence: 99%
“…CMCS can appear in various forms, including code-switching, inter-sentential, and intra-sentential code-mixing, and texts written in both Latin and native scripts. CMCS text classification corpora have been mainly created with respect to Indian languages such as Hindi-English (Bohra et al, 2018), Telugu-English (Gundapu and Mamidi, 2018), Tamil-English, Kannada-English, and Malayalam-English (Chakravarthi et al, 2022), while there are some corpora for other CMCS languages such as Sinhala-English (Smith and Thayasivam, 2019), Spanish-English (Vilares et al, 2016) and Arabic-English (Sabty et al, 2019). Except for the dataset created by Chakravarthi et al (2022), others have removed the text written in native script and considered only a limited type of code-mixing levels.…”
Section: Annotated Corpora for CMCS Text Classification (mentioning)
confidence: 99%
“…We make use of the multi-lingual dataset, Dravidian-CodeMix 2 [57], consisting of over 60,000 manually annotated YouTube comments. The data set comprises sentences from 3 code-mixed Dravidian languages: Kannada [58], Malayalam [14], and Tamil [6,43].…”
Section: Dataset (mentioning)
confidence: 99%