2022
DOI: 10.48550/arxiv.2204.08398
Preprint

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Abstract: Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data…

Cited by 2 publications
(2 citation statements)
References 12 publications
“…Chakravarthi et al. (2021) and have released datasets encompassing Tamil-English and Malayalam-English code-mixed texts. Nayak and Joshi (2022) have made available Hing-Corpus, a Hindi-English code-mix dataset, and also open-sourced pre-trained models trained on codemix corpora. Srivastava and Singh (2021) provide HinGE, a dataset for the generation and evaluation of code-mixed Hinglish text, and demonstrate techniques for algorithmically creating synthetic Hindi code-mixed texts.…”
Section: Related Work (mentioning)
confidence: 99%
“…We annotate the words in A using a code-mixed language identification tool. Specifically, we use L3Cube-HingLID (Nayak and Joshi, 2022) for this task. A word w_i ∈ A can take either of the three language tags from the set {English, Hindi, Other}.…”
Section: Token-level Language Annotation (TLA) (mentioning)
confidence: 99%
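
The token-level annotation described in the statement above can be sketched roughly as follows. This is only an illustration of the tagging interface, not the citing authors' code: `toy_tagger`, `annotate`, and the two small word lists are hypothetical stand-ins for a trained language-identification model such as L3Cube-HingLID, whose actual API is not shown in the source.

```python
from typing import Callable, List, Tuple

# Tag set taken from the citation statement above.
TAGS = ("English", "Hindi", "Other")

# Hypothetical toy lexicons, for illustration only. A real system would
# query a trained language-identification model instead of word lists.
TOY_HINDI = {"mujhe", "yeh", "bahut", "pasand", "hai", "nahi"}
TOY_ENGLISH = {"movie", "the", "is", "good", "i", "love"}


def toy_tagger(word: str) -> str:
    """Assign one of the three tags to a single word (stand-in tagger)."""
    lowered = word.lower()
    if not lowered.isalpha():
        return "Other"  # punctuation, numbers, emoji, and so on
    if lowered in TOY_HINDI:
        return "Hindi"
    if lowered in TOY_ENGLISH:
        return "English"
    return "Other"  # unknown words fall back to Other in this sketch


def annotate(
    words: List[str], tagger: Callable[[str], str] = toy_tagger
) -> List[Tuple[str, str]]:
    """Pair every word w_i in A with a tag, rejecting tags outside TAGS."""
    tagged = []
    for word in words:
        tag = tagger(word)
        if tag not in TAGS:
            raise ValueError(f"unexpected tag {tag!r} for word {word!r}")
        tagged.append((word, tag))
    return tagged


if __name__ == "__main__":
    A = "mujhe yeh movie bahut pasand hai !".split()
    for word, tag in annotate(A):
        print(f"{word}\t{tag}")
```

Passing the tagger as a parameter keeps the annotation loop independent of the model, so the stand-in could be swapped for a real classifier without changing `annotate`.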