Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)
DOI: 10.18653/v1/2023.findings-ijcnlp.22
My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

Tanmay Chavan,
Omkar Gokhale,
Aditya Kane
et al.

Abstract: The research on code-mixed data is limited due to the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmar…
