Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media 2018
DOI: 10.18653/v1/w18-3503

A Twitter Corpus for Hindi-English Code Mixed POS Tagging

Abstract: Code-mixing is a linguistic phenomenon in which multiple languages are used in the same occurrence, and it is increasingly common in multilingual societies. Code-mixed content on social media is also on the rise, prompting the need for tools to automatically understand such content. Automatic Parts-of-Speech (POS) tagging is an essential step in any Natural Language Processing (NLP) pipeline, but there is a lack of annotated data to train such models. In this work, we present a unique language tagged and POS-tagged d…

citations
Cited by 46 publications
(36 citation statements)
references
References 12 publications
“…We use our system to backtransliterate the Hindi English corpora from the LinCE benchmark. The NER corpus is from Singh et al (2018a) and has 2,079 tweets while the POS tagging corpus is from Singh et al (2018b) and has 1,489 tweets. Some statistics about the datasets are presented in Table 7.…”
Section: Released Datasets (mentioning)
confidence: 99%
“…Different from the previous approaches, Aguilar and Solorio (2020) use language identification to create a code-switching ELMo from English ELMo (Peters et al, 2018). Later they show the effectiveness of their CS-ELMo by achieving state-of-the-art POS tagging results on a Hindi-English dataset (Singh et al, 2018). They also employ multi-task learning where their auxiliary task is language identification with a simplified LID tag set for LID, POS, and NER tagging.…”
Section: Related Work (mentioning)
confidence: 99%
“…We evaluate our models on five downstream tasks in the LinCE Benchmark (Aguilar et al, 2020a). We choose three named entity recognition (NER) tasks, Hindi-English (HIN-ENG), Spanish-English (SPA-ENG) (Aguilar et al, 2018) and Modern Standard Arabic (MSA-EA) (Aguilar et al, 2018), and two part-of-speech (POS) tagging tasks, Hindi-English (HIN-ENG) (Singh et al, 2018b) and Spanish-English (SPA-ENG) (Soto and Hirschberg, 2017). We apply Roman-to-Devanagari transliteration on the Hindi-English datasets since the multilingual models are trained with data using that form.…”
Section: Datasets (mentioning)
confidence: 99%