2020
DOI: 10.1007/978-981-15-1275-9_5

Indian Language Identification for Short Text

Cited by 11 publications (3 citation statements)
References 12 publications
“…Language recognition tools (such as langid.py [8]) solve this problem and use different classification algorithms to solve the problem at the sentence level. Many methods have been utilized in [9–12] to handle the problem of classifying code-mixing by using different frameworks, such as n-gram [13], Malayalam-English used Bi-LSTM and Hindi-English used KNN in [14, 15], parts of speech (POS) [16] on multiple languages pairs, hidden Markov model [17], combined Support Vector Machine and CRFs [2] applied on code-mixed languages pairs such as Spanish-English [18], Dutch-Turkish [19], Maltese-English [20], Romanized Arabic Moroccan (Darija), French-English [21], current standard Egyptian-Arabic dialect [22], English-Mandarin [23, 24], and English-Malay [25]. Balazevic et al in [26] presented the integration of user-specific information to enhance the recognition of Twitter dataset in 16 languages.…”
Section: Related Work
mentioning
confidence: 99%
“…Firstly, code-mixed data involves switching between languages, where different languages are used interchangeably within a sentence or word. These language switches may follow specific patterns or be influenced by contextual factors (Withanage et al 2020; Attia et al 2019). Secondly, code-mixed text often incorporates non-standard spellings, abbreviations, acronyms, emoticons, hashtags, and other informal elements commonly found in social media conversations (Bhaskaran et al 2020). These linguistic phenomena pose challenges to existing POS tagging techniques, which heavily rely on formal language patterns and resources.…”
Section: Introduction
mentioning
confidence: 99%
“…BharatBhasaNet marks a milestone in LID technology. As digital technologies, such as translation, ASR, and conversational interfaces, grow in importance [10], [12], [13], BharatBhasaNet emerges as a vital tool in creating resources for less commonly spoken languages [11]. It effectively navigates challenges such as web data noise, sparse datasets [12], and similarities among high-resource languages [14].…”
mentioning
confidence: 99%