Proceedings of the Second Workshop on Computational Approaches to Code Switching 2016
DOI: 10.18653/v1/w16-5803
Word-Level Language Identification and Predicting Codeswitching Points in Swahili-English Language Data

Abstract: Codeswitching is a very common behavior among Swahili speakers, but of the little computational work done on Swahili, none has focused on codeswitching. This paper addresses two tasks relating to Swahili-English codeswitching: word-level language identification and prediction of codeswitch points. Our two-step model achieves high accuracy at labeling the language of words using a simple feature set combined with label probabilities on the adjacent words. This system is used to label a large Swahili-English int…

Cited by 21 publications (23 citation statements) | References 10 publications
“…Predicting C-S is important for modeling multilingual speech in NLP [19,20,21], in TTS [22,10], and in ASR [23,24,12,25,26,27,28,29]. Of particular importance for our work are findings from ASR that indicate that individual speakers or speakers from different nationalities show different patterns of Mandarin-English switching.…”
Section: Related Work
confidence: 99%
“…However, for the SPA-ENG data set the system by Shirvani et al. (2016) was the best performing at both the tweet and token level evaluations. On the other hand, the system by Samih et al. (2016) was the best performing at both the tweet and token level for the MSA-DA data set.…”
Section: Results
confidence: 93%
“…The best performing system here was Shirvani et al. (2016) with an Avg-F-measure of 91.3% (Table 6: Tweet level performance results). We ranked the systems using the weighted average F-measure, Weighted-F1.…”
Section: Results
confidence: 99%
“…These methods are language-dependent and require large annotated datasets or comprehensive dictionaries of the target languages. For instance, some recent studies (Barman, Wagner, Vyas, Gella, Sharma, Bali, & Choudhury, 2014; Chrupala & Foster, 2014; Dias Cardoso & Roy, 2016; Gella, Sharma, & Bali, 2013; Lavergne, Adda, Adda-Decker, & Lamel, 2014; Piergallini, Shirvani, Gautam, & Chouikha, 2016; Rijhwani, Sequiera, Choudhury, Bali, & Maddila, 2017; Barman, Das, Wagner, & Foster, 2014) used dictionary-based methods for LID at the word level, while other studies (Banerjee et al., 2014; Chittaranjan, Vyas, Bali, & Choudhury, 2014; Dahiya, 2017; Das & Gambäck, 2014; Jaech, Mulcaire, Hathi, Ostendorf, & Smith, 2016; Jhamtani, Bhogi, & Raychoudhury, 2014; King & Abney, 2013; Mandal, Banerjee, Naskar, Rosso, & Bandyopadhyay, 2015; Nguyen & Dogruoz, 2013; Řehůřek & Kolkus, 2009) used a combination of at least two of the following methods: dictionary-based methods, rule-based methods, character n-gram modelling, and heuristics based on word-level feature modelling.…”
Section: Introduction
confidence: 99%