Proceedings of the First Workshop on Computational Approaches to Code Switching 2014
DOI: 10.3115/v1/w14-3912
The IUCL+ System: Word-Level Language Identification via Extended Markov Models

Abstract: We describe the IUCL+ system for the shared task of the First Workshop on Computational Approaches to Code Switching (Solorio et al., 2014), in which participants were challenged to label each word in Twitter texts as a named entity or one of two candidate languages. Our system combines character n-gram probabilities, lexical probabilities, word label transition probabilities and existing named entity recognition tools within a Markov model framework that weights these components and assigns a label. Our appro…
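The abstract describes a Markov model that combines per-word emission scores (character n-grams, lexicon lookups) with label transition probabilities. A minimal illustration of that general idea is Viterbi decoding over word labels; the function name, toy probabilities, and smoothing constant below are invented for illustration and are not the actual IUCL+ components or weights:

```python
from math import log

def viterbi(words, labels, emission, transition, start):
    """Assign one label per word by maximizing summed log-probabilities.

    emission[(label, word)]  -> P(word | label)  (toy lexical probabilities)
    transition[(prev, cur)]  -> P(cur | prev)    (label transition probabilities)
    start[label]             -> P(label at position 0)
    Unseen (label, word) pairs get a small floor probability (1e-6).
    """
    # best[i][lab] = (best score ending in lab at position i, backpointer)
    best = [{} for _ in words]
    for lab in labels:
        best[0][lab] = (log(start[lab]) + log(emission.get((lab, words[0]), 1e-6)), None)
    for i in range(1, len(words)):
        for lab in labels:
            e = log(emission.get((lab, words[i]), 1e-6))
            score, prev = max(
                (best[i - 1][p][0] + log(transition[(p, lab)]) + e, p) for p in labels
            )
            best[i][lab] = (score, prev)
    # Backtrace from the best final label.
    lab = max(best[-1], key=lambda l: best[-1][l][0])
    path = [lab]
    for i in range(len(words) - 1, 0, -1):
        lab = best[i][lab][1]
        path.append(lab)
    return path[::-1]
```

For example, with two labels ("en", "es"), uniform transitions, and toy emissions favoring "the" as English and "casa" as Spanish, decoding ["the", "casa"] yields ["en", "es"]. The actual system additionally folds in character n-gram backoff and NER outputs as weighted emission components.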

Cited by 9 publications (8 citation statements) · References 6 publications
“…As an example, some gapped 4-grams from the word Sterneberg would be Senb, tree, enbr, and reeg. King et al (2014b) used character n-grams as a backoff from Markovian word n-grams. Shrestha (2014) used the frequencies of word-initial n-grams ranging from 3 to the length of the word minus 1.…”
Section: Character Repetition
Mentioning confidence: 99%
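The gapped 4-gram example in the quote above (Sterneberg → Senb, tree, enbr, reeg) is consistent with reading a gapped 4-gram as every second character in a sliding 7-character window. A small sketch under that assumption (the cited work may define the gap pattern differently):

```python
def gapped_4grams(word):
    """Gapped 4-grams: characters at offsets 0, 2, 4, 6 of each window.

    Reproduces the quoted example: 'Sterneberg' -> Senb, tree, enbr, reeg.
    Words shorter than 7 characters yield no gapped 4-grams.
    """
    return [word[i] + word[i + 2] + word[i + 4] + word[i + 6]
            for i in range(len(word) - 6)]
```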
“…• IUCL: The best results obtained by King et al (2014);
• IIIT: The best results obtained by Jain and Bhat (2014);
• CMU: The best results obtained by Lin et al (2014);
• MSR-India: The best results obtained by Chittaranjan et al (2014);
• AIDA: The best results obtained by us using the older version AIDA.…”
Section: Token Level Baselines
Mentioning confidence: 92%
“…Their system outperforms Zaidan and Callison-Burch (2011) and Elfardy and Diab (2013), achieving classification accuracies of 89%, 79%, and 88% on the same Egyptian, Levantine, and Gulf datasets. For token-level dialect identification, King et al (2014) use a language-independent approach that combines character n-gram probabilities, lexical probabilities, word label transition probabilities and existing named entity recognition tools within a Markov model framework.…”
Section: Related Work
Mentioning confidence: 99%
“…But here as well, the small number of labeled instances makes it hard to draw strong conclusions.

(Barman et al, 2014): 0.84 0.85 0.31 0.82 0.03 0 0.823
(Chittaranjan et al, 2014): 0.94 0.86 0.14 0.83 0 0 0.824
(King et al, 2014): 0.84 0.85 0.35 0.81 0 0 0.828
(Bar and Dershowitz, 2014): 0.85 0.87 0.37 0.83 0.03 0 0.839

Table 4: Performance results on language identification at the token level. A '-' indicates there were no tokens of this class in the test set.…”
Section: Results
Mentioning confidence: 99%
“…However, according to the system descriptions provided, not all systems used encoding information. The best performing systems for MAN-EN are (King et al, 2014) and (Chittaranjan et al, 2014). The former slightly outperformed the latter at the Tweet level task (see Figure 1a) while the opposite was true at the token level (see Table 4, rows 4 and 5).…”
Section: Results
Mentioning confidence: 99%