2015
DOI: 10.1007/s10579-015-9317-4

TweetLID: a benchmark for tweet language identification

Abstract: Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (i) distinguishing similar languages, (ii) detecting multilingualism in a single document, and (iii) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the…

Cited by 45 publications
(28 citation statements)
References 30 publications
“…Two shared tasks focused on a narrow group of languages using Twitter data. The first was TweetLID, a shared task on LI of Twitter messages according to six languages in common use in Spain, namely: Spanish, Portuguese, Catalan, English, Galician, and Basque (in order of the number of documents in the dataset) (Zubiaga et al., 2016). The organizers provided almost 35,000 Twitter messages, and in addition to the six monolingual tags, supported four additional categories: undetermined, multilingual (i.e.…”
Section: Shared Tasks (mentioning)
confidence: 99%
“…Two specific tasks for language identification have attracted a lot of research attention in recent years, namely discriminating among closely related languages (Malmasi et al., 2016) and language detection on noisy short texts such as tweets (Zubiaga et al., 2015). The Discriminating between Similar Languages (DSL) workshop (Zampieri et al., 2014; Zampieri et al., 2015; Goutte et al., 2016) is a shared task where participants are asked to train systems to discriminate between similar languages, language varieties, and dialects.…”
Section: Language Identification and Similar Languages (mentioning)
confidence: 99%
“…10 https://github.com/azubiaga/korrika15 11 https://dev.twitter.com/overview/api/tweets 12 https://github.com/azubiaga/twitter-tools 13 https://github.com/pablobarbera/pytwools…”
Section: Analysis (mentioning)
confidence: 99%
“…Since Twitter offers no way to tell which of all the collected tweets are in Basque, developing a tool that can perform this language identification is one of the essential needs for the future. Previously, with the aim of developing such a tool, we made our first attempts in the TweetLID task [13], organized together with, among others, the University of the Basque Country (Euskal Herriko Unibertsitatea) and Elhuyar, and we released an annotated dataset so that researchers could work further on identifying Basque tweets.…”
Section: User Analysis (unclassified)