TweetLID: a benchmark for tweet language identification

Zubiaga, Arkaitz; Vicente, Iñaki San; Gamallo, Pablo; Campos, José Ramom Pichel; Alegria, Iñaki; Aranberri, Nora; Ezeiza, Aitzol; Fresno, Víctor

doi:10.1007/s10579-015-9317-4

Cited by 45 publications

(28 citation statements)

References 30 publications


“…Two shared tasks focused on a narrow group of languages using Twitter data. The first was TweetLID, a shared task on LI of Twitter messages according to six languages in common use in Spain, namely: Spanish, Portuguese, Catalan, English, Galician, and Basque (in order of the number of documents in the dataset) (Zubiaga et al., 2016). The organizers provided almost 35,000 Twitter messages, and in addition to the six monolingual tags, supported four additional categories: undetermined, multilingual (i.e.…”

Section: Shared Tasks (mentioning)

confidence: 99%

Automatic Language Identification in Texts: A Survey

Jauhiainen, Lui, Zampieri, et al. 2019

JAIR

104


Language identification ("LI") is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

LI as a task predates computational methods: the earliest interest in the area was motivated by the needs of translators, and simple manual methods were developed to quickly identify documents in specific languages. The earliest known work to describe a functional LI program for text is by Mustonen (1965), a statistician, who used multiple discriminant analysis to teach a computer how to distinguish, at the word level, between English, Swedish and Finnish. Mustonen compiled a list of linguistically-motivated character-based features, and trained his language identifier on 300 words for each of the three target languages. The training procedure created two discriminant functions, which were tested with 100 words for each language. The experiment resulted in 76% of the words being correctly classified; even by current standards this percentage would be seen as acceptable given the small amount of training material, although the composition of training and test data is not clear, making the experiment unreproducible.

In the early 1970s, Nakamura (1971) considered the problem of automatic LI. According to Rau (1974) and the available abstract of Nakamura's article, his language identifier was able to distinguish between 25 languages written with the Latin alphabet. As features, the method used the occurrence rates of characters and words in each language. From the abstract it seems that, in addition to the frequencies, he used some binary presence/absence features of particular characters or words, based on manual LI.

Rau (1974) wrote his master's thesis "Language Identification by Statistical Analysis" for the Naval Postgraduate School at Monterey, California. The continued interest and the need to use LI of text in military intelligence settings is evidenced by the recent articles of, for example, Rafidha Rehiman et al. (2013), Rowe et al. (2013), and Voss et al. (2014). As features for LI, Rau (1974) used, e.g., the relative frequencies of characters and character bigrams. With a majority vote classifier ensemble of seven classifiers using Kolmogorov-Smirnov's Test of Goodness of Fit and Yule's characteristic (K), he managed...


“…Two specific tasks for language identification have attracted a lot of research attention in recent years, namely discriminating among closely related languages (Malmasi et al., 2016) and language detection on noisy short texts such as tweets (Zubiaga et al., 2015). The Discriminating between Similar Languages (DSL) workshop (Zampieri et al., 2014; Zampieri et al., 2015; Goutte et al., 2016) is a shared task where participants are asked to train systems to discriminate between similar languages, language varieties, and dialects.…”

Section: Language Identification and Similar Languages (mentioning)

confidence: 99%

A Perplexity-Based Method for Similar Languages Discrimination

Gamallo, Campos, Alegria 2017

Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

Self Cite


This article describes the system submitted by the Citius Ixa Imaxin team to the VarDial 2017 DSL and GDI tasks. The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we have tested is a voting system making use of several n-gram models of both words and characters, although word unigrams turned out to be a very competitive model, with reasonable results in the tasks in which we participated. An error analysis identified many test examples with no linguistic evidence to distinguish among the variants.
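
To illustrate the general idea of a perplexity-based language distance mentioned in the abstract above, here is a minimal sketch. It is not the Citius Ixa Imaxin system: the choice of character bigrams, add-one smoothing, the toy training sentences, and the function names (`train_char_ngram_model`, `perplexity`, `identify`) are all assumptions made for illustration. The text is assigned to the language whose model gives it the lowest perplexity.

```python
import math
from collections import Counter


def train_char_ngram_model(text, n=2):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text.lower()
    positions = range(len(padded) - n + 1)
    return {
        "n": n,
        "ngrams": Counter(padded[i:i + n] for i in positions),
        "contexts": Counter(padded[i:i + n - 1] for i in positions),
        "vocab": set(padded),
    }


def perplexity(model, text, vocab_size):
    """Perplexity of `text` under `model`, using add-one smoothing."""
    n = model["n"]
    padded = " " * (n - 1) + text.lower()
    total = len(padded) - n + 1
    if total <= 0:
        raise ValueError("text is too short to score")
    log_prob = 0.0
    for i in range(total):
        gram = padded[i:i + n]
        # Smoothed conditional probability of the last character given its context.
        p = (model["ngrams"][gram] + 1) / (model["contexts"][gram[:-1]] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / total)


def identify(models, text):
    """Return (best_language, scores); lower perplexity means a closer language."""
    # Share one vocabulary across all models so the scores are comparable.
    vocab = set(text.lower())
    for model in models.values():
        vocab |= model["vocab"]
    scores = {lang: perplexity(m, text, len(vocab)) for lang, m in models.items()}
    return min(scores, key=scores.get), scores


# Toy training data (assumed; real systems train on far larger corpora).
models = {
    "en": train_char_ngram_model(
        "the quick brown fox jumps over the lazy dog and the cat sleeps on the mat"),
    "es": train_char_ngram_model(
        "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la alfombra"),
}
best, scores = identify(models, "the cat sleeps on the mat")
print(best, scores)
```

A voting configuration like the one the abstract describes could combine several such models (for example, different values of `n`, or word-level models) and take the majority decision; that extension is left out of this sketch.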


“…10 https://github.com/azubiaga/korrika15 11 https://dev.twitter.com/overview/api/tweets 12 https://github.com/azubiaga/twitter-tools 13 https://github.com/pablobarbera/pytwools…”

Section: Analysis (mentioning)

confidence: 99%

“…Since Twitter offers no way of knowing which of the collected tweets are in Basque, developing a tool that can perform this language identification is one of the essential needs for the future. Previously, with the aim of developing such a tool, we made our first attempts in the TweetLID task [13], organised together with, among others, the University of the Basque Country and Elhuyar, and we released an annotated dataset so that researchers could work further on the identification of Basque tweets.…”

Section: User Analysis (unclassified)

Euskahaldun: euskararen aldeko martxa baten sare sozialetako islaren bilketa eta analisia

Zubiaga

2016

EKAIA

Self Cite


Abstract: This work is motivated by the dearth of research that deals with social media content created from the Basque Country or written in the Basque language. While social fingerprints during events have been analysed in numerous other locations and languages, this article aims to fill this gap so as to initiate a much-needed research area within the Basque scientific community. To this end, we describe the methodology we followed to collect tweets posted during Korrika 2015, the quintessential exhibition race in support of the Basque language, held under the motto «Euskahaldun». We also present the results of the analysis of these tweets. Our analysis shows that the most eventful moments lead to spikes in tweeting activity. Furthermore, we emphasise the importance of having an official account for the event in question, which helps improve the visibility of the event in the social network as well as the dissemination of information to the Basque community. Along with the official account, journalists and news organisations play a crucial role in the diffusion of information. In order to encourage others to perform further research in the field, we make all the tools publicly available.

Keywords: social networks, events, Twitter, attitude, data mining.
