Proceedings of the 10th Linguistic Annotation Workshop Held In Conjunction With ACL 2016 (LAW-X 2016) 2016
DOI: 10.18653/v1/w16-1714

Part of Speech Annotation of a Turkish-German Code-Switching Corpus

Abstract: In this paper we describe our efforts on POS annotation of a code-switching corpus created from Turkish-German tweets. We use Universal Dependencies (UD) POS tags as our tag set. While the German parts of the corpus employ UD specifications, for the Turkish parts we propose annotation guidelines that adopt UD's language-general rules when they are applicable and adapt its principles to Turkish-specific phenomena when they are not. The resulting corpus has POS annotation of 1029 tweets, which is aligned with existing …

Citations: cited by 8 publications (4 citation statements)
References: 22 publications
“…• Code-switched Turkish-German tweets were annotated based on Universal Dependencies POS tags and the authors proposed guidelines for the Turkish parts to adopt language-general heuristics to gather a corpus of 1029 tweets [68].…”
Section: Part Of Speech (POS) Tagging (mentioning)
confidence: 99%
“…They claim that joint modeling of all these tasks is expected to yield better results. [68] presented POS annotation for Turkish-German tweets that align with existing language identification based on POS tags from Universal Dependencies. [69] explored the exploitation of monolingual resources such as taggers (for Spanish and English data)…”
Section: POS Tagging (mentioning)
confidence: 99%
“…In order to successfully process the data available from such sources, linguistic analysis is often helpful (Vilares et al 2017; Mataoui et al 2018), which in turn prompts the use of NLP tools to that end. Despite the ever increasing number of contributions, especially on Part-of-Speech tagging (Gimpel et al 2011; Owoputi et al 2013; Lynn et al 2015; Bosco et al 2016; Çetinoglu and Çöltekin 2016; Proisl 2018; Rehbein et al 2018; Behzad and Zeldes 2020) and parsing (Foster 2010; Petrov and McDonald 2012; Liu et al 2018; Sanguinetti et al 2018), automatic processing of user-generated content (UGC) still represents a challenging task, as is shown by some tracks of the workshop series on noisy user-generated text (W-NUT). UGC is a continuum of text sub-domains that vary considerably according to the specific conventions and limitations posed by the medium used (blog, discussion forum, online chat, microblog, etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…Obviously, with all these factors evolving and the times changing, the research community itself has been taking into account more and more the fact that a new kind of data treatment was necessary and that a renovation regarding annotation formats, guidelines and tagsets was also needed. Especially taking into account the scientific community within computational linguistic studies that deals with morphology, syntax and the contributors of Universal Dependencies, it can be appreciated that in order to successfully process the data available from such sources, an increasing number of contributions, especially on Part-of-Speech tagging [Gimpel et al, 2011, Owoputi et al, 2013, Lynn et al, 2015, Bosco et al, 2016a, Çetinoğlu and Çöltekin, 2016, Proisl, 2018, Rehbein et al, 2018, Behzad and Zeldes, 2020] and parsing [Foster, 2010, Kong et al, 2014, Liu et al, 2018b] has been produced in the last decade. Nevertheless, the automatic processing of user-generated content still represents a challenging task, as it is a continuum of text sub-domains that vary considerably according to the specific conventions and limitations posed by the medium used (blog, discussion forum, online chat, microblog, etc.…”
Section: The UD Framework and Social Media Data (mentioning)
confidence: 99%