2020
DOI: 10.48550/arxiv.2011.02063
Preprint

Treebanking User-Generated Content: a UD Based Overview of Guidelines, Corpora and Unified Recommendations

Abstract: This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to p…

Cited by 8 publications (2 citation statements)
References 16 publications
“…• HaSpeeDe-tw2018 and HaSpeeDe-tw2020, the datasets released during the EVALITA campaigns (Sanguinetti et al, 2020),…”
Section: Related Work (mentioning)
confidence: 99%
“…This applies especially when seeing the considerable amount of variation they contain, not only in the language itself, but also in the usage domains and among the individual language users (cf. Sanguinetti et al 2020). As a matter of fact, some distinctive features of non-standard texts are the broad variation in the structure and punctuation of utterances, namely in the syntax, but also at lexical level due to the use of abbreviations, domain-specific symbols or incorrect derivational forms, as well as code-switching for learners' language.…”
Section: Pos-tagging Standard Vs Nonstandard Language (mentioning)
confidence: 99%