2019
DOI: 10.1017/s1351324919000366
How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts

Abstract: Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated …

Cited by 4 publications (4 citation statements)
References 34 publications
“…Although lexical normalization potentially removes social signals (Nguyen et al., 2021), it has also been shown to boost many downstream NLP tasks, including named entity recognition (Schulz et al., 2016; Plank et al., 2020), POS tagging (Derczynski et al., 2013; Schulz et al., 2016; Zupan et al., 2019), dependency and constituency parsing (Baldwin and Li, 2015; van der Goot et al., 2020; van der Goot and van Noord, 2017), sentiment analysis (Van Hee et al., 2017; Sidarenka, 2019, pp. 79, 122), and machine translation (Bhat et al., 2018).…”
Section: Definition - Lexical Normalization
confidence: 99%
“…Moreover, the CSMT approach has been shown to behave very similarly in a controlled comparison on various types of non-standard data, such as Slovenian user-generated content and historical texts (Ljubešić et al., 2016). It has also been shown to be the preferred way of adapting language technologies to non-standard data when the availability of human supervision is low (Zupan et al., 2019).…”
Section: Related Work
confidence: 99%
“…It is only in recent years that this situation has improved through the development of a language-independent universal part-of-speech tag set (Petrov, Das, and McDonald 2012), a language-independent universal dependency annotation scheme (McDonald et al. 2013), a unified feature-value inventory for morphological features (Zeman 2008), and the subsequent merging of the three schemes within the Universal Dependencies project (Nivre et al. 2016). However, despite these huge harmonization efforts, different annotation traditions still shine through in the currently available corpora (Zupan, Ljubešić, and Erjavec 2019), and as a result even recent research sometimes resorts to some kind of ad hoc label normalization (Rosa et al. 2017).…”
Section: Dari X
confidence: 99%
“…For example, Scherrer and Rabus (2019) represented the input words using a bidirectional character-level long short-term memory (LSTM) recurrent neural network and obtained up to a 13% absolute boost in F1-score compared to using atomic word-level representations. Zupan et al. (2019) also stressed the importance of character-level input representations. In high-resource settings, character-level input representations, which are computationally costly, have lately been replaced by fixed-size vocabularies obtained by unsupervised subword segmentation methods such as byte-pair encodings and word-pieces (Sennrich, Haddow, and Birch 2016; Kudo and Richardson 2018); however, for the moment this change appears to be less relevant in low-resource tagging and parsing settings.…”
Section: Dari X
confidence: 99%