Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing 2017
DOI: 10.18653/v1/w17-1410

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

Abstract: In this paper we present the adaptations of a state-of-the-art tagger for South Slavic languages to non-standard texts on the example of the Slovene language. We investigate the impact of introducing in-domain training data as well as additional supervision through external resources or tools like word clusters and word normalization. We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while e…

Citations: cited by 10 publications (8 citation statements)
References: 10 publications

Citation statements
“…Because of that, our results should be compared to the performance of taggers trained on written data from languages with rich morphology applied to spoken data as test files. As is well known and has been demonstrated numerous times, the use of standard training data for tagging nonstandard, more specifically oral, test data inevitably leads to worse results as compared to standard test data (Ljubešić, Erjavec, & Fišer, 2017).…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…For cases when Brown clustering information is provided to the tagger, the tagger encodes additional features of all binary paths from the root that have even length (Ljubešić et al 2017).…”
Section: Methods (mentioning)
confidence: 99%
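
The Brown clustering feature described in the statement above can be illustrated with a small sketch. It assumes that each word is mapped to a bit-string giving its path from the root of the Brown cluster hierarchy, and that "binary paths from the root that have even length" means the prefixes of that bit-string with 2, 4, 6, ... bits. The function names and the feature string format are hypothetical and are not taken from the tagger's code.

```python
def even_length_prefixes(path: str) -> list[str]:
    """Return every prefix of a Brown cluster bit-string whose length is even.

    `path` is assumed to be the word's path from the root of the cluster
    hierarchy, written as a string of '0' and '1' characters.
    """
    if set(path) - {"0", "1"}:
        raise ValueError(f"not a binary path: {path!r}")
    # Prefix lengths 2, 4, 6, ... up to the full length of the path.
    return [path[:n] for n in range(2, len(path) + 1, 2)]


def cluster_features(path: str) -> list[str]:
    """Turn the even-length prefixes into feature strings (hypothetical format)."""
    return [f"brown[{len(p)}]={p}" for p in even_length_prefixes(path)]


if __name__ == "__main__":
    # Made-up path for illustration only.
    for feature in cluster_features("110100"):
        print(feature)
```

Under this reading, shorter prefixes stand for coarser clusters and longer prefixes for finer ones, so a single word contributes features at several levels of granularity.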
“…In particular, on standard training and testing data ReLDI-tagger obtained 94.27% accuracy on the fine-grained tagset with full morphosyntactic descriptions (Ljubešić and Erjavec 2016) while Bilty (on the same data and a large non-annotated data set used for embedding initialisation) obtained an accuracy of 94.29%. Similarly, when applied to non-standard data, an adapted ReLDI-tagger obtained an accuracy of 87.70% (Ljubešić, Erjavec, and Fišer 2017), while Bilty achieves 88.15%.…”
Section: Introduction (mentioning)
confidence: 99%
“…Whereas South Slavic languages are generally less supported in the NLP, they have been investigated in terms of user-generated content. For example, sentiment classification of Croatian Game reviews and Tweets is investigated in (Rotim and Šnajder, 2017), and (Ljubešić et al, 2017) proposes adapting a standard-text Slovenian POS tagger to tweets, forum posts, and user comments on blog posts and news articles. These languages have been dealt with in machine translation research as well.…”
Section: Related Work (mentioning)
confidence: 99%