Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021)
DOI: 10.18653/v1/2021.wnut-1.45

The Korean Morphologically Tight-Fitting Tokenizer for Noisy User-Generated Texts

Abstract: User-generated texts include various types of stylistic properties, or noises. Such texts are not properly processed by existing morpheme analyzers or language models based on formal texts such as encyclopedias or news articles. In this paper, we propose a simple morphologically tight-fitting tokenizer (K-MT) that can better process proper nouns, coinages, and internet slang among other types of noise in Korean user-generated texts. We tested our tokenizer by performing classification tasks on Korean user-gene…

Citations: cited by 3 publications (1 citation statement)
References: 13 publications
“…Some have tried to tackle it as a text normalization problem during preprocessing (Han et al., 2013; Supranovich and Patsepnia, 2015; Benamar et al., 2021; Demir and Topcu, 2022). Others have attempted to get to the root of the problem by improving the tokenization algorithm or using a character-based model (Hofmann et al., 2021; Lee and Shin, 2021; Tay et al., 2021; Wang et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%