Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.407

Syllable-based Neural Thai Word Segmentation

Abstract: Word segmentation is a challenging preprocessing step for Thai Natural Language Processing due to the lack of explicit word boundaries. The previous systems rely on powerful neural network architecture alone and ignore linguistic substructures of Thai words. We utilize the linguistic observation that Thai strings can be segmented into syllables, which should narrow down the search space for the word boundaries and provide helpful features. Here, we propose a neural Thai Word Segmenter that uses syllable embedd…

Cited by 17 publications (15 citation statements)
References 11 publications
“…Competitive Methods. We evaluate our proposed solution against two state-of-the-art methods namely DeepCut (DC) (Kittinaradorn et al, 2019) and AttaCut (AC) (Chormai et al, 2020). These methods are based on the Convolutional Neural Network (CNN) and trained on a generic corpus (BEST2009 (Boriboon et al, 2009)).…”
Section: Methods (mentioning)
confidence: 99%
“…Evaluation Metrics. We use F1 score as the evaluation metric for the TWS task at character and word levels to avoid the overestimation of TWS (Chormai et al, 2020; Limkonchotiwat et al, 2020). Parameter Settings.…”
Section: Methods (mentioning)
confidence: 99%
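
The character-level and word-level F1 scores mentioned in the statement above can be illustrated with a short sketch. This is not the evaluation code of any cited paper; it assumes one common definition, in which character-level F1 scores the "a word starts here" label on each character and word-level F1 scores exact matches of (start, end) word spans. The function names are hypothetical, and both segmentations are assumed to be lists of non-empty words covering the same string.

```python
def boundary_labels(words):
    """Per-character labels: 1 where a word starts, 0 elsewhere (words must be non-empty)."""
    labels = []
    for w in words:
        labels.extend([1] + [0] * (len(w) - 1))
    return labels


def word_spans(words):
    """Set of (start, end) character offsets, one per word."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f1(true_pos, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def char_level_f1(gold, pred):
    """F1 over word-start labels, compared character by character."""
    g, p = boundary_labels(gold), boundary_labels(pred)
    if len(g) != len(p):
        raise ValueError("gold and pred must segment the same string")
    true_pos = sum(1 for a, b in zip(g, p) if a == 1 and b == 1)
    return f1(true_pos, sum(p), sum(g))


def word_level_f1(gold, pred):
    """F1 over words whose start and end offsets both match."""
    g, p = word_spans(gold), word_spans(pred)
    return f1(len(g & p), len(p), len(g))


# Classic ambiguous string "ตากลม": "ตา|กลม" (gold, assumed) vs "ตาก|ลม" (predicted).
gold = ["ตา", "กลม"]
pred = ["ตาก", "ลม"]
print(char_level_f1(gold, pred))  # 0.5
print(word_level_f1(gold, pred))  # 0.0
```

In this example a single misplaced boundary still earns partial credit at the character level but zero at the word level, which is one reason for reporting both.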
“…We train the χ², t and WPE-based tokenizers for each language on Wikipedia articles for that language. For Thai and Chinese, we use the entire Wikipedia database, but for English we use the filtered Wiki103 dataset (Merity et al, 2016). English, Thai, and Chinese documents are tokenized with NLTK (Bird, 2006), Attacut (Chormai et al, 2020), and Stanford Word Segmenter (Tseng et al, 2005) respectively. We follow the same preprocessing steps for the training and the test documents: lemmatize and lowercase in English, and remove stopwords, symbols and digits for all languages.…”
Section: Methods (mentioning)
confidence: 99%
“…Aroonmanakun (2002) shows that syllable segmentation can resolve many word-level ambiguities in Thai. Plus, automatic syllable segmentation can be done at a near-perfect accuracy because the task is mostly solved by orthographic rules, assuming few typos exist in the data (Chormai et al, 2020). With syllable boundary indicators, we can avoid errors from word segmentation.…”
Section: Data Collection Procedures (mentioning)
confidence: 99%