Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.315
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble

Abstract: Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only the input and output layers of the models, also known as "black boxes". We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our …
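The filter-and-refine idea in the abstract can be illustrated with a minimal sketch: query a black-box segmenter, keep tokens whose confidence passes a threshold (the filter step), and re-segment the low-confidence remainder against a domain lexicon (the refine step). Everything below — the toy segmenter, the confidence heuristic, and the lexicon lookup — is an invented illustration under those assumptions, not the authors' implementation.

```python
def black_box_segment(text):
    """Stand-in for an off-the-shelf segmenter we can only query.
    Toy rule: split into fixed 3-character chunks (deliberately naive)."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]

def token_confidence(tokens):
    """Hypothetical per-token confidence; pretend short tokens are
    low-confidence splits from the base model."""
    return [min(len(t), 3) / 3 for t in tokens]

def filter_and_refine(text, lexicon, threshold=0.9):
    """Filter: trust high-confidence black-box tokens as-is.
    Refine: merge contiguous low-confidence tokens and accept a merged
    span as soon as it matches an entry in the domain lexicon."""
    tokens = black_box_segment(text)
    confs = token_confidence(tokens)
    refined = []
    buffer = ""
    for tok, conf in zip(tokens, confs):
        if conf >= threshold and not buffer:
            refined.append(tok)       # filter step: keep confident token
        else:
            buffer += tok             # refine step: re-segment this span
            if buffer in lexicon:
                refined.append(buffer)
                buffer = ""
    if buffer:                        # flush any unresolved remainder
        refined.append(buffer)
    return refined
```

In the paper's actual setting the refine step is a learned stacked-ensemble model rather than a lexicon lookup; the sketch only shows the control flow of filtering confident outputs and re-deciding the rest.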

Cited by 10 publications (14 citation statements); references 13 publications.
“…Unlike English, a white space character is not considered a reliable word-boundary indicator in several languages [21]. Similarly, Urdu does not have consistent word boundary markings.…”
Section: ء (mentioning)
confidence: 99%
“…• SEFR tokenizer: Stacked Ensemble Filter and Refine tokenizer (engine="best") [Limkonchotiwat et al., 2020], based on probabilities from the CNN-based deepcut [Kittinaradorn et al., 2019], with a vocab size of 92,177 words.…”
Section: Preprocessing (mentioning)
confidence: 99%
“…The following table shows the performance of RoBERTa BASE trained on the Wikipedia-only dataset. There are four variations of tokenization: subword-level with SentencePiece [Kudo and Richardson, 2018], word-level with the PyThaiNLP [Phatthiyaphaibun et al., 2020] dictionary-based tokenizer newmm, subword-level with a CRF-based Thai syllable tokenizer ssg, and the stacked-ensemble, word-level tokenizer sefr [Limkonchotiwat et al., 2020]. For the RoBERTa BASE trained on the Assorted Thai Texts dataset, we only trained with subword tokens built with SentencePiece [Kudo and Richardson, 2018] due to limited computational resources.…”
Section: Language Modeling (mentioning)
confidence: 99%
“…The purpose of a corpus in SA is to train the machine learning models with high accuracy. Unfortunately, most of the corpora available as a resource for SA are in English or other popular languages (Ananiadou et al., 2013; Batista-Navarro et al., 2013; Limkonchotiwat et al., 2020; …).…”
mentioning
confidence: 99%