Thai Word Segmentation with Hidden Markov Model and Decision Tree

Bheganan, Poramin; Nayak, Richi; Xu, Yan

doi:10.1007/978-3-642-01307-2_10

Cited by 11 publications

(4 citation statements)

References 5 publications

“…The first three are essentially sequence tagging models that involve assigning a label or tag to each element in the sequence of input data. For sequence tagging tasks, NLP researchers use several supervised machine learning algorithms, such as HMM (Hidden Markov Model) [27], [28], RNN (Recurrent Neural Networks) [29], [30], [31], [32], [33], and CRF (Conditional Random Fields) [34], [35], [36], [37]. For various sequence tagging tasks, such as Word Segmentation and POS tagging, the CRF usually outperforms the other models.…”

Section: Models (mentioning)

confidence: 99%

NLPashto: NLP Toolkit for Low-resource Pashto Language

Haq, Qiu, Jie et al. 2023

IJACSA

In recent years, natural language processing (NLP) has transformed numerous domains, becoming a vital area of research. However, the focus of NLP studies has predominantly centered on major languages like English, inadvertently neglecting low-resource languages like Pashto. Pashto, spoken by a population of over 50 million worldwide, remains largely unexplored in NLP research, lacking off-the-shelf resources and tools even for fundamental text-processing tasks. To bridge this gap, this study presents NLPashto, an open-source and publicly accessible NLP toolkit specifically designed for Pashto. The initial version of NLPashto introduces four state-of-the-art models for Spelling Correction, Word Segmentation, Part-of-Speech (POS) Tagging, and Offensive Language Detection. The toolkit also includes essential NLP resources like pre-trained static word embeddings, Word2Vec, fastText, and GloVe. Furthermore, we have pre-trained a monolingual language model for Pashto from scratch, using the Bidirectional Encoder Representations from Transformers (BERT) architecture. For the training and evaluation of all the models, we have developed several benchmark datasets and also included them in the toolkit. Experimental results demonstrate that the models exhibit satisfactory performance in their respective tasks. This study can be a significant milestone and will hopefully support and speed-up future research in the field of Pashto NLP.

“…Using TCCs, (Theeramunkong and Usanavasin, 2001) develop a decision tree classifier to determine whether a word should be formed from TCCs, based on a predefined metric. (Aroonmanakun, 2002) presents a two-stage word segmentation that incorporates handcrafted syllable features with dynamic programming to form the most reasonable segmentation, while (Bheganan et al, 2009) use a hidden Markov model to form words that are then verified with a dictionary. Modern approaches to this problem are used by practitioners, but it is unclear which approach is the most accurate or fastest.…”

Section: Related Work (mentioning)

confidence: 99%

Syllable-based Neural Thai Word Segmentation

Chormai, Prasertsom, Cheevaprawatdomrong et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics

Word segmentation is a challenging preprocessing step for Thai Natural Language Processing due to the lack of explicit word boundaries. The previous systems rely on powerful neural network architecture alone and ignore linguistic substructures of Thai words. We utilize the linguistic observation that Thai strings can be segmented into syllables, which should narrow down the search space for the word boundaries and provide helpful features. Here, we propose a neural Thai Word Segmenter that uses syllable embeddings to capture linguistic constraints and uses dilated CNN filters to capture the environment of each character. Within this goal, we develop the first ML-based Thai orthographical syllable segmenter, which yields syllable embeddings to be used as features by the word segmenter. Our word segmentation system outperforms the previous state-of-the-art system in both speed and accuracy on both in-domain and out-domain datasets.

“…They can be grouped into two categories which could be dictionary based or non-dictionary based. Dictionary based approaches include techniques such as longest matching [11] [12] [13] [14], maximum matching [12] [13] [14] and decision tree [15]. Non-dictionary based include rule-based [16], Hidden Markov Model (HMM) [15] [17] and Native Bayesian [14].…”

Section: Background On Thai Word Segmentation (mentioning)

confidence: 99%

Thai word segmentation for visualization of Thai Web sites

Thanadechteemapat, Fung 2011

2011 International Conference on Machine Learning and Cybernetics

Information overload is a problem in the Information Age and Information visualization is an approach to provide an overview of the content of a web site. Tag cloud is one of the ways to represent information as an image of a group of words. However, there are limitations on tag cloud generation, and one of them is due to the characteristics for the language. In order to extract tags or words for tag cloud, word segmentation is required. This paper proposes a Thai word segmentation approach for the visualization of Thai Web sites. The proposed Thai word segmentation technique is based on the longest matching technique together with a refined corpus. The results of Thai word segmentation are compatible with the results from previous BEST's contests in Thailand.
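For readers unfamiliar with the longest matching technique named in the citation statement and abstract above, the following is a minimal sketch of the generic greedy, dictionary-based idea. It is not the implementation from any of the cited papers; the function name `longest_match_segment`, the toy dictionary, and the choice to emit unknown characters one at a time are all assumptions made for illustration.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest matching against a word set.

    Hypothetical helper: `dictionary` is any set of known words.
    """
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Assumption: with no match, emit one character as an unknown token.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy dictionary (an assumption, not a real lexicon).
toy_dictionary = {"ตา", "กลม", "ตาก", "ลม"}
print(longest_match_segment("ตากลม", toy_dictionary))  # → ['ตาก', 'ลม']
```

The string "ตากลม" can also be read as "ตา" + "กลม"; the greedy strategy always commits to the longer first word, which is one reason the statistical approaches mentioned in the statements above (HMM, CRF, neural models) are used to resolve such ambiguity.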
