Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion

Kaing, Hour; Ding, Chenchen; Utiyama, Masao; Sumita, Eiichiro; Sam, Sethserey; Seng, Sopheap; Sudoh, Katsuhito; Nakamura, Satoshi

doi:10.1145/3464378

Cited by 6 publications

(3 citation statements)

References 19 publications


“…Khmer is the official language of Cambodia and is spoken by approximately 17 million speakers. The Khmer script is an abugida system in which each consonant is attached to an inherent, invisible vowel [22]. In the Khmer writing system, there are 33 consonants, 14 independent vowels, 23 dependent vowels, and eight diacritics.…”

Section: Khmer Script As a Representative Of Non-Latin Scripts (mentioning)

confidence: 99%

Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition

Buoy, Iwamura, Srun et al. 2023

IEEE Access


Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries. In addition, from a natural language processing (NLP) perspective, most non-Latin languages are considered low-resource due to the scarcity of large-scale data. This paper presents a convolutional Transformer-based text recognition method for low-resource non-Latin scripts, which uses local two-dimensional (2D) feature maps. The proposed method can handle images of arbitrarily long textlines, which may occur with non-Latin writing without explicit word boundaries, without resizing them to a fixed size by using an improved image chunking and merging strategy. It has a low time complexity in self-attention layers and allows efficient training. The Khmer script is used as the representative of non-Latin scripts because it shares many features with other non-Latin scripts, which makes the construction of an optical character recognition (OCR) method for Khmer as hard as that for other non-Latin scripts. Thus, by analogy with the AI-complete concept, a Khmer OCR method can be considered as one of the non-Latin-complete methods and can be used as a low-resource non-Latin baseline method. The proposed 2D method was trained on synthetic datasets and outperformed the baseline models on both synthetic and real datasets. Fine-tuning experiments using Khmer handwritten palm leaf manuscripts and other non-Latin scripts demonstrated the feasibility of transfer learning from the Khmer OCR method. To contribute to the low-resource language community, the training and evaluation datasets will be made publicly available.

INDEX TERMS: Khmer script, non-Latin scripts, character stacking, no explicit word boundaries, text recognition, image chunking


“…Part-of-Speech Tagging is one of the downstream tasks where different tokenization-based methods are employed in low resource languages [Ding et al. 2019a, 2018; Kaing et al. 2021]. Morphological analysis is used to propose a tokenization system for Kurdish [Ahmadi 2020].…”

Section: Tokenization In Low-resource Languages (mentioning)

confidence: 99%

Impact of Tokenization on Language Models: An Analysis for Turkish

Toraman, Yılmaz, Şahinuç et al. 2022

Preprint


Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.


“…Therefore, to achieve high accuracy using the rule-based approach, an extensive set of rules must be established to account for various scenarios and exceptions (Ding et al, 2018). There is another category of tools known as hybrid systems, which often outperform purely rule-based or statistical approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Part-Of-Speech Tagging for Balochi Language: A Data driven application of Conditional Random Fields

Ullah, Ali, Chandio et al. 2024

ABBDM


Parts-of-Speech (POS) tagging involves the assignment of the correct part of speech or lexical category to individual words within a sentence in a natural language. This procedure holds significance in the field of Natural Language Processing (NLP) and finds utility across a variety of NLP applications. Commonly, it constitutes the initial phase of natural language processing. Subsequent stages may encompass additional tasks such as chunking, parsing and more. Balochi stands as the predominant language in Balochistan, ranking as the fourth most prevalent language in Pakistan. The field of natural language processing for Balochi is still in its nascent stages. In this research, we introduce an algorithm for Balochi part-of-speech tagging, leveraging machine learning techniques. The core of our approach relies on a Conditional Random Field model as the machine learning component. Careful consideration is given to selecting appropriate features for the CRF, taking into account the linguistic characteristics of Balochi. Balochi is currently considered a resource-poor language, and thus, the available manually tagged data consists of only approximately 1500 sentences. The tagset used in this study was created for research purposes and consists of 16 different tags. The learning process incorporates tagged data. The algorithm demonstrates a high accuracy rate of 86.78% when applied to Balochi texts. The training corpus comprises 40000 words, while the test corpus contains 10000 words.
