2021
DOI: 10.1145/3464378

Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion

Abstract: As a highly analytic language, Khmer has considerable ambiguities in tokenization and part-of-speech (POS) tagging processing. This topic is investigated in this study. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released after a series of work over the last 4 years. This is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to esta…

Cited by 6 publications (3 citation statements)
References 19 publications
“…Khmer is the official language of Cambodia and is spoken by approximately 17 million speakers. The Khmer script is an abugida system in which each consonant is attached to an inherent, invisible vowel [22]. In the Khmer writing system, there are 33 consonants, 14 independent vowels, 23 dependent vowels, and eight diacritics.…”
Section: Khmer Script As a Representative Of Non-latin Scripts (mentioning)
confidence: 99%
“…Part-of-Speech Tagging is one of the downstream tasks where different tokenization-based methods are employed in low resource languages [Ding et al. 2019a, 2018; Kaing et al. 2021]. Morphological analysis is used to propose a tokenization system for Kurdish [Ahmadi 2020].…”
Section: Tokenization In Low-resource Languages (mentioning)
confidence: 99%
“…Therefore, to achieve high accuracy using the rule-based approach, an extensive set of rules must be established to account for various scenarios and exceptions (Ding et al, 2018). There is another category of tools known as hybrid systems, which often outperform purely rule-based or statistical approaches.…”
Section: Introduction (mentioning)
confidence: 99%