Learning Distributed Representations of Uyghur Words and Morphemes

Abudukelimu, Halidanmu; Liu, Yang; Chen, Xinxiong; Sun, Maosong; Abulizi, Abudoukelimu

doi:10.1007/978-3-319-25816-4_17

Cited by 4 publications

(4 citation statements)

References 3 publications

“…Due to the agglutinative nature of Uyghur and Kazakh, theoretically, an infinite vocabulary can be generated [7]. As a result, data sparsity in agglutinative languages poses a challenge for downstream NLP tasks, as even small datasets lead to a large vocabulary [5].…”

Section: نىڭكى (mentioning)

confidence: 99%

“…In statistical-based stemming or morphological segmentation tasks for Uyghur and Kazakh languages, features such as syllables [31], part-of-speech, context [19,32,33], phonetic classes, the presence of sound change phenomena, and phonetic features [34] are often selected and added to the model to improve its performance. In deep learning-based models, (Bi)RNN [35], BiLSTM-CRF [36], CNN-BiLSTM-CRF [7], pointer networks [37], and attention mechanism [7,37,38] have been used to learn the labels of the input sequence and distinguish morpheme boundaries. The literature mentioned above have introduced labeling schemes, but these labels are not independent, which can easily lead to model overfitting.…”

Section: Related Work (mentioning)

confidence: 99%

“…They proposed a morphological segmentation model based on a pointer network with a fused attention mechanism, and its segmentation effect is superior to the BiGRU model [35]. Abudukelimu et al [7] also applied the CNN-BiLSTM-CRF model to the morphological segmentation task and compared it with the pointer network [37]; the F1-score improved by 0.33%, comprehensively analyzing typical error types. The model improved the ability to recognize out-of-vocabulary words and low-frequency morphemes.…”

Section: Related Work (mentioning)

confidence: 99%

A Benchmark for Morphological Segmentation in Uyghur and Kazakh

Abudouwaili, Ruzmamat, Abiderexiti et al. 2024, Applied Sciences

Morphological segmentation and stemming are foundational tasks in natural language processing. They have become effective ways to alleviate data sparsity in agglutinative languages because of the nature of agglutinative language word formation. Uyghur and Kazakh, as typical agglutinative languages, have made significant progress in morphological segmentation and stemming in recent years. However, the evaluation metrics used in previous work are character-level based, which may not comprehensively reflect the performance of models in morphological segmentation or stemming. Moreover, existing methods avoid manual feature extraction, but the model’s ability to learn features is inadequate in complex scenarios, and the correlation between different features has not been considered. Consequently, these models lack representation in complex contexts, affecting their effective generalization in practical scenarios. To address these issues, this paper redefines the morphological-level evaluation metrics: F1-score and accuracy (ACC) for morphological segmentation and stemming tasks. In addition, two models are proposed for morpheme segmentation and stem extraction tasks: supervised model and unsupervised model. The supervised model learns character and contextual features simultaneously, then feature embeddings are input into a Transformer encoder to study the correlation between character and context embeddings. The last layer of the model uses a CRF or softmax layer to determine morphological boundaries. In unsupervised learning, an encoder–decoder structure introduces n-gram correlation assumptions and masked attention mechanisms, enhancing the correlation between characters within n-grams and reducing the impact of characters outside n-grams on boundaries. Finally, comprehensive comparative analyses of the performance of different models are conducted from various points of view. 
Experimental results demonstrate that: (1) The proposed evaluation method effectively reflects the differences in morphological segmentation and stemming for Uyghur and Kazakh; (2) Learning different features and their correlation can enhance the model’s generalization ability in complex contexts. The proposed models achieve state-of-the-art performance on Uyghur and Kazakh datasets.

“…From the above discussion, we may state that Turkish NLP studies has to deal with language processing tasks before modelling a solution to the target problem. In general, most-words are composed of many morphemes and they may occur only once on the training data that generates the so called data-sparsity and curse of dimensionality problems [42,43] from computational modelling point of view. It is important to observe that this complexity constrains implementation of state-of-the-art models and algorithms developed for example for English.…”

Section: Turkish Language Modelling Challenges Based On Its Morphological Complexity (mentioning)

confidence: 99%

Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish

et al. 2021

Language model pre-training architectures have demonstrated to be useful to learn language representations. bidirectional encoder representations from transformers (BERT), a recent deep bidirectional self-attention representation from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks with fine-tuning. In this paper, we want to demonstrate the efficiency of BERT for a morphologically rich language, Turkish. Traditionally morphologically difficult languages require dense language pre-processing steps in order to model the data to be suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming and feature engineering tasks are needed to obtain an efficient data model to overcome data sparsity or high-dimension problems. In this context, we selected five various Turkish NLP research problems as sentiment analysis, cyberbullying identification, text classification, emotion recognition and spam detection from the literature. We then compared the empirical performance of BERT with the baseline ML algorithms. Finally, we found enhanced results compared to base ML algorithms in the selected NLP problems while eliminating heavy pre-processing tasks.

Construction of an English-Uyghur WordNet Dataset

Abiderexiti, Sun 2019, Lecture Notes in Computer Science
