On certain aspects of Kazakh part-of-speech tagging

This research aims to identify parts of speech for the Kazakh and Turkish languages in an information retrieval system. The proposed algorithms are based on machine learning techniques. In this paper, we consider the binary classification of words according to parts of speech, and we study the most popular, well-known machine learning algorithms. We defined 7 dictionaries and tagged 135 million words in Kazakh, and 9 dictionaries and 50 million words in Turkish. The main problem considered in the paper is to create algorithms for filling the dictionaries of the so-called Link Grammar Parser (LGP) system, in particular for the Kazakh and Turkish languages, using machine learning techniques. The focus of the research is on the review and comparison of machine learning algorithms and methods that have achieved results on various natural language processing tasks, such as determining grammatical categories. To operate, the LGP system needs a dictionary that specifies a connector for each word, that is, the type of connection that can be created using this word. The authors considered methods of filling in LGP dictionaries using machine learning. The complexities of natural language processing, however, do not exclude the possibility of identifying narrower tasks that can already be solved algorithmically: for example, determining parts of speech or splitting texts into logical groups. However, some features of natural languages significantly reduce the effectiveness of these solutions. Thus, taking into account all word forms for each word in the Kazakh and Turkish languages increases the complexity of text processing by an order of magnitude.

“…Kazcorpus Kazakh language corpus exceeds 135 million words [23,24] and it contains more than 400.000 documents classified into five major genres:…”

Section: Word2vec Algorithm For Parts Of Speech Determination (mentioning)

confidence: 99%

Grammatical categories determination for Turkish and Kazakh languages based on machine learning algorithms and fulfilling dictionaries of link grammar parser

Yerimbetova, Tussupova, Sambetbayeva et al. 2021

EEJET

“…Although several statistical models have been proposed for Kazakh MD, such as HMM- (Makazhanov et al., 2014; Makhambetov et al., 2015; Assylbekov et al., 2016), voted perceptron- (Tolegen et al., 2016) and transformation-based (Kessikbayeva and Cicekli, 2016) taggers, to our knowledge ours is the first deep learning-based approach to the problem that is also purely language independent.…”

Section: Related Work (mentioning)

confidence: 99%

Character-Aware Neural Morphological Disambiguation

Toleu, Tolegen, Makazhanov 2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 2: Short Papers)

Self Cite

We develop a language-independent, deep learning-based approach to the task of morphological disambiguation. Guided by the intuition that the correct analysis should be "most similar" to the context, we propose dense representations for morphological analyses and surface context and a simple yet effective way of combining the two to perform disambiguation. Our approach improves on the language-dependent state of the art for two agglutinative languages (Turkish and Kazakh) and can potentially be applied to other morphologically complex languages.

“…From technical perspective, there is another challenge that concerns mostly Kazakh in its lack of resources for our particular purposes. By and large the language is being actively studied, and there exist monolingual corpora [6,7], and ongoing research on morphological processing [8][9][10][11][12][13] and syntactic parsing [14][15][16]. However, except for a rather small and noisy OPUS corpus [17] there are no Russian-Kazakh parallel corpora 4 and the only tool for automatic morphological disambiguation of Kazakh available to us 5 was reported to have accuracy of 86%, which we considered to be low enough to question the results of experiments with segmentation: would possible misalignments be shortcomings of a chosen segmentation scheme or results of incorrect morphological analysis and disambiguation.…”

Section: Introduction (mentioning)

confidence: 99%

Initial Experiments on Russian to Kazakh SMT

Myrzakhmetov, Makazhanov 2016

RCS

Self Cite

We present our initial experiments on Russian to Kazakh phrase-based statistical machine translation. Following a common approach to SMT between morphologically rich languages, we employ morphological processing techniques. Namely, for our initial experiments, we perform source-side lemmatization. Given a rather humble-sized parallel corpus at hand, we also put some effort in data cleaning and investigate the impact of data quality vs. quantity trade off on the overall performance. Although our experiments mostly focus on source side preprocessing we achieve a substantial, statistically significant improvement over the baseline that operates on raw, unprocessed data.

On certain aspects of Kazakh part-of-speech tagging

Cited by 7 publications

References 17 publications
