Data-Driven Morphological Analysis and Disambiguation for Kazakh

Makhambetov, Olzhas; Makazhanov, Aibek; Sabyrgaliyev, Islam; Yessenbayev, Zhandos

doi:10.1007/978-3-319-18111-0_12

Cited by 12 publications

(9 citation statements)

References 12 publications

“…MC segments are represented as binary vectors that, for a given analysis, encode presence or absence of each morpheme found in the train set. This ensures language independence and contrasts previous work (at least on Turkish and Kazakh), where only certain morphemes are chosen as features depending on their position (Assylbekov et al, 2016; Hakkani-Tür et al, 2002) or presence (Makhambetov et al, 2015) in an analysis, or the authors' intuition (Yildiz et al, 2016; Tolegen et al, 2016; Sak et al, 2007).…”

Section: Introduction (mentioning)

confidence: 86%
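
The binary encoding described in the statement above can be illustrated with a minimal sketch. It assumes an analysis is simply a list of morpheme tags; the tag names (`N`, `PL`, `ACC`, and so on) and the helper names `build_vocab` and `encode` are hypothetical and are not taken from the cited papers.

```python
from typing import Iterable, List, Sequence


def build_vocab(train_analyses: Iterable[Sequence[str]]) -> List[str]:
    """Collect every morpheme tag seen in the training analyses, sorted."""
    return sorted({m for analysis in train_analyses for m in analysis})


def encode(analysis: Sequence[str], vocab: Sequence[str]) -> List[int]:
    """Return a 0/1 vector: 1 where a vocabulary morpheme occurs in the analysis."""
    present = set(analysis)
    return [1 if m in present else 0 for m in vocab]


# Hypothetical training analyses (illustrative tags only).
train = [["N", "PL", "ACC"], ["V", "PAST", "P3"], ["N", "ACC"]]
vocab = build_vocab(train)
vector = encode(["N", "PL"], vocab)
```

Because the vocabulary is built from whatever morphemes occur in the training data, no language-specific feature selection is involved; morphemes unseen in training are simply not represented in the vector.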

“…Although several statistical models have been proposed for Kazakh MD, such as HMM- (Makazhanov et al, 2014; Makhambetov et al, 2015; Assylbekov et al, 2016), voted perceptron- (Tolegen et al, 2016) and transformation-based (Kessikbayeva and Cicekli, 2016) taggers, to our knowledge ours is the first deep learning-based approach to the problem that is also purely language independent.…”

Section: Related Work (mentioning)

confidence: 99%

Character-Aware Neural Morphological Disambiguation

Toleu, Tolegen, Makazhanov

2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 2: Short Papers)

Self Cite

We develop a language-independent, deep learning-based approach to the task of morphological disambiguation. Guided by the intuition that the correct analysis should be "most similar" to the context, we propose dense representations for morphological analyses and surface context and a simple yet effective way of combining the two to perform disambiguation. Our approach improves on the language-dependent state of the art for two agglutinative languages (Turkish and Kazakh) and can be potentially applied to other morphologically complex languages.

“…This is not the same task as the one we are exploring, where the objective is to return the complete set of possible analyses. Similar in spirit is the work on Kazakh morphological analysis by Makhambetov et al (2015). Their system, based on Hidden Markov Models, returns a subset of the analyses of a token which could plausibly occur in a given context.…”

Section: Related Work (mentioning)

confidence: 97%

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Pirinen, Rießler, Rueter, et al.

2018

Preface: The 4th International Workshop on Computational Linguistics for the Uralic Languages (IWCLUL) continues the annual meetings of ACL SIGUR (the Association for Computational Linguistics' Special Interest Group for Uralic Languages) after St. Petersburg (2017), Szeged (2016), and Tromsø (2015). It took place in Helsinki from 8th to 9th January, 2018 and was organized in collaboration with the NLP Research Group at the University of Helsinki.

This year we received a total of 20 submissions, of which we accepted 15 (one of which was withdrawn by the authors), giving a total of 14 high-quality papers in the final proceedings and an acceptance rate of 75%. The accepted papers represent a variety of languages and growing resources in the Uralic landscape: Finnish, Komi-Zyrian, Udmurt, Erzya, Northern Sámi, Pite Sámi, Nganasan and Estonian; topics covered treebanks, parsing, code-switching, language generation, automatic speech recognition, morphology, and typological treatment across all Uralic languages, among others.

During this year's annual meeting we also had the first election of the ACL SIGUR board after the establishment of the new SIG in Szeged in 2016. The current board was re-elected by the ACL SIGUR membership for two further years.

We thank the program committee, local organisers and participants for making annual meetings of ACL SIG for Uralic languages possible.

Abstract: This paper describes the test of a dependency parsing method which is based on bidirectional LSTM feature representations and multilingual word embedding, and evaluates the results on mono- and multilingual data. The results are similar in all cases, with slightly better results achieved using multilingual data. The languages under investigation are Komi-Zyrian and Russian. Examination of the results by relation type shows that some language specific constructions are correctly recognized even when they appear in naturally occurring code-switching data.

Abstract (translated from the Finnish "Tiivistelmä"): The study evaluates a dependency parsing method based on bidirectional LSTM feature representations and a multilingual word embedding model, and evaluates the results on mono- and multilingual data. The results are similar, but slightly higher on multilingual than on monolingual data. The languages studied are Komi-Zyrian and Russian. A more detailed analysis of the results by dependency type shows that certain language-specific relations are recognized correctly even when they occur in sentences containing natural code-switching.

Introduction: Spontaneous speech data of small, endangered languages most commonly contain code-switching, ad-hoc borrowings and other kinds of language contact phenomena originating from the non-target contact language(s). Consequently, spoken corpora originating from such data contain numerous utterances in which linguistic elements from at least two languages co-occur. The most usual occurrences are c...

“…Because dictionary entries are lemmatized, during cleaning we perform lemmatization on both source and target sides of the training set, and later restore the target side of the cleaned data. For target side lemmatization we use a data-driven morphological disambiguator for Kazakh [10]. We implement the models using the Moses toolkit [29], setting the distortion limit parameter to -1 (infinity) to account for long range dependencies and free word order of the languages.…”

Section: Experiments and Evaluation (mentioning)

confidence: 99%

“…From technical perspective, there is another challenge that concerns mostly Kazakh in its lack of resources for our particular purposes. By and large the language is being actively studied, and there exist monolingual corpora [6,7], and ongoing research on morphological processing [8][9][10][11][12][13] and syntactic parsing [14][15][16]. However, except for a rather small and noisy OPUS corpus [17] there are no Russian-Kazakh parallel corpora 4 and the only tool for automatic morphological disambiguation of Kazakh available to us 5 was reported to have accuracy of 86%, which we considered to be low enough to question the results of experiments with segmentation: would possible misalignments be shortcomings of a chosen segmentation scheme or results of incorrect morphological analysis and disambiguation.…”

Section: Introduction (mentioning)

confidence: 99%

Initial Experiments on Russian to Kazakh SMT

Myrzakhmetov, Makazhanov

2016

RCS

Self Cite

We present our initial experiments on Russian to Kazakh phrase-based statistical machine translation. Following a common approach to SMT between morphologically rich languages, we employ morphological processing techniques. Namely, for our initial experiments, we perform source-side lemmatization. Given a rather humble-sized parallel corpus at hand, we also put some effort in data cleaning and investigate the impact of data quality vs. quantity trade off on the overall performance. Although our experiments mostly focus on source side preprocessing we achieve a substantial, statistically significant improvement over the baseline that operates on raw, unprocessed data.
