A Hybrid Statistical Approach to Stemming in Turkish: An Agglutinative Language

Kişla, Tarık; Karaoğlan, Bahar

doi:10.18038/btda.31812

Cited by 4 publications

(3 citation statements)

References 14 publications


“…According to the study, it is possible to derive about 1.5 million different words from a Noun [masa (table)] and from a Verb [oku (read)] only with the use of derivational morphemes [40]. The morphological structure of Turkish word is shown in Figure 1 [41].…”

Section: Turkish Language Modelling Challenges Based On Its Morphological Complexity (mentioning)

confidence: 99%

“…Some samples for morphological productivity of Turkish language are provided in Table 1 [41]. As it is obvious from Table 1, the number of suffixes and their imaginable combinations that can be added to a word generate a serious language analysis problem to obtain actual stem from possible derivations.…”

Section: Turkish Language Modelling Challenges Based On Its Morphological Complexity (mentioning)

confidence: 99%

“…However, the two methods are not interchangeable and it should be carefully examined which one is better for the corresponding language problem. For example, the Turkish words göz (eye), gözlük (eyeglasses), gözlükçü (optician) and gözlem (observation) may all be stemmed from a single word "göz (eye)" losing the semantical information [41,46]. Interestingly, it is apparent that the given Turkish words have distinct English equivalents and this may be a concise comparison of two languages in terms of analysis complexity.…”

Section: Turkish Language Modelling Challenges Based On Its Morphological Complexity (mentioning)

confidence: 99%
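
The over-stemming effect described in the citation statement above can be illustrated with a minimal sketch. This is not the method of the cited paper: `TOY_SUFFIXES` and `toy_stem` are hypothetical names, and the three-entry suffix list is an assumption chosen only to cover the four example words. It shows how naive suffix stripping collapses göz, gözlük, gözlükçü and gözlem onto one stem, discarding the distinctions between their meanings.

```python
# Toy illustration only: a hypothetical suffix list, not the cited paper's method.
# Real Turkish stemmers use far richer morphological rules or statistics.
TOY_SUFFIXES = ("lükçü", "lük", "lem")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching suffix that leaves at least min_stem_len characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["göz", "gözlük", "gözlükçü", "gözlem"]
stems = {w: toy_stem(w) for w in words}

# Every word maps to the same stem, so "eye", "eyeglasses",
# "optician" and "observation" become indistinguishable.
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A lemmatizer or a dictionary-backed analyzer would instead keep gözlük or gözlem as separate entries, which is why the two approaches are not interchangeable.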


Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish

et al. 2021


Language model pre-training architectures have proven useful for learning language representations. Bidirectional encoder representations from transformers (BERT), a recent deep bidirectional self-attention representation learned from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks with fine-tuning. In this paper, we demonstrate the efficiency of BERT for a morphologically rich language, Turkish. Traditionally, morphologically complex languages require heavy language pre-processing steps to make the data suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming, and feature engineering are needed to obtain an efficient data model that overcomes data sparsity or high dimensionality. In this context, we selected five Turkish NLP research problems from the literature: sentiment analysis, cyberbullying identification, text classification, emotion recognition and spam detection. We then compared the empirical performance of BERT with baseline ML algorithms. We found improved results over the baseline ML algorithms on the selected NLP problems while eliminating heavy pre-processing tasks.
