Text categorization plays an important role in the field of Natural Language Processing (NLP). Recently, the rapid growth in the amount of textual data and the need for automatic annotation have made the problem of text categorization more important. As a prominent traditional method, the bag-of-words approach has been successfully applied to the text categorization problem for years. Recently, Neural Network Language Models (NNLM) have achieved successful results on various NLP problems.
The most important advantage of NNLMs is that they provide effective word and document representations. These representations are lower dimensional and have been found to be more effective than traditional ones. They have been exploited successfully for semantic and syntactic analysis. On the other hand, traditional bag-of-words approaches that use long one-hot vector representations are still considered powerful in terms of document classification accuracy. However, these approaches have not previously been compared for the Turkish language. In this study, we compared them through a variety of analyses. We observed that the traditional bag-of-words representation, utilizing an effective feature selection method and a machine learning algorithm aligned with it, has performance comparable to new-generation vector-based methods, namely word embeddings. We conducted various experiments comparing these approaches and designated an effective text categorization architecture for the Turkish language.
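The traditional pipeline the abstract refers to — a bag-of-words representation, an effective feature selection step, and a classifier aligned with it — can be sketched as follows. This is a minimal illustration using scikit-learn and toy data, not the authors' actual setup; the documents, labels, and parameter choices are hypothetical.

```python
# Sketch of a bag-of-words text categorization pipeline:
# count vectors -> chi-squared feature selection -> Naive Bayes.
# Toy data and parameters are illustrative only, not the paper's configuration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = [
    "the match ended in a draw",
    "the team won the championship game",
    "stocks fell sharply after the report",
    "the market rallied on earnings news",
]
labels = ["sports", "sports", "economy", "economy"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),          # sparse bag-of-words count vectors
    ("select", SelectKBest(chi2, k=5)),  # keep the 5 most class-informative terms
    ("clf", MultinomialNB()),            # classifier suited to count features
])
pipeline.fit(docs, labels)
print(pipeline.predict(["the game was a draw"]))
```

With a real corpus, the chi-squared step is where the "effective feature selection" of the study would plug in; the classifier and `k` would be tuned jointly with it.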
Recently, Neural Network Language Models have been effectively applied to many types of Natural Language Processing (NLP) tasks. One popular type of task is the discovery of semantic and syntactic regularities that support researchers in building a lexicon. Word embedding representations are notably good at discovering such linguistic regularities. We argue that two supervised learning approaches based on word embeddings can be successfully applied to the hypernym problem, namely, utilizing embedding offsets between word pairs and learning a semantic projection to link the words. The offset-based model classifies offsets as hypernym or not. The semantic projection approach trains a semantic transformation matrix that ideally maps a hyponym to its hypernym. A semantic projection model can learn a projection matrix provided that there is a sufficient number of training word pairs. However, we argue that such models tend to learn an is-a-particular-hypernym relation rather than to generalize the is-a relation. The embeddings are trained by applying both the Continuous Bag-of-Words and the Skip-Gram training models using a huge corpus of Turkish text. The main contribution of the study is the development of a novel and efficient architecture that is well suited to applying word embedding approaches to the Turkish language domain. We report that both the projection and the offset classification models give promising and novel results for the Turkish language.
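The semantic projection idea described above — training a transformation matrix that maps a hyponym embedding to its hypernym embedding — can be sketched with a least-squares fit. This is a toy illustration on random vectors, not the paper's model or data; the dimensions and the noise-free construction are assumptions for the demo.

```python
# Sketch of the semantic projection approach: learn a matrix W such that
# W @ x (hyponym embedding) approximates y (hypernym embedding).
# Toy random vectors stand in for real word embeddings.
import numpy as np

rng = np.random.default_rng(0)
d = 10            # embedding dimension (toy)
n_pairs = 200     # number of training (hyponym, hypernym) pairs

W_true = rng.normal(size=(d, d))      # hidden "is-a" transformation (for the demo)
X = rng.normal(size=(n_pairs, d))     # hyponym embeddings
Y = X @ W_true.T                      # matching hypernym embeddings (noise-free)

# Least-squares fit: minimize ||X B - Y||_F, then W = B^T so that W @ x ≈ y.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W_hat = B.T

x_new = rng.normal(size=d)            # an unseen hyponym vector
y_pred = W_hat @ x_new                # its predicted hypernym vector
print(np.allclose(y_pred, W_true @ x_new, atol=1e-6))
```

The offset-based alternative mentioned in the abstract would instead feed the difference vector `y - x` of each word pair to a binary classifier that labels it hypernym or not. The overfitting concern raised in the abstract corresponds here to `W_hat` memorizing mappings toward a few frequent hypernyms rather than a general is-a transformation.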