Text categorization based on k-nearest neighbor approach for Web site classification

Kwon, Oh-Woog; Lee, Jong-Hyeok

doi:10.1016/s0306-4573(02)00022-5

Cited by 118 publications

(55 citation statements)

References 17 publications

“…As regards our own work, we achieved an overall accuracy of 83.5% using 401 documents (of varied lengths) with 18 categories by applying the kNN in a novel way. This percentage is still higher than in comparable works [10,34,35].…”

Section: Experiments Results and Evaluation (contrasting)

confidence: 66%

“…The kNN classifier is a relatively simple algorithm compared to more complex approaches like artificial neural networks or support vector machines [9]. This simplicity, robustness, flexibility, and reasonably high accuracies have been exploited in diverse fields such as patent research [10], medical research [11], astrophysics [12], bioinformatics [13], and text categorisation [14,15]. The drawback of kNN lies in the expensive testing of each instance as every new instance must be compared with the whole dataset.…”

Section: Introduction (mentioning)

confidence: 99%
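
The testing cost mentioned in the statement above can be illustrated with a minimal brute-force kNN sketch. It is not taken from the cited or citing papers: the cosine similarity measure, the toy training set, and the helper names `cosine`, `vectorize`, and `knn_predict` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vectorize(text):
    """Hypothetical helper: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def knn_predict(query, train, k=3):
    """Label `query` by majority vote among its k most similar training docs.

    `train` is a list of (vector, label) pairs. Every call scores all of
    `train`, which is the per-query cost the statement refers to.
    """
    scored = sorted(train, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for this sketch.
train = [
    (vectorize("court judge appeal verdict"), "law"),
    (vectorize("judge ruling court sentence"), "law"),
    (vectorize("galaxy star telescope orbit"), "astro"),
    (vectorize("star planet orbit telescope"), "astro"),
]
prediction = knn_predict(vectorize("the court heard the appeal"), train, k=3)
```

Because there is no training step, all the work happens in `knn_predict`: its running time grows linearly with the size of `train`, which is why indexing or dataset reduction is often considered for large collections.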

An innovative multi-segment strategy for the classification of legal judgments using the k-nearest neighbour classifier

Pudaruth

Soyjaudah

Gunputh

2017

Complex Intell. Syst.

The classification of legal documents has been receiving considerable attention over the last few years. This is mainly because of the ever-increasing amount of legal information that is produced daily in the courts of law. In the Republic of Mauritius alone, a total of 141,164 cases were lodged in the different courts in the year 2015. The Judiciary of Mauritius is becoming more efficient thanks to a number of measures that have been implemented, and the number of cases disposed of each year has also risen significantly; however, this is still not enough to keep up with the increase in the number of new cases lodged. In this paper, we used the k-nearest neighbour machine learning classifier in a novel way. Unlike news articles, judgments are complex documents which usually span several pages and contain a variety of information about a case. Our approach consists of splitting the documents into equal-sized segments. Each segment is then classified independently of the others. The predicted category is then selected through a plurality voting procedure. Using this novel approach, we have been able to classify law cases with an accuracy of over 83.5%, which is 10.5% higher than when using the whole documents. To the best of our knowledge, this type of process has never been used earlier to categorise legal judgments or other types of documents. In this work, we also
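
The segment-and-vote procedure described in this abstract might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: segmenting by word count, the default of four segments, the tie-breaking behaviour of `Counter.most_common`, and the keyword-based `toy_classifier` (standing in for a trained kNN model) are all hypothetical.

```python
from collections import Counter


def split_into_segments(words, n_segments):
    """Split a word list into up to n_segments chunks of near-equal size."""
    size, extra = divmod(len(words), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional word each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            segments.append(words[start:end])
        start = end
    return segments


def classify_by_segments(text, classify_segment, n_segments=4):
    """Classify each segment independently, then take a plurality vote."""
    words = text.split()
    if not words:
        raise ValueError("cannot classify an empty document")
    segments = split_into_segments(words, n_segments)
    votes = Counter(classify_segment(segment) for segment in segments)
    return votes.most_common(1)[0][0]


def toy_classifier(segment):
    """Hypothetical per-segment classifier based on a tiny keyword table."""
    keywords = {"contract": "civil", "theft": "criminal", "assault": "criminal"}
    hits = Counter(keywords[w] for w in segment if w in keywords)
    return hits.most_common(1)[0][0] if hits else "unknown"


text = ("theft reported at night assault followed later the "
        "contract was void anyway")
label = classify_by_segments(text, toy_classifier, n_segments=3)
```

In this toy run two of the three segments vote "criminal" and one votes "civil", so the plurality label is "criminal". Any per-segment classifier with the same call shape could be passed in place of `toy_classifier`.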

“…Identifying the topic of knowledge is important in that the topic (or the keyword) indicates the subject of knowledge embedded in the document. To extract the topic of knowledge based on predefined knowledge categories, text mining techniques can be employed [12].…”

Section: Support Vector Machines As the Classifier (mentioning)

confidence: 99%

Capture Knowledge on the Spot: Toward the Autonomous and Pervasive Service of Context-Rich Knowledge

Yoo

2013

Automatika

Original scientific paper

Knowledge must be acquired not only at the moment when it is presented, but also at the site where it is applied. To guarantee the immediate acquisition of context-rich knowledge anytime and anywhere, fully automated as well as pervasive capabilities must be considered together. This paper proposes a methodology to capture knowledge on the spot in an autonomous and pervasive manner by deploying the smartphone as a sensor to monitor and gather dialogue-based knowledge and context data. Smart-ConKAS (SMARTphone-based CONtextual Knowledge Acquisition System), a prototype system, is implemented to validate the proposed concepts.

Key words: Automated knowledge acquisition, Pervasive computing, Autonomous computing, Cloud computing, Smartphone, Knowledge management

Croatian abstract (translated): Capture knowledge on the spot: toward an autonomous and pervasive service of context-rich knowledge. Knowledge is acquired not only at the moment it is presented but also at the place where it is applied. To guarantee the immediate acquisition of knowledge anytime and anywhere, fully automated and pervasive capabilities must be taken into account. This paper proposes a method for acquiring knowledge on the spot in an autonomous and pervasive manner, using a smartphone as a sensor for monitoring and collecting knowledge and data. Smart-ConKAS (SMARTphone-based CONtextual Knowledge Acquisition System) is a prototype used to validate the proposed concept.

Key words (translated): automated knowledge acquisition, pervasive computing, autonomous computing, cloud computing, smartphone, knowledge management

“…The most well-known unsupervised term weighting method is TFIDF [15]. The following supervised term weighting methods are also considered in the paper: Gain Ratio (GR) [3], Confident Weights (CW) [10], Term Second Moment (TM2) [22], Relevance Frequency (RF) [11], Term Relevance Ratio (TRR) [9], and Novel Term Weighting (NTW) [18]; these methods involve information about the classes of the documents. As a rule, the dimensionality for text classification problems is high even after stop-words filtering and stemming.…”

Section: Introduction (mentioning)

confidence: 99%
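
As a rough illustration of the unsupervised TFIDF weighting named in the statement above, the sketch below uses one common variant, raw term frequency multiplied by log(N / document frequency). The citing paper does not say which variant it uses, so the formula, the function name `tfidf`, and the toy utterances are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term of each tokenized document by tf * log(N / df).

    `docs` is a list of token lists. This is one common TF-IDF variant,
    assumed here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weighted


# Toy call-routing utterances, invented for this sketch.
docs = [
    ["call", "billing", "billing"],
    ["call", "internet"],
    ["call", "roaming"],
]
weights = tfidf(docs)
```

A term that occurs in every document, such as "call" here, receives weight zero, while a term concentrated in one document is weighted up. The supervised methods listed in the statement differ in that they also use the class labels of the documents.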

“…Some comparative studies of machine learning algorithms in the field of text classification showed high classification effectiveness of k-NN, SVM-based algorithms, and ANN [2,7,8,10,13].…”

Section: Introductionmentioning

confidence: 99%

Feature Selection for Natural Language Call Routing Based on Self-Adaptive Genetic Algorithm

Коромыслова

Semenkina

Sergienko

2017

IOP Conf. Ser.: Mater. Sci. Eng.

Abstract: The text classification problem for natural language call routing was considered in the paper. Seven different term weighting methods were applied. As a dimensionality reduction method, feature selection based on a self-adaptive GA is considered. k-NN, linear SVM and ANN were used as classification algorithms. The tasks of the research are the following: to study text classification for natural language call routing with different term weighting methods and classification algorithms, and to investigate the feature selection method based on a self-adaptive GA. The numerical results showed that the most effective term weighting is TRR. The most effective classification algorithm is ANN. Feature selection with the self-adaptive GA improves classification effectiveness and significantly reduces dimensionality with all term weighting methods and all classification algorithms.

Introduction

Natural language call routing is an important problem in the design of modern automatic call services, and solving this problem could lead to improvement of the call service [21]. Generally, natural language call routing can be considered as two different problems. The first one is speech recognition of calls and the second one is topic categorization of user utterances for further routing. Topic categorization of user utterances can also be useful for multidomain spoken dialogue system design [12]. In this work we treat call routing as an example of a text classification application.

In the vector space model [16], text classification is considered as a machine learning problem. The complexity of text categorization with a vector space model is compounded by the need to extract numerical data from text information before applying machine learning algorithms. Therefore, text classification consists of two parts: text preprocessing, and application of a classification algorithm to the obtained numerical data.

Text preprocessing comprises three stages:
- Textual feature extraction.
- Term weighting.
- Dimensionality reduction.

The first one is the textual feature extraction based on raw preprocessing of the documents. This process includes deleting punctuation, transforming capital letters to lowercase, and additional procedures such as stop-words filtering [4] and stemming [14]. The stop-words list contains pronouns, prepositions, articles and other words that usually have no importance for the classification. Stemming makes it possible to join different forms of the same word into one textual feature.

The second stage is the numerical feature extraction based on term weighting. For term weighting we use the "bag-of-words" model, in which the word order is ignored. There exist different unsupervised and supervised term weighting methods. The most well-known unsupervised term weighting method is TFIDF [15]. The following supervised term weighting methods are also considered in the paper:
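
The first preprocessing stage described in this abstract (deleting punctuation, lowercasing, stop-word filtering, stemming) could look roughly like the sketch below. The stop-word list and the suffix-stripping `naive_stem` are deliberately tiny stand-ins invented for illustration; the paper cites dedicated stop-word and stemming methods ([4], [14]) that are not reproduced here.

```python
import string

# Illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"i", "my", "the", "a", "an", "to", "with", "is", "of", "and"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(utterance):
    """Delete punctuation, lowercase, drop stop words, then stem."""
    no_punct = utterance.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


features = extract_features("I need help with my billing, the bills are wrong!")
```

Here "billing" and "bills" collapse into the single feature "bill", which is the effect stemming is meant to have. The resulting token list is what a bag-of-words term weighting step would then consume.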
