The amount of data available for analysis, such as web data, is increasing at a dramatic rate. It is therefore important to improve techniques for finding relevant information in such large data collections efficiently. One such technique is text clustering, in which text documents are grouped into clusters, for example grouping web search engine results into meaningful categories. Data mining is an area of computer science that can be defined as the extraction of useful information from large structured data sets. Text mining, on the other hand, is an extension of data mining that deals only with unstructured text data. Text clustering is thus a text mining technique. In this paper, we give an overview of text clustering, including the related areas of text mining, its techniques, and its application areas. We also propose a framework for text clustering based on the K-Means algorithm. The paper thus offers guidance to text mining researchers on the state of the art in text clustering.
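To make the idea of K-Means-based text clustering concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes whitespace tokenisation, TF-IDF weighting with a smoothed IDF, and cosine similarity with unit-length centroids (a variant often called spherical K-Means). The function names, the `seeds` parameter, and the toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into unit-length sparse TF-IDF vectors (term -> weight)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans_text(docs, seeds, max_iter=20):
    """Cluster docs into len(seeds) groups; seeds are indices of the initial centroids."""
    vectors = tfidf_vectors(docs)
    centroids = [dict(vectors[i]) for i in seeds]
    k = len(centroids)
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


docs = [
    "cheap flights and hotel deals",
    "hotel booking and cheap flights",
    "python code for data mining",
    "data mining code in python",
]
labels = kmeans_text(docs, seeds=[0, 2])
print(labels)
```

Fixed seed indices are used here instead of random initialisation so that the result is reproducible. In practice the initial centroids are usually chosen randomly or with a seeding heuristic, and the outcome can depend on that choice.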
Text clustering is the problem of dividing text documents into groups such that documents in one group are more similar to each other than to those in other groups. Although different algorithms have been compared in attempts to identify the most suitable ones, such comparisons have been found to be either too limited or inadequate. Either the researchers (who are usually the authors of one of the algorithms being compared) did not apply a formal comparison methodology, or the comparisons were based on inadequate data, metrics, and procedures. Moreover, the comparisons tend to focus only on the aspects in which the authors' own algorithms are superior to the others. The few competing algorithms appear to be selected because they perform worse on those aspects. Thus, there is still a large gap concerning the most suitable methodology for comparing the algorithms. In this paper, a methodology for fairly comparing text clustering algorithms is proposed.
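The definition above, that documents within a group should be more similar to each other than to documents in other groups, can be checked numerically. The sketch below is only an illustration and is not the comparison methodology proposed in the paper. It assumes Jaccard similarity over whitespace tokens as the similarity measure, and the function names and toy documents are hypothetical. Applying the same measure to the same documents for every candidate clustering is one simple way to keep a comparison on equal footing.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the token sets of two documents."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cohesion_and_separation(docs, labels):
    """Return (mean within-cluster similarity, mean between-cluster similarity)."""
    within, between = [], []
    for i, j in combinations(range(len(docs)), 2):
        sim = jaccard(docs[i], docs[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)

    def average(values):
        return sum(values) / len(values) if values else 0.0

    return average(within), average(between)


docs = [
    "cheap flights and hotel deals",
    "hotel booking and cheap flights",
    "python code for data mining",
    "data mining code in python",
]

# Two candidate clusterings of the same documents, scored with the same measure.
good = cohesion_and_separation(docs, [0, 0, 1, 1])
bad = cohesion_and_separation(docs, [0, 1, 0, 1])
print("topic-aligned clustering:", good)
print("mixed clustering:", bad)
```

A clustering that matches the definition shows higher within-cluster than between-cluster similarity, as the topic-aligned labelling does here, while the mixed labelling shows the opposite.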