Enormous amounts of unstructured data, especially text, are now generated by a wide variety of sources. This volume of data has led researchers to apply data mining techniques to analyse and cluster it so that it can be browsed and searched efficiently. Clustering methods such as the k-means algorithm operate by measuring the relationship between data objects. Accurate clustering therefore depends on the similarity or dissimilarity measure used to evaluate the homogeneity of the documents. A variety of measures have been proposed to date; however, not all of them are suitable for use in the k-means algorithm. In this paper, an extensive study compares and analyses the performance of eight well-known similarity and dissimilarity measures that are applicable to the k-means clustering approach. For the experiments, four text document data sets are used and the results are reported.
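To make the role of the measure concrete, the following is a minimal sketch of how a similarity or dissimilarity measure plugs into the assignment step of k-means for text. The abstract does not name the eight measures studied, so cosine similarity and Euclidean distance are used here only as two familiar examples. The helper names (`term_frequencies`, `nearest_centroid`) and the whitespace tokenisation are assumptions made for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Bag-of-words term counts (assumed: lower-cased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Similarity: cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Dissimilarity: straight-line distance between term-count vectors."""
    # Counter returns 0 for missing terms, so the union of keys is safe.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def nearest_centroid(doc, centroids, dissimilarity):
    """k-means assignment step: index of the least dissimilar centroid."""
    return min(range(len(centroids)),
               key=lambda i: dissimilarity(doc, centroids[i]))


doc = term_frequencies("text clustering text")
centroids = [term_frequencies("stock market"),
             term_frequencies("text clustering")]

# A similarity is turned into a dissimilarity by subtracting it from 1.
by_cosine = nearest_centroid(
    doc, centroids, lambda a, b: 1.0 - cosine_similarity(a, b))
by_euclid = nearest_centroid(doc, centroids, euclidean_distance)
```

In this toy case both measures assign the document to the same centroid. On real collections they can disagree, for example when documents differ greatly in length, which is the kind of effect a comparison of measures is meant to expose.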