Clustering is one of the major issues in data mining, and data labeling has been recognized as an important method in categorical clustering. Clustering is a technique in which similar data points are grouped together, whereas data labeling is applied to points that have not yet been assigned a label. Although many approaches exist in the numerical domain, very few algorithms are available for categorical data. How to allocate unlabeled data points to the proper clusters therefore remains a challenging issue in the categorical domain. In this paper, a mechanism is proposed for labeling similar data points and placing them in the correct clusters. We use a Genome DNA data set in which grouping splice junctions (points on a DNA sequence at which 'superfluous' DNA is removed) is a major challenge. The problem posed by this data set is to recognize, given a sequence of DNA, the boundaries between exons and introns. The new proposal allocates each unlabeled data point to its corresponding cluster by means of data labeling. The method has two advantages: 1) it exhibits high execution efficiency, and 2) it achieves quality clusters. The proposed method is empirically validated on the DNA data set and is shown to be significantly more efficient than prior work while attaining results of high quality.

Keywords: Clustering; Categorical Data; Data Labeling; Outlier; Entropy; Rough Set.

I. INTRODUCTION

In data mining [2], clustering is a major challenge. It is used to group similar objects together [1,3]; these groups are known as clusters. Clustering techniques have been applied extensively in information retrieval systems, medical diagnosis, statistics, pattern recognition, machine learning, etc. A comprehensive treatment of clustering procedures can be found in [3]. A data set may contain numeric, mixed, or categorical data.
Far more procedures are available for numeric data than for the other two data types [5][6]. Clustering categorical data is a complicated task because the distance between data points is not well defined, particularly as the data grows over time. Clustering an enormous data set is difficult because of its complexity and the time the process takes. Sampling [7,8] is another method used to improve the efficiency of clustering: some data points are selected at random for an initial clustering, and a way must then be found to allocate the unlabeled data points (those that were not sampled and not clustered) to suitable clusters. This is called cluster labeling [9,10,11]. In the categorical domain, unlike the numerical domain, finding the class of a data point is not straightforward. In data mining, handling concept drift is time-consuming [12,16]. Clustering of time-evolving data in the numerical domain [1,5,6,10] has been explored in the earlier literature; however, much less has been addressed in the categorical domain. So it still remains a major problem in the ...
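As a rough illustration of the sample-then-label idea described in the introduction, the following Python sketch assigns an unlabeled categorical point to the sampled cluster whose attribute values it matches most often. This is a generic frequency-based stand-in, not the entropy- or rough-set-based method proposed in the paper; the function names, the `threshold` parameter, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def attribute_frequencies(cluster):
    """Per-attribute relative frequency of each categorical value in a cluster."""
    n = len(cluster)
    n_attrs = len(cluster[0])
    return [
        {value: count / n
         for value, count in Counter(row[j] for row in cluster).items()}
        for j in range(n_attrs)
    ]

def similarity(point, freqs):
    """Average frequency, within the cluster, of the point's attribute values."""
    return sum(f.get(v, 0.0) for v, f in zip(point, freqs)) / len(point)

def label_point(point, clusters, threshold=0.0):
    """Return the most similar cluster's name, or None (treated as an outlier)
    when the best similarity does not exceed the hypothetical threshold."""
    scores = {
        name: similarity(point, attribute_frequencies(rows))
        for name, rows in clusters.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

# Toy clusters obtained from a sample (made-up nucleotide triples, not real data).
clusters = {
    "c1": [("A", "G", "T"), ("A", "G", "C"), ("A", "C", "T")],
    "c2": [("C", "T", "G"), ("C", "T", "A"), ("G", "T", "G")],
}

# Label points that were left out of the sample.
print(label_point(("A", "G", "G"), clusters))
print(label_point(("C", "T", "T"), clusters))
```

A point whose values never occur in any sampled cluster scores zero everywhere and is returned as `None`, which is one simple way to treat it as an outlier rather than force it into a cluster.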