Subspace clustering is a challenging high-dimensional data mining task. Several approaches have been proposed in the literature to identify clusters in subspaces; however, their performance and quality are highly sensitive to input parameters. Little research has been done so far on identifying proper parameter values automatically. Other observed drawbacks are the need for multiple database scans, which increases the demand for computing resources, and the generation of many redundant clusters. Here, we propose a parameter-light subspace clustering method for numerical data, hereafter referred to as CLUSLINK. The algorithm is based on the single-linkage clustering method and works in a bottom-up, greedy fashion. The only input the user has to provide is how coarse or fine the resulting clusters should be; if it is not given, the algorithm operates with default values. The empirical results obtained on synthetic and real benchmark datasets show significant improvement in accuracy and execution time.
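To illustrate the idea of single-linkage grouping controlled by a single coarseness setting, here is a minimal sketch in Python. It is not the CLUSLINK algorithm itself, whose details the abstract does not give; it only shows single-linkage clustering on one numerical dimension, where sorted neighbours are merged whenever their gap is within a threshold. The function name `single_linkage_1d` and the parameter `max_gap` are hypothetical names chosen for this example.

```python
def single_linkage_1d(values, max_gap):
    """Single-linkage clustering of 1-D values with a distance cutoff.

    Hypothetical helper for illustration only: two sorted neighbours land in
    the same cluster when their gap is at most max_gap. A larger max_gap gives
    coarser clusters, a smaller one gives finer clusters.
    """
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        # Compare against the last (largest) member of the current cluster.
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


points = [0.10, 0.12, 0.15, 0.80, 0.83, 0.50]
fine = single_linkage_1d(points, max_gap=0.05)    # small gap: finer clusters
coarse = single_linkage_1d(points, max_gap=0.40)  # large gap: coarser clusters
```

With the small threshold the points split into three groups, while the large threshold merges them into one, which mirrors the coarse-versus-fine choice described above.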
Many real-world datasets contain missing values for various reasons. These incomplete datasets can pose severe problems for the underlying machine learning algorithms and decision support systems, resulting in high computational cost, skewed output, and invalid deductions. Various solutions exist to mitigate this issue; the most popular strategy is to estimate the missing values by applying inferential techniques such as linear regression, decision trees, or Bayesian inference. In this paper, the missing data problem is discussed in detail, with a comprehensive review of the approaches to tackle it. The paper concludes with a discussion of the effectiveness of three imputation methods, namely imputation based on Multiple Linear Regression (MLR), Predictive Mean Matching (PMM), and Classification And Regression Tree (CART), in the context of subspace clustering. The experimental results obtained on real benchmark datasets and high-dimensional synthetic datasets indicate that the MLR-based imputation method is more efficient on high-dimensional incomplete datasets.
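As a rough illustration of regression-based imputation, the sketch below fills missing entries of one column with predictions from a multiple linear regression fitted on the complete rows. It is a minimal example under stated assumptions, not the procedure evaluated in the paper: the helper name `impute_mlr` is hypothetical, missing values are assumed to be encoded as NaN, and all predictor columns are assumed to be fully observed.

```python
import numpy as np


def impute_mlr(data, target_col):
    """Fill NaNs in one column using ordinary least squares on the other columns.

    Hypothetical helper for illustration; assumes every predictor column is
    fully observed and that only target_col contains NaN values.
    """
    data = np.asarray(data, dtype=float).copy()
    predictors = [c for c in range(data.shape[1]) if c != target_col]
    missing = np.isnan(data[:, target_col])

    # Design matrix with an intercept column followed by the predictors.
    X = np.column_stack([np.ones(len(data)), data[:, predictors]])

    # Fit the regression on rows where the target is observed.
    coef, *_ = np.linalg.lstsq(X[~missing], data[~missing, target_col], rcond=None)

    # Replace each missing target with its regression prediction.
    data[missing, target_col] = X[missing] @ coef
    return data


# Toy data generated from y = 1 + 2*x1 + 3*x2, with the last y missing.
rows = [
    [1.0, 0.0, 3.0],
    [0.0, 1.0, 4.0],
    [1.0, 1.0, 6.0],
    [2.0, 1.0, 8.0],
    [2.0, 3.0, np.nan],
]
completed = impute_mlr(rows, target_col=2)
```

Because the toy data follow an exact linear relationship, the imputed entry recovers the value that relationship implies; on real data the prediction is only an estimate, and the imputed values carry no added noise.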