Clustering validation has long been recognized as essential to the success of clustering applications. In general, clustering validation falls into two classes: external clustering validation and internal clustering validation. In this paper, we focus on internal clustering validation and present a study of 11 widely used internal clustering validation measures for crisp clustering. The results of this study indicate that these existing measures have certain limitations in different application scenarios. As an alternative, we propose a new internal clustering validation measure, the clustering validation index based on nearest neighbors (CVNN), which relies on the notion of nearest neighbors. This measure can dynamically select multiple objects as representatives for different clusters in different situations. Experimental results show that CVNN outperforms the existing measures on both synthetic data and real-world data in different application scenarios.
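The abstract above does not give the CVNN formula, so the following is only a minimal sketch of one way a nearest-neighbor-based separation score could be computed for a crisp clustering. It is not presented as the paper's definition. The function name `knn_separation`, the parameter `k`, and the choice of Euclidean distance are assumptions made for illustration. For each object, the sketch counts how many of its `k` nearest neighbors carry a different cluster label, averages that fraction within each cluster, and reports the worst (largest) cluster average; lower values suggest better-separated clusters.

```python
import math
from collections import defaultdict


def knn_separation(points, labels, k):
    """Hypothetical nearest-neighbor separation score (illustrative only).

    points: list of equal-length numeric tuples
    labels: list of cluster labels, one per point (crisp clustering)
    k:      number of nearest neighbors to inspect per point (assumed parameter)

    Returns the largest per-cluster average fraction of neighbors that
    belong to a different cluster. Lower means better separated.
    """
    n = len(points)
    if len(labels) != n:
        raise ValueError("points and labels must have the same length")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    per_cluster = defaultdict(list)
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        # Count neighbors among the k closest that have a different label.
        foreign = sum(1 for j in others[:k] if labels[j] != labels[i])
        per_cluster[labels[i]].append(foreign / k)

    # Worst cluster: highest average share of foreign neighbors.
    return max(sum(w) / len(w) for w in per_cluster.values())


# Two well-separated groups of three points each (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # labels follow the geometry
bad_labels = [0, 1, 0, 1, 0, 1]    # labels ignore the geometry

good_score = knn_separation(points, good_labels, k=2)
bad_score = knn_separation(points, bad_labels, k=2)
print(good_score, bad_score)
```

On this toy data, the labeling that follows the geometry scores lower than the labeling that ignores it. A full internal validation measure would typically combine a separation term like this with a compactness term, which this sketch omits.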
Objects that are interrelated with each other are often represented as homogeneous networks, in which objects are of the same entity type and relationships between objects are of the same relationship type. However, heterogeneous information networks, composed of multiple types of objects and/or relationships, are ubiquitous in real life. Mining heterogeneous information networks is a new and promising field of research in data mining, and clustering is an important way to identify underlying patterns in data. Although clustering on homogeneous networks has been studied for several decades, clustering on heterogeneous networks has been explored only recently. Even so, progress has already been made on this topic, ranging from algorithms to various related applications. This paper presents a brief summary of current research on heterogeneous network clustering and identifies some promising research directions. First, it presents a formal definition and two important aspects of heterogeneous information networks to explain why clustering on heterogeneous networks matters. Then, it provides a concise classification of existing heterogeneous network clustering algorithms based on their methodological principles. In addition, it discusses experimental developments and applications of heterogeneous network clustering. Finally, the paper addresses several open problems and critical issues for future research. WIREs Data Mining Knowl Discov 2014, 4:213–233. doi: 10.1002/widm.1126
This article is categorized under:
Algorithmic Development > Structure Discovery
Technologies > Computational Intelligence
Technologies > Structure Discovery and Clustering