We define a heterogeneous dataset as a set of complex objects, that is, those defined by several data types including structured data, images, free text or time series. We envisage this could be extensible to other data types. There are currently research gaps in how to deal with such complex data. In our previous work, we have proposed an intermediary fusion approach called SMF which produces a pairwise matrix of distances between heterogeneous objects by fusing the distances between the individual data types. More precisely, SMF aggregates partial distances that we compute separately from each data type, taking into consideration uncertainty. Consequently, a single fused distance matrix is produced that can be used to produce a clustering using a standard clustering algorithm. In this paper we extend the practical work by evaluating SMF using the k-means algorithm to cluster heterogeneous data. We used a dataset of prostate cancer patients where objects are described by two basic data types, namely: structured and time-series data. We assess the results of clustering using external validation on multiple possible classifications of our patients. The result shows that the SMF approach can improved the clustering configuration when compared with clustering on an individual data type
Abstract. This paper introduces Hk -medoids, a modified version of the standard kmedoids algorithm. The modification extends the algorithm for the problem of clustering complex heterogeneous objects that are described by a diversity of data types, e.g. text, images, structured data and time series. We first proposed an intermediary fusion approach to calculate fused similarities between objects, SMF, taking into account the similarities between the component elements of the objects using appropriate similarity measures. The fused approach entails uncertainty for incomplete objects or for objects which have diverging distances according to the different component. Our implementation of Hk -medoids proposed here works with the fused distances and deals with the uncertainty in the fusion process. We experimentally evaluate the potential of our proposed algorithm using five datasets with different combinations of data types that define the objects. Our results show the feasibility of the our algorithm and also they show a performance enhancement when comparing to the application of the original SMF approach in combination with a standard k -medoids that does not take uncertainty into account. In addition, from a theoretical point of view, our proposed algorithm has lower computation complexity than the popular PAM implementation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.