Online Embedding and Clustering of Data Streams

Zubaroğlu, Alaettin; Atalay, Volkan

doi:10.1145/3372454.3372481

Cited by 2 publications

(2 citation statements)

References 10 publications


“…We adjust these algorithms to apply them on data streams and we perform clustering on the embedded data. In this particular case, when compared against t-SNE, UMAP shows better performance in terms of execution time and silhouette score of the embedded data, and therefore, UMAP is more suitable for data streams [4]. This finding is also supported by Bahri et al [5] where they survey dimensionality reduction techniques and empirically compare five of them, as applied on data streams.…”

Section: Introduction (mentioning)

confidence: 54%

Online embedding and clustering of evolving data streams

Zubaroğlu and Atalay, 2022

Statistical Analysis

Self Cite


The number of connected devices is steadily increasing, and this trend is expected to continue in the near future. Connected devices continuously generate data streams, which are often high dimensional and may contain concept drift. Clustering is one of the most suitable methods for real-time data stream processing, since it can be applied with less prior information about the data. Data embedding makes the visualization of high dimensional data possible and may simplify the clustering process. Several data stream clustering algorithms exist in the literature; however, no data stream embedding method exists. Uniform Manifold Approximation and Projection (UMAP) is a data embedding algorithm that is suitable for stationary (stable) data streams, though it cannot adapt to concept drift. In this study, we describe a novel method, EmCStream, that applies UMAP to evolving (nonstationary) data streams, detects and adapts to concept drift, and clusters the embedded data instances using a distance or partitioning-based clustering algorithm. We have evaluated EmCStream against state-of-the-art stream clustering algorithms using both synthetic and real data streams containing concept drift. EmCStream outperforms DenStream and CluStream, in terms of clustering quality, on both synthetic and real evolving data streams.
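The citation statement above compares embeddings by the silhouette score of the embedded data. As an illustration of what that score measures, here is a minimal standard-library sketch, assuming Euclidean distance and points given as coordinate tuples. The function name `silhouette_score` and the toy data are hypothetical and are not taken from the cited papers.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy 2-D "embedding" (assumed data): two tight, well-separated groups.
embedded = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(embedded, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(embedded, [0, 1, 0, 1])   # labels cut across the groups
print(good > bad)
```

Scores lie in [-1, 1]; values near 1 indicate compact, well-separated clusters, so a higher score on the embedded data suggests that the embedding keeps the clusters apart.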


“…Notably, while Distance Consistency (DSC) [59] was designed for DR visual quality evaluation [19,56,58], it can also be viewed as a CVM since it considers only the separation of class labels in the embeddings. EVM-based evaluation: Given $Z$, $\delta$, $P_L$, and a clustering technique $C$ providing a partition $P_C = C(Z, \delta)$ of the embedded data, $m_E(P_C, P_L)$ represents CLM between $P_L$ and $Z$. K-Means and the adjusted Rand index are commonly used for $C$ and $m_E$, respectively [31,71,74].…”

Section: Using CVM to Evaluate CLM (mentioning)

confidence: 99%

Classes are not Clusters: Improving Label-based Evaluation of Dimensionality Reduction

Jeon, Kuo, Aupetit, et al., 2023

IEEE Trans. Visual. Comput. Graphics


A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into multiple separated clusters, and multiple classes can be merged into a single cluster. We thus cannot always assure the credibility of the evaluation using class labels. In this paper, we introduce two novel quality measures, Label-Trustworthiness and Label-Continuity (Label-T&C), advancing the process of DR evaluation based on class labels. Instead of assuming that classes are well-clustered in the original space, Label-T&C work by (1) estimating the extent to which classes form clusters in the original and embedded spaces and (2) evaluating the difference between the two. A quantitative evaluation showed that Label-T&C outperform widely used DR evaluation measures (e.g., Trustworthiness and Continuity, Kullback-Leibler divergence) in terms of the accuracy in assessing how well DR embeddings preserve the cluster structure, and are also scalable. Moreover, we present case studies demonstrating that Label-T&C can be successfully used for revealing the intrinsic characteristics of DR techniques and their hyperparameters.
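The citation statement for this paper names the adjusted Rand index as a common choice for comparing a clustering partition of the embedded data with the class-label partition. A minimal standard-library sketch of that index follows. The function name `adjusted_rand_index` and the toy label lists are hypothetical, and the clustering step (for example K-Means) is assumed to have already produced `cluster_partition`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Pairs of items placed together in both partitions (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)


# Assumed toy data: class labels and a clustering result for six embedded points.
class_labels = ["a", "a", "a", "b", "b", "b"]
cluster_partition = [1, 1, 1, 0, 0, 0]  # same grouping, different label names
print(adjusted_rand_index(cluster_partition, class_labels))
```

The index ignores label names and compares only which items are grouped together; it is 1 for identical groupings and close to 0 for groupings that agree no more than chance would predict.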
