State-of-the-art on clustering data streams

Ghesmoune, Mohammed; Lebbah, Mustapha; Azzag, Hanene

doi:10.1186/s41044-016-0011-3

Cited by 69 publications

(33 citation statements)

References 52 publications


“…Four most commonly used data structures are feature vectors, prototype arrays, coreset trees and grids. Feature vectors keep the summary of the data instances, prototype arrays keep only a number of representative instances that exemplify the data, coreset trees keep the summary in a tree structure and grids keep the data density in the feature space (Ghesmoune et al, 2016;Mansalis et al, 2018;Silva et al, 2013).…”

Section: Data Structures For Data Streams (mentioning)

confidence: 99%
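
As a rough illustration of the grid idea described in the statement above (keeping data density per region of the feature space), here is a minimal Python sketch. It is not taken from any of the cited papers; the class name `GridSummary`, the `cell_size` parameter, and the `min_points` threshold are all hypothetical choices made for this example.

```python
from collections import Counter


def grid_cell(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


class GridSummary:
    """Hypothetical grid summary: stores one counter per non-empty cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.counts = Counter()

    def insert(self, point):
        # Only the cell counter is updated; the raw point is not stored.
        self.counts[grid_cell(point, self.cell_size)] += 1

    def dense_cells(self, min_points):
        # Cells holding at least `min_points` points are treated as dense.
        return {cell for cell, n in self.counts.items() if n >= min_points}


summary = GridSummary(cell_size=1.0)
for p in [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.2, 5.1)]:
    summary.insert(p)
```

Memory use depends on the number of occupied cells rather than on the number of points seen, which is why a structure of this kind suits unbounded streams.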

“…Partitioning based algorithms have an easy implementation in general. StreamLSearch (O'Callaghan et al, 2002), incremental k-means (Ordonez, 2003), CluStream (Aggarwal et al, 2003), HP-Stream (Aggarwal et al, 2004), SWClustering (Zhou et al, 2008), StreamKM++ (Ackermann et al, 2012), strAP (Zhang et al, 2014) and CLARA (Kaufman and Rousseeuw, 1990) are partitioning based algorithms (Ghesmoune et al, 2016;Kumar, 2016;Mousavi et al, 2015). -Grid based algorithms use grid data structure.…”

Section: Stream Clustering Algorithms (mentioning)

confidence: 99%


Data stream clustering: a review

2020


Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and it has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems about data stream clustering are also discussed.

Keywords: Data streams • Data stream clustering • Real-time clustering

1 Introduction: More devices including sensors are becoming interconnected and interconnected devices continuously generate streams of data at high speed. Offline processing of


“…It is an incremental and dynamic clustering algorithm that follows a hierarchical clustering technique for databases by incrementally constructing a clustering feature (CF) tree, which is a subcluster of data points or better described as a tree-like representation of data points in a data set. 22 Best clustering is achieved by multi-scanning, and having more available memory which maximizes good result. 11 BIRCH is an incremental clustering algorithm that has 4 phases.…”

Section: Balanced Iterative and Clustering Using Hierarchies (mentioning)

confidence: 99%
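
The clustering feature (CF) mentioned in the statement above is commonly described as a triple of point count, linear sum, and sum of squared norms. The sketch below assumes that standard formulation and is not code from the cited paper; it shows only the summary and its additive merge, not the tree construction or the four BIRCH phases. The class and method names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    n: int            # number of points summarized
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: merging two subclusters needs no raw points.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # Root mean squared distance of the points from the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))


cf = ClusteringFeature.from_point((0.0, 0.0)).merge(
    ClusteringFeature.from_point((2.0, 0.0))
)
```

Because the merge is a component-wise addition, a new point can be absorbed in constant time per dimension, which is the property that makes the summary incremental.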

Gene-Based Clustering Algorithms: Comparison Between Denclue, Fuzzy-C, and BIRCH

Nwadiugwu

2020

Bioinform Biol Insights


The current study seeks to compare 3 clustering algorithms that can be used in gene-based bioinformatics research to understand disease networks, protein-protein interaction networks, and gene expression data. Denclue, Fuzzy-C, and Balanced Iterative and Clustering using Hierarchies (BIRCH) were the 3 gene-based clustering algorithms selected. These algorithms were explored in relation to the subfield of bioinformatics that analyzes omics data, which include but are not limited to genomics, proteomics, metagenomics, transcriptomics, and metabolomics data. The objective was to compare the efficacy of the 3 algorithms and determine their strength and drawbacks. Result of the review showed that unlike Denclue and Fuzzy-C which are more efficient in handling noisy data, BIRCH can handle data set with outliers and have a better time complexity.


“…In Ghesmoune et al (2016) the authors discuss 19 algorithms and are among the first to highlight the research area of Neural Gas (NG) for stream clustering. However, only a single grid-based algorithm is discussed and other popular algorithms are missing.…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

Carnein

Trautmann

2019

Bus Inf Syst Eng


Analyzing data streams has received considerable attention over the past decades due to the widespread usage of sensors, social media and other streaming data sources. A core research area in this field is stream clustering which aims to recognize patterns in an unordered, infinite and evolving stream of observations. Clustering can be a crucial support in decision making, since it aims for an optimized aggregated representation of a continuous data stream over time and allows to identify patterns in large and high-dimensional data. A multitude of algorithms and approaches has been developed that are able to find and maintain clusters over time in the challenging streaming scenario. This survey explores, summarizes and categorizes a total of 51 stream clustering algorithms and identifies core research threads over the past decades. In particular, it identifies categories of algorithms based on distance thresholds, density grids and statistical models as well as algorithms for high dimensional data. Furthermore, it discusses applications scenarios, available software and how to configure stream clustering algorithms. This survey is considerably more extensive than comparable studies, more up-to-date and highlights how concepts are interrelated and have been developed over time.
