Micro-Batching Growing Neural Gas for Clustering Data Streams Using Spark Streaming

Ghesmoune, Mohammed; Lebbah, Mustapha; Azzag, Hanene

doi:10.1016/j.procs.2015.07.290

Cited by 18 publications

(9 citation statements)

References 8 publications


“…Among them, Spark [20] highlights as one of the most flexible and powerful engines to performed faster distributed computing in big data by using in-memory primitives. This platform allows user programs to load data into memory and query it repeatedly, making it more suitable for online, iterative or data streams algorithms [21].…”

Section: A C C E P T E D M (mentioning)

confidence: 99%

kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data

Maillo, Ramírez, Triguero, et al. 2017

Knowledge-Based Systems


Abstract: The k-Nearest Neighbors classifier is a simple yet effective, widely renowned method in data mining. The actual application of this model in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit with newly arising technologies.

In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify big amounts of unseen cases against a big training dataset. The map phase computes the k-nearest neighbors in different training data splits. Afterwards, multiple reducers process the definitive neighbors from the list obtained in the map phase. The key point of this proposal lies in the management of the test set, keeping it in memory when possible. Otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk, using the caching skills of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work an open-source Spark package is available.
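The map and reduce phases described in this abstract can be illustrated with a minimal single-machine sketch. It is not the kNN-IS package or its Spark code: the function names `map_knn`, `reduce_knn` and `classify`, the Euclidean distance, and the majority vote are all assumptions made for illustration, and plain Python lists stand in for the training data splits.

```python
import heapq
import math
from functools import reduce

def map_knn(train_split, test_point, k):
    """Map phase (sketch): k nearest candidates inside one training split.

    train_split is a list of (features, label) pairs; Euclidean distance
    is an assumption, the abstract does not name the metric.
    """
    dists = ((math.dist(x, test_point), label) for x, label in train_split)
    return heapq.nsmallest(k, dists)

def reduce_knn(cands_a, cands_b, k):
    """Reduce phase (sketch): merge two candidate lists, keep the k closest."""
    return heapq.nsmallest(k, cands_a + cands_b)

def classify(train_splits, test_point, k):
    """Run the map step per split, merge the results, then majority-vote."""
    partials = [map_knn(split, test_point, k) for split in train_splits]
    best = reduce(lambda a, b: reduce_knn(a, b, k), partials)
    votes = {}
    for _, label in best:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical toy data: two training splits of (features, label) pairs.
splits = [
    [((0.0, 0.0), "a"), ((1.0, 1.0), "a")],
    [((5.0, 5.0), "b"), ((0.5, 0.5), "a"), ((6.0, 6.0), "b")],
]
print(classify(splits, (0.0, 0.0), k=3))
```

Because each split returns its own k best candidates, the merged list always contains the true k nearest neighbors over the whole training set, which is what makes the result exact rather than approximate.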


“…Moreover, it cannot deal with categorical data with the conventional principal component analysis. Although other distributed clustering algorithms for heterogeneous datasets are proposed, e.g., OPTICS algorithm [21] and the SDBDC algorithm [17], these methods assume clusters of similar density, and may have problems separating nearby clusters [22] and the appropriate choice of parameters, such as the radius parameter, which is still an open issue [23]. In sum, none of the existing algorithms adequately address the problems we have outlined here.…”

Section: Distributed Clustering Algorithms (mentioning)

confidence: 99%

Double Deep Autoencoder for Heterogeneous Distributed Clustering

Chen, Huang 2019

Information


Given the issues relating to big data and privacy-preserving challenges, distributed data mining (DDM) has received much attention recently. Here, we focus on the clustering problem of distributed environments. Several distributed clustering algorithms have been proposed to solve this problem, however, previous studies have mainly considered homogeneous data. In this paper, we develop a double deep autoencoder structure for clustering in distributed and heterogeneous datasets. Three datasets are used to demonstrate the proposed algorithms, and show their usefulness according to the consistent accuracy index.


“…However, the design of a "distributed" version of G-Stream would raise difficulties, which are resolved by MBG-Stream [35]. This later operates with parameters to control the decay (or "forgetfulness") of the estimates.…”

Section: G-stream (mentioning)

confidence: 99%
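To make the idea of a decay (or "forgetfulness") parameter concrete, here is a minimal sketch of a decayed prototype update applied once per micro-batch. It is modelled on the update rule documented for MLlib's Streaming k-means; whether MBG-Stream uses exactly this form is an assumption, and the name `decayed_update` and its parameters are hypothetical.

```python
def decayed_update(center, weight, batch_points, decay):
    """One micro-batch update of a prototype with a decay factor (sketch).

    center:       current prototype vector (list of floats)
    weight:       number of points the prototype has absorbed so far
    batch_points: points from the new micro-batch assigned to this prototype
    decay:        1.0 keeps all history, 0.0 forgets everything before this batch
    """
    m = len(batch_points)
    if m == 0:
        # No new evidence: keep the position, let the old weight fade.
        return center, weight * decay
    dim = len(center)
    batch_mean = [sum(p[d] for p in batch_points) / m for d in range(dim)]
    old = weight * decay  # discounted contribution of past batches
    new_center = [
        (center[d] * old + batch_mean[d] * m) / (old + m) for d in range(dim)
    ]
    return new_center, old + m

# Hypothetical example: a prototype at the origin that has seen 10 points.
print(decayed_update([0.0, 0.0], 10.0, [[2.0, 2.0], [4.0, 4.0]], decay=1.0))
print(decayed_update([0.0, 0.0], 10.0, [[2.0, 2.0], [4.0, 4.0]], decay=0.0))
```

With `decay=1.0` the prototype moves only slightly toward the new batch; with `decay=0.0` it jumps to the batch mean, so the parameter trades stability against responsiveness to drift.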

“…In the streaming clustering point of view, Spartakus 2 is an open-source project on top of Spark-notebook 3 which provides front-end packages for some clustering algorithms implemented using the MapReduce framework. This includes the MBG-Stream 4 algorithm [35] (detailed in "Background" section) with an integrated interface for execution and visualization checks. MLlib [64] gives implementations of some clustering algorithms, especially a Streaming k-means 5 open-source code.…”

Section: Spark Streaming (mentioning)

confidence: 99%

State-of-the-art on clustering data streams

2016

Self Cite


Clustering is a key data mining task. This is the problem of partitioning a set of observations into clusters such that the intra-cluster observations are similar and the inter-cluster observations are dissimilar. The traditional setup where a static dataset is available in its entirety for random access is not applicable as we do not have the entire dataset at the launch of the learning, the data continue to arrive at a rapid rate, we can not access the data randomly, and we can make only one or at most a small number of passes on the data in order to generate the clustering results. These types of data are referred to as data streams. The data stream clustering problem requires a process capable of partitioning observations continuously while taking into account restrictions of memory and time. In the literature of data stream clustering methods, a large number of algorithms use a two-phase scheme which consists of an online component that processes data stream points and produces summary statistics, and an offline component that uses the summary data to generate the clusters. An alternative class is capable of generating the final clusters without the need of an offline phase. This paper presents a comprehensive survey of the data stream clustering methods and an overview of the most well-known streaming platforms which implement clustering.

