Introducing and Comparing Recent Clustering Methods for Massive Data Management in the Internet of Things

Guyeux, Christophe; Chrétien, Stéphane; Tayeh, Gaby Bou; Demerjian, Jacques; Bahi, Jacques M.

doi:10.3390/jsan8040056

Cited by 16 publications

(5 citation statements)

References 47 publications

“…K-means partitions data into k distinct clusters based on distance to the centroid of a cluster, which have been successfully applied to the analysis of Raman spectra from biological samples such as breast cancer ( Kothari et al, 2021 ), colonic cancer ( Beljebbar et al, 2009 ), and macromolecules ( Pahlow et al, 2018 ). As for the DBSCAN algorithm, it is a density-based clustering that looks for high-density areas and extends clusters from them ( Guyeux et al, 2019 ). Thus, the pre-set number of clusters is not required.…”

Section: Results (mentioning)

confidence: 99%

Comparative Analysis of Machine Learning Algorithms on Surface Enhanced Raman Spectra of Clinical Staphylococcus Species

et al. 2021

Raman spectroscopy (RS) is a widely used analytical technique based on the detection of molecular vibrations in a defined system, which generates Raman spectra that contain unique and highly resolved fingerprints of the system. However, the low intensity of the normal Raman scattering effect greatly hinders its application. Recently, the newly emerged surface enhanced Raman spectroscopy (SERS) technique overcomes the problem by mixing metal nanoparticles such as gold and silver with samples, which enhances the signal intensity of Raman effects by orders of magnitude when compared with regular RS. In clinical and research laboratories, SERS offers great potential for fast, sensitive, label-free, and non-destructive microbial detection and identification with the assistance of appropriate machine learning (ML) algorithms. However, choosing an appropriate algorithm for a specific group of bacterial species remains challenging, because with the large volumes of data generated during SERS analysis not all algorithms could achieve a relatively high accuracy. In this study, we compared three unsupervised machine learning methods and 10 supervised machine learning methods, respectively, on 2,752 SERS spectra from 117 Staphylococcus strains belonging to nine clinically important Staphylococcus species in order to test the capacity of different machine learning methods for bacterial rapid differentiation and accurate prediction. According to the results, density-based spatial clustering of applications with noise (DBSCAN) showed the best clustering capacity (Rand index 0.9733) while convolutional neural network (CNN) topped all other supervised machine learning methods as the best model for predicting Staphylococcus species via SERS spectra (ACC 98.21%, AUC 99.93%). Taken together, this study shows that machine learning methods are capable of distinguishing closely related Staphylococcus species and therefore have great potential for application in bacterial pathogen diagnosis in clinical settings.
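
The citation statement for this entry describes DBSCAN as growing clusters outward from high-density areas without a pre-set number of clusters. The following is a minimal pure-Python sketch of that idea, not the implementation used by the cited study. The function name `dbscan`, the parameters `eps` and `min_pts`, and the toy points are assumptions chosen for illustration.

```python
from math import dist

NOISE = -1  # label for points that belong to no dense region


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep growing
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters is never passed in; it falls out of `eps` and `min_pts`, and the isolated point is reported as noise rather than forced into a cluster.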

“…The fourteen chosen algorithms for consideration (and their implementations) are as follows: Hierarchical (Ward’s) [ 19 , 45 ], Hierarchical (Single Link) [ 19 ], BIRCH (Balanced Iterative Reducing and Clustering) [ 46 , 47 ], k -means [ 48 – 50 ], k -means minibatch [ 49 , 51 ], Partitioning around Medoids (PAM) [ 52 ], DBSCAN (Density-based Spatial Clustering of Applications with Noise) [ 49 , 53 ], OPTICS (Ordering Points to Identify Clustering Structure) [ 49 , 54 ], Mean Shift [ 49 , 55 ], Spectral Clustering [ 49 , 56 , 57 ], Affinity Propagation [ 49 , 57 , 58 ], and Gaussian Mixture Model [ 57 , 59 ] were implemented using the scikit-learn Python package ( https://scikit-learn.org/stable/modules/clustering.html ). Fuzzy C-Means [ 60 , 61 ] was implemented using the Fuzzy C-Means Python package [ 62 ] ( https://git.io/fuzzy-c-means ).…”

Section: Methods (mentioning)

confidence: 99%

An analysis framework for clustering algorithm selection with applications to spectroscopy

Crase, Thennadil 2022, PLoS ONE

Cluster analysis is a valuable unsupervised machine learning technique that is applied in a multitude of domains to identify similarities or clusters in unlabelled data. However, its performance is dependent on the characteristics of the data it is being applied to. There is no universally best clustering algorithm, and hence, there are numerous clustering algorithms available with different performance characteristics. This raises the problem of how to select an appropriate clustering algorithm for the given analytical purposes. We present and validate an analysis framework to address this problem. Unlike most current literature, which focuses on characterizing the clustering algorithm itself, we present a wider holistic approach, with a focus on the user’s needs, the data’s characteristics and the characteristics of the clusters it may contain. In our analysis framework, we utilize a softer qualitative approach to identify appropriate characteristics for consideration when matching clustering algorithms to the intended application. These are used to generate a small subset of suitable clustering algorithms whose performance is then evaluated utilizing quantitative cluster validity indices. To validate our analysis framework for selecting clustering algorithms, we applied it to four different types of datasets: three datasets of homemade explosives spectroscopy, eight datasets of publicly available spectroscopy data covering food and biomedical applications, a gene expression cancer dataset, and three classic machine learning datasets. Each data type has discernible differences in the composition of the data and the context within which they are used. Our analysis framework, when applied to each of these challenges, recommended differing subsets of clustering algorithms for final quantitative performance evaluation. For each application, the recommended clustering algorithms were confirmed to contain the top performing algorithms through quantitative performance indices.

“…The fowlkes-mallows index is a metric that assesses the similarity between two clusters by comparing the clustering result to a known ground truth partition [24]. FMI produces a similarity score ranging from 0 to 1, with a higher score indicating a higher similarity between the two clusters.…”

Section: Fowlkes-Mallows Index (mentioning)

confidence: 99%

Clustering performance using k-modes with modified entropy measure for breast cancer

Mahfuz, Suhartanto, Kusmardi et al. 2023, IJEECS

Breast cancer is a serious disease that requires data analysis for diagnosis and treatment. Clustering is a data mining technique that is often used in breast cancer research to assess the level of malignancy at an early stage. However, clustering categorical data can be challenging because different levels in categorical variables can impact the clustering process. This research proposes a modified entropy measure (MEM) to enhance clustering performance. MEM aims to address the issue of distance-based measures in clustering categorical data. It is also a useful tool for assessing data loss in categorical clustering, which helps to understand the patterns and relationships by quantifying the information lost during clustering. An evaluation compares k-modes+MEM, k-means+MEM, DBSCAN+MEM, and affinity+MEM with conventional clustering algorithms. The assessment metrics of clustering accuracy, intra-cluster distance and Fowlkes-Mallows index (FMI) are employed to evaluate the algorithm performance. Experimental results show significant improvements. The k-modes+MEM algorithm achieves a reduction in average intra-cluster distance and outperforms other algorithms in accuracy, intra-cluster distance, and FMI. The proposed algorithm can be extended to heterogeneous datasets in various domains such as healthcare, finance, and marketing.
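
The Fowlkes-Mallows index mentioned in this entry compares a clustering result with a ground-truth partition and yields a score between 0 and 1. The sketch below uses the standard pair-counting definition, FMI = TP / sqrt((TP + FP)(TP + FN)), where TP counts pairs of points grouped together in both partitions. It is an illustrative pure-Python version, not the code used in the cited paper. The function name `fowlkes_mallows` and the example labels are assumptions.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Pair-counting Fowlkes-Mallows index between two labelings."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both partitions
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the ground truth
    if tp == 0:
        return 0.0  # no agreeing pair; also avoids dividing by zero
    return tp / sqrt((tp + fp) * (tp + fn))


# Hypothetical labels: one point is assigned to the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
score = fowlkes_mallows(truth, predicted)
print(round(score, 4))
```

Because only co-membership of pairs is counted, the index does not depend on how the cluster ids are named: relabeling cluster 0 as 1 and vice versa leaves the score unchanged.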
