Streaming k-Means Clustering with Fast Queries

Tangwongsan, Kanat; Tirthapura, Srikanta

doi:10.1109/icde.2017.102

Cited by 15 publications

(7 citation statements)

References 14 publications


“…Clustering topics involves grouping similar topics into a set known as a cluster. The idea is that topics in one cluster are likely to be different compared to topics grouped under another cluster [19]. In other words, topics in one cluster are more co-related than those in another.…”

Section: Related Work (mentioning)

confidence: 99%

Untitled

2022

IJNLC


Topic detection in dialogue datasets has become a significant challenge for unsupervised and unlabeled data to develop a cohesive and engaging dialogue system. In this paper, we propose unsupervised and semi-supervised techniques for topic detection in conversational dialogue datasets and compare them with existing topic detection techniques. The paper proposes a novel approach for topic detection, which takes preprocessed data as input and performs similarity analysis with the TF-IDF scores bag of words technique (BOW) to identify higher frequency words from dialogue utterances. It then refines the higher frequency words by integrating the clustering and elbow methods and using the Parallel Latent Dirichlet Allocation (PLDA) model to detect the topics. The paper presents a comparative analysis of the proposed approach on the Switchboard, Personachat and MultiWOZ datasets. The experimental results show that the proposed topic detection approach performs significantly better using a semi-supervised dialogue dataset. We also performed topic quantification to check how accurate the extracted topics are compared with manually annotated data. For example, extracted topics from Switchboard are 92.72%, Personachat 87.31% and MultiWOZ 93.15% accurate with respect to manually annotated data.



“…We assume a general dynamic setting, where new objects can be inserted into a system and arbitrary objects can be removed from the system. A dataset update corresponds to an object insertion or deletion, and it comes one by one [17], [19], [22]. Let P be a set of objects generated so far.…”

Section: Problem Definition (mentioning)

confidence: 99%

“…It is hence desirable to specify ρ_min and δ_min on-demand. This on-demand clustering query for dynamic data is common in the literature, e.g., [17], [20], [22], [23]. Once cluster centers are determined, cluster label propagation is done in Θ(n) time by traversing dependent objects from the cluster centers in a breadth-first search manner.…”

Section: Problem Definition (mentioning)

confidence: 99%

Scalable and Accurate Density-Peaks Clustering on Fully Dynamic Data

Amagata

2022

2022 IEEE International Conference on Big Data (Big Data)


Clustering is a primitive and important operator that analyzes a given dataset to discover its hidden patterns and features. Because datasets are usually updated dynamically (i.e., they accept continuous insertions and arbitrary deletions), analyzing such dynamic data is also an important topic, and dynamic clustering effectively supports it, but is a challenging problem. In this paper, we consider the problem of density-peaks clustering (DPC) on dynamic data. DPC is one of the density-based clustering algorithms and attracts attention for many applications, due to its effectiveness. We investigate the hardness of this problem theoretically to measure the efficiencies of dynamic DPC algorithms. We prove that any exact solutions are costly, and propose an approximation algorithm to enable faster updates. We conduct experiments on real datasets, and the results confirm that our algorithm is much faster and more accurate than the state of the art.
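The label propagation step quoted in the citation statement above (a breadth-first traversal of dependent objects starting from the cluster centers) can be illustrated with a minimal sketch. This is not the authors' implementation: the `dependent` array (each object's dependency pointer, `None` when there is none) and the function name `propagate_labels` are hypothetical choices made for illustration, and the sketch assumes the dependency pointers form a forest.

```python
from collections import deque


def propagate_labels(dependent, centers):
    """Give every object the label of the center it transitively depends on.

    dependent[i] is the index of the object that i depends on, or None
    (hypothetical representation, assumed to form a forest).
    centers is a list of object indices chosen as cluster centers.
    """
    n = len(dependent)
    center_set = set(centers)

    # Invert the pointers: children[j] lists the objects that depend on j.
    # Centers are roots, so their own dependency pointer is ignored.
    children = [[] for _ in range(n)]
    for i, j in enumerate(dependent):
        if j is not None and i not in center_set:
            children[j].append(i)

    labels = [None] * n
    queue = deque()
    for label, c in enumerate(centers):
        labels[c] = label
        queue.append(c)

    # Each object is enqueued at most once, so the traversal is linear in n.
    while queue:
        j = queue.popleft()
        for i in children[j]:
            labels[i] = labels[j]
            queue.append(i)
    return labels


# Toy example: objects 1, 2, 6 hang off center 0; objects 4, 5 hang off center 3.
example_labels = propagate_labels([None, 0, 1, None, 3, 4, 1], [0, 3])
print(example_labels)
```

Objects that are not reachable from any center keep the label `None` in this sketch; how such objects are handled in the cited work is not stated here.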


“…This is why approximation and data reduction techniques are popular choices for accelerating existing clustering algorithms. In fact, the paradigm of coresets has seen great success in the task of approximating solutions for clustering problems (see [10], [12], [4], [22] and [1]). The sensitivity framework, originally proposed for constructing coresets for clustering problems, requires a sub-optimal clustering of the input data D in order to compute the sensitivity for each point in D. This requirement transfers to CABLR, described in the previous section.…”

Section: Challenges and Problems (mentioning)

confidence: 99%

Coreset-Based Data Compression for Logistic Regression

Riquelme-Granada

Nguyen

Luo

2021

Communications in Computer and Information Science


The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used as an acceleration technique for many learning problems, the algorithms used for constructing them may become computationally exhaustive in some settings. We show that this can easily happen when computing coresets for learning a logistic regression classifier. We overcome this issue with two methods: Accelerating Clustering via Sampling (ACvS) and Regressed Data Summarisation Framework (RDSF); the former is an acceleration procedure based on a simple theoretical observation on using Uniform Random Sampling for clustering problems, the latter is a coreset-based data-summarising framework that builds on ACvS and extends it by using a regression algorithm as part of the construction. We tested both procedures on five public datasets, and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating summaries of data using the RDSF framework is Ordinary Least Squares (OLS).
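The general idea of clustering a uniform random sample instead of the full input, which the abstract above mentions as the basis of ACvS, can be sketched as follows. This is only an illustration of that idea under simple assumptions, not the ACvS procedure itself: it uses plain Lloyd's k-means, and all function names (`lloyd_kmeans`, `kmeans_cost`, `cluster_on_uniform_sample`) and the synthetic data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's iterations; initial centers are k distinct input positions."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def cluster_on_uniform_sample(points, k, sample_size, seed=0):
    """Run k-means on a uniform random sample rather than on all points."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return lloyd_kmeans(sample, k, seed=seed)


# Synthetic data (assumption for illustration): two Gaussian blobs in the plane.
data_rng = random.Random(42)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
points += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(200)]

sample_centers = cluster_on_uniform_sample(points, k=2, sample_size=50)
print("cost on full data:", kmeans_cost(points, sample_centers))
```

The centers are fitted on 50 sampled points but evaluated on all 400, which is the comparison that matters when judging whether the sample was a reasonable stand-in for the full input.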
