Enhancement of Short Text Clustering by Iterative Classification

Rakib, Rashadul Hasan; Zeh, Norbert; Jankowska, Magdalena; Milios, Evangelos

doi:10.1007/978-3-030-51310-8_10

Cited by 23 publications

(13 citation statements)

References 16 publications

“…For text clustering, we compare the proposed TCL with 11 benchmarks, including TF/TF-IDF (Jones, 1972), BagOfWords (BOW) (Harris, 1954), SkipVec (Kiros et al, 2015), Para2Vec (Le and Mikolov, 2014), GSDPMM (Yin and Wang, 2016), RecNN (Socher et al, 2011), STCC (Xu et al, 2017b), HAC-SD (Rakib et al, 2020), ECIC (Rakib et al, 2020), and SCCL (Zhang et al, 2021a). Similarly, the vanilla k-means is conducted on the extracted features to cluster data for those representation-based methods, including BOW, TF/TF-IDF, SkipVec, Para2Vec, and RecNN.…”

Section: Compared Methods (mentioning)

confidence: 99%

Twin Contrastive Learning for Online Clustering

Yang, Peng, et al. 2022. Int J Comput Vis

This paper proposes to perform online clustering by conducting twin contrastive learning (TCL) at the instance and cluster level. Specifically, we find that when the data is projected into a feature space with a dimensionality of the target cluster number, the rows and columns of its feature matrix correspond to the instance and cluster representation, respectively. Based on the observation, for a given dataset, the proposed TCL first constructs positive and negative pairs through data augmentations. Thereafter, in the row and column space of the feature matrix, instance- and cluster-level contrastive learning are respectively conducted by pulling together positive pairs while pushing apart the negatives. To alleviate the influence of intrinsic false-negative pairs and rectify cluster assignments, we adopt a confidence-based criterion to select pseudo-labels for boosting both the instance- and cluster-level contrastive learning. As a result, the clustering performance is further improved. Besides the elegant idea of twin contrastive learning, another advantage of TCL is that it could independently predict the cluster assignment for each instance, thus effortlessly fitting online scenarios. Extensive experiments on six widely-used image and text benchmarks demonstrate the effectiveness of TCL. The code will be released on GitHub.

“…Another interesting technique concerning intersections of classification and clustering of short texts is presented in [ 34 ] where a classifier is trained with cluster labels to improve the previous clustering.…”

Section: Previous Work (mentioning)

confidence: 99%

Eigenvalue based spectral classification

et al. 2023

This paper describes a new method of classification based on spectral analysis. The motivation for developing the new model was the failure of classical spectral cluster analysis, based on combinatorial and normalized Laplacians, on a set of real-world datasets of textual documents. The reasons for the failures are analysed. While the known methods are all based on the eigenvectors of graph Laplacians, a new classification method based on the eigenvalues of graph Laplacians is proposed and studied.

“…Rajan et al [44] depict a clustering process to aggregate patent descriptions into similar groups to facilitate the search process in patent databases. Rakib et al [45] propose an iterative classification method that improves the clustering of short texts. This is done by detecting outliers during the clustering process and changing the clusters to which they are assigned.…”

Section: Related Work (mentioning)

confidence: 99%

Approaches for the Clustering of Geographic Metadata and the Automatic Detection of Quasi-Spatial Dataset Series

Lacasta, López-Pellicer, Zarazaga‐Soria, et al. 2022. IJGI

The discrete representation of resources in geospatial catalogues affects their information retrieval performance. The performance could be improved by using automatically generated clusters of related resources, which we name quasi-spatial dataset series. This work evaluates whether a clustering process can create quasi-spatial dataset series using only textual information from metadata elements. We assess the combination of different kinds of text cleaning approaches, word and sentence-embeddings representations (Word2Vec, GloVe, FastText, ELMo, Sentence BERT, and Universal Sentence Encoder), and clustering techniques (K-Means, DBSCAN, OPTICS, and agglomerative clustering) for the task. The results demonstrate that combining word-embeddings representations with an agglomerative-based clustering creates better quasi-spatial dataset series than the other approaches. In addition, we have found that the ELMo representation with agglomerative clustering produces good results without any preprocessing step for text cleaning.
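The citation statements above describe the cited paper's idea in general terms: a classifier is trained with cluster labels to improve a previous clustering, and outliers are detected and moved to other clusters. The following is a minimal Python sketch of that loop, not the authors' implementation. The outlier detector (distance to the cluster centroid), the classifier (nearest centroid of the retained points), and the names `iterative_classification`, `keep_fraction`, and `max_iters` are all assumptions chosen for illustration; the statements do not say which detector or classifier the paper uses.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def iterative_classification(points, labels, keep_fraction=0.8, max_iters=10):
    """Refine an initial clustering by repeatedly reclassifying outliers.

    Assumed stand-ins (not taken from the cited paper):
      - outliers: the points farthest from their cluster centroid
      - classifier: nearest centroid of the retained (non-outlier) points
    """
    labels = list(labels)
    for _ in range(max_iters):
        # Group point indices by their current cluster label.
        clusters = {}
        for i, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(i)

        trained_centroids = {}  # the "classifier": one centroid per cluster
        outliers = []
        for lab, idxs in clusters.items():
            c = centroid([points[i] for i in idxs])
            ranked = sorted(idxs, key=lambda i: dist(points[i], c))
            n_keep = max(1, int(len(ranked) * keep_fraction))
            inliers, rest = ranked[:n_keep], ranked[n_keep:]
            # "Train" on the retained points only, using cluster labels.
            trained_centroids[lab] = centroid([points[i] for i in inliers])
            outliers.extend(rest)

        # Reassign each outlier to the cluster the classifier predicts.
        changed = False
        for i in outliers:
            predicted = min(
                trained_centroids,
                key=lambda lab: dist(points[i], trained_centroids[lab]),
            )
            if predicted != labels[i]:
                labels[i] = predicted
                changed = True
        if not changed:
            break  # labels are stable, stop iterating
    return labels


# Toy data: two well-separated groups; the last point starts in the wrong cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (9, 9)]
initial_labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]
refined = iterative_classification(points, initial_labels)
print(refined)  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In this toy run the mislabeled point `(9, 9)` is flagged as the farthest member of cluster 0 and reassigned to cluster 1 in the first pass; the second pass changes nothing, so the loop stops. Real short-text data would first need a vector representation, which this sketch leaves out.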
