Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019
DOI: 10.18653/v1/w19-4322

A Self-Training Approach for Short Text Clustering

Abstract: Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations for short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative featu…
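
The abstract points to a self-training objective over low-dimensional embeddings, in the spirit of deep embedded clustering. As a rough illustration only, the sketch below shows the usual ingredients of such a scheme: soft cluster assignments via a Student's t-kernel, a sharpened auxiliary target distribution, and a KL-divergence loss. The function names and the toy data are our own assumptions, not the authors' implementation.

```python
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    """Student's t-kernel soft assignment of embeddings z to cluster centroids
    (the q_ij used in DEC-style self-training)."""
    # squared distances between each embedding and each centroid
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary distribution p_ij that the encoder is trained to match."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """Self-training loss: KL(P || Q)."""
    return float((p * np.log(p / q)).sum())

# toy usage: 6 short-text embeddings, 2 clusters (random placeholders)
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))
centroids = rng.normal(size=(2, 4))
q = soft_assignments(z, centroids)
p = target_distribution(q)
print("self-training loss:", kl_divergence(p, q))
```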

Cited by 73 publications (62 citation statements)
References 18 publications
“…A major challenge in short text clustering is the sparseness of the vector representations of these texts resulting from the small number of words in each text. Several clustering methods have been proposed in the literature to address this challenge, including methods based on text augmentation [10,11], neural networks [2,3], topic modeling [12], and Dirichlet mixture model [4].…”
Section: Short Text Clustering
confidence: 99%
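
To make the sparseness point in this excerpt concrete, here is a small illustration: with a bag-of-words/TF-IDF representation, each short text activates only a handful of the vocabulary-sized dimensions. The example texts and vectorizer settings are arbitrary choices for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy illustration of the sparseness problem: each short text sets only a
# few dimensions of a vocabulary-sized vector to non-zero values.
texts = [
    "apple releases new phone",
    "stock market falls sharply",
    "team wins championship game",
]
tfidf = TfidfVectorizer().fit_transform(texts)
n_docs, vocab_size = tfidf.shape
nonzero_per_doc = tfidf.getnnz(axis=1)
print(f"vocabulary size: {vocab_size}, "
      f"non-zero entries per document: {nonzero_per_doc.tolist()}")
# With realistic vocabularies (tens of thousands of terms) these vectors are
# almost entirely zeros, which is what dense embeddings avoid.
```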
“…To achieve this, we remove outliers from each cluster and reassign them to clusters with which they have greater similarity. We demonstrate that this approach produces more accurate cluster partitions than computationally more costly state-of-the-art short text clustering methods based on neural networks [2,3].…”
Section: Introduction
confidence: 95%
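
The outlier-reassignment idea described in this excerpt can be sketched generically, assuming outliers are defined as points unusually far from their assigned centroid; the threshold rule and helper below are illustrative only, not the cited authors' actual procedure.

```python
import numpy as np

def reassign_outliers(X, labels, threshold=2.0):
    """Schematic outlier handling: points much farther from their own centroid
    than the cluster average are moved to the nearest centroid overall.
    Assumes integer labels 0..K-1 (e.g. from k-means)."""
    labels = labels.copy()
    centroids = np.stack([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    own = dist[np.arange(len(X)), labels]          # distance to assigned centroid
    for k in np.unique(labels):
        mask = labels == k
        cutoff = threshold * own[mask].mean()      # per-cluster outlier cutoff
        outliers = mask & (own > cutoff)
        labels[outliers] = dist[outliers].argmin(axis=1)
    return labels

# toy usage with two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
labels = np.array([0] * 20 + [1] * 20)
new_labels = reassign_outliers(X, labels)
```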
“…Hadifar et al. (Hadifar et al., 2019) also use an autoencoder to help classify text. They likewise use the autoencoder to pre-train the encoder, but instead of classifying text, they use KNN for clustering similar texts in the learned latent space.…”
Section: Convolutional Autoencoder
confidence: 99%
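
A minimal sketch of the pipeline this excerpt describes: pre-train an autoencoder on text embeddings, then cluster in the learned latent space. The dimensions, epoch count, and the use of k-means here are illustrative assumptions rather than the exact setup of Hadifar et al. (2019).

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class Autoencoder(nn.Module):
    """Small dense autoencoder; 300-d inputs and a 32-d code are assumptions."""
    def __init__(self, dim_in=300, dim_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                     nn.Linear(128, dim_latent))
        self.decoder = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                     nn.Linear(128, dim_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

x = torch.randn(256, 300)                 # placeholder text embeddings
model, loss_fn = Autoencoder(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(15):                       # reconstruction pre-training
    opt.zero_grad()
    recon, _ = model(x)
    loss = loss_fn(recon, x)
    loss.backward()
    opt.step()

with torch.no_grad():
    _, z = model(x)                       # latent codes after pre-training
labels = KMeans(n_clusters=8, n_init=10).fit_predict(z.numpy())
```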
“…For ST-AEs, we used two versions of pre-trained word2vec embeddings: one for Web-snippets and 20Nshort, the other for Twitter. We fixed a value of 0.1 for all corpora, set the batch size to 64, and pre-trained the autoencoder for 15 epochs [48]. Please note that we removed the words that were not in the word embedding lookup table.…”
Section: B. Experimental Procedures
confidence: 99%
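
For reference, the hyperparameters reported in this excerpt can be collected into a small configuration object. The field names below are our own, and `unnamed_hparam` stands in for the symbol that was lost in extraction; nothing beyond the quoted values is taken from the source.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Settings reported in the excerpt above; field names are illustrative."""
    embedding_source: str = "pre-trained word2vec"  # separate vectors per corpus
    unnamed_hparam: float = 0.1   # fixed at 0.1 for all corpora (symbol missing)
    batch_size: int = 64
    pretrain_epochs: int = 15

print(PretrainConfig())
```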