Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019
DOI: 10.18653/v1/w19-4322

A Self-Training Approach for Short Text Clustering

Abstract: Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations for short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative featu…
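
The abstract points to a self-training objective over low-dimensional embeddings, in the spirit of deep embedded clustering. As a rough illustration only, the sketch below shows the usual ingredients of such a scheme: soft cluster assignments via a Student's t-kernel, a sharpened auxiliary target distribution, and a KL-divergence loss. The function names and the toy data are our own assumptions, not the authors' implementation.

```python
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    """Student's t-kernel soft assignment of embeddings z to cluster centroids
    (the q_ij used in DEC-style self-training)."""
    # squared distances between each embedding and each centroid
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary distribution p_ij that the encoder is trained to match."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """Self-training loss: KL(P || Q)."""
    return float((p * np.log(p / q)).sum())

# toy usage: 6 short-text embeddings, 2 clusters (random placeholders)
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))
centroids = rng.normal(size=(2, 4))
q = soft_assignments(z, centroids)
p = target_distribution(q)
print("self-training loss:", kl_divergence(p, q))
```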

Cited by 73 publications (62 citation statements)
References 18 publications
“…A major challenge in short text clustering is the sparseness of the vector representations of these texts resulting from the small number of words in each text. Several clustering methods have been proposed in the literature to address this challenge, including methods based on text augmentation [10,11], neural networks [2,3], topic modeling [12], and Dirichlet mixture model [4].…”
Section: Short Text Clustering
confidence: 99%
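
To make the sparseness point in this excerpt concrete, here is a small illustration: with a bag-of-words/TF-IDF representation, each short text activates only a handful of the vocabulary-sized dimensions. The example texts and vectorizer settings are arbitrary choices for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy illustration of the sparseness problem: each short text sets only a
# few dimensions of a vocabulary-sized vector to non-zero values.
texts = [
    "apple releases new phone",
    "stock market falls sharply",
    "team wins championship game",
]
tfidf = TfidfVectorizer().fit_transform(texts)
n_docs, vocab_size = tfidf.shape
nonzero_per_doc = tfidf.getnnz(axis=1)
print(f"vocabulary size: {vocab_size}, "
      f"non-zero entries per document: {nonzero_per_doc.tolist()}")
# With realistic vocabularies (tens of thousands of terms) these vectors are
# almost entirely zeros, which is what dense embeddings avoid.
```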
“…To achieve this, we remove outliers from each cluster and reassign them to clusters with which they have greater similarity. We demonstrate that this approach produces more accurate cluster partitions than computationally more costly state-of-the-art short text clustering methods based on neural networks [2,3].…”
Section: Introduction
confidence: 95%
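
The outlier-reassignment idea described in this excerpt can be sketched generically, assuming outliers are defined as points unusually far from their assigned centroid; the threshold rule and helper below are illustrative only, not the cited authors' actual procedure.

```python
import numpy as np

def reassign_outliers(X, labels, threshold=2.0):
    """Schematic outlier handling: points much farther from their own centroid
    than the cluster average are moved to the nearest centroid overall.
    Assumes integer labels 0..K-1 (e.g. from k-means)."""
    labels = labels.copy()
    centroids = np.stack([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    own = dist[np.arange(len(X)), labels]          # distance to assigned centroid
    for k in np.unique(labels):
        mask = labels == k
        cutoff = threshold * own[mask].mean()      # per-cluster outlier cutoff
        outliers = mask & (own > cutoff)
        labels[outliers] = dist[outliers].argmin(axis=1)
    return labels

# toy usage with two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
labels = np.array([0] * 20 + [1] * 20)
new_labels = reassign_outliers(X, labels)
```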
“…Hadifar et al. (Hadifar et al., 2019) also use an autoencoder to help classify text. They likewise use the autoencoder to pre-train the encoder, but instead of classifying text, they use KNN for clustering similar texts in the learned latent space.…”
Section: Convolutional Autoencoder
confidence: 99%
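
A minimal sketch of the pipeline this excerpt describes: pre-train an autoencoder on text embeddings, then cluster in the learned latent space. The dimensions, epoch count, and the use of k-means here are illustrative assumptions rather than the exact setup of Hadifar et al. (2019).

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class Autoencoder(nn.Module):
    """Small dense autoencoder; 300-d inputs and a 32-d code are assumptions."""
    def __init__(self, dim_in=300, dim_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                     nn.Linear(128, dim_latent))
        self.decoder = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                     nn.Linear(128, dim_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

x = torch.randn(256, 300)                 # placeholder text embeddings
model, loss_fn = Autoencoder(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(15):                       # reconstruction pre-training
    opt.zero_grad()
    recon, _ = model(x)
    loss = loss_fn(recon, x)
    loss.backward()
    opt.step()

with torch.no_grad():
    _, z = model(x)                       # latent codes after pre-training
labels = KMeans(n_clusters=8, n_init=10).fit_predict(z.numpy())
```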
“…For ST-AEs, we used two versions of pre-trained word2vec embeddings: one for Web-snippets and 20Nshort, the other for Twitter. We fixed a value of 0.1 for all corpora, set the batch size to 64, and pre-trained the autoencoder for 15 epochs [48]. Please note that we removed the words that were not in the word embedding lookup table.…”
Section: B. Experimental Procedures
confidence: 99%
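
For reference, the hyperparameters reported in this excerpt can be collected into a small configuration object. The field names below are our own, and `unnamed_hparam` stands in for the symbol that was lost in extraction; nothing beyond the quoted values is taken from the source.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    """Settings reported in the excerpt above; field names are illustrative."""
    embedding_source: str = "pre-trained word2vec"  # separate vectors per corpus
    unnamed_hparam: float = 0.1   # fixed at 0.1 for all corpora (symbol missing)
    batch_size: int = 64
    pretrain_epochs: int = 15

print(PretrainConfig())
```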