Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval 2013
DOI: 10.1145/2484028.2484140

Document classification by topic labeling

Abstract: In this paper, we propose a Latent Dirichlet Allocation (LDA) [1] based document classification algorithm that does not require any labeled dataset. In our algorithm, we construct a topic model using LDA, assign each topic to one of the class labels, aggregate all topics with the same class label into a single topic using the aggregation property of the Dirichlet distribution, and then automatically assign a class label to each unlabeled document depending on its "closeness" to one of the aggregated topics. We present…
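The aggregation and labeling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that LDA has already produced a topic-proportion vector for a document, that each topic has been given one class label, and that "closeness" means the largest aggregated proportion (the abstract does not define it). Summing proportions over same-label topics mirrors the aggregation property of the Dirichlet distribution, under which merged components are again Dirichlet distributed with summed parameters. The function names, labels, and numbers are hypothetical.

```python
from collections import defaultdict


def aggregate_by_label(theta, topic_labels):
    """Sum a document's topic proportions over topics that share a class label.

    theta        -- topic proportions for one document (assumed to come from LDA)
    topic_labels -- topic_labels[k] is the class label assigned to topic k
    """
    aggregated = defaultdict(float)
    for topic_id, weight in enumerate(theta):
        aggregated[topic_labels[topic_id]] += weight
    return dict(aggregated)


def classify(theta, topic_labels):
    """Pick the label whose aggregated topic has the largest proportion.

    Assumption: "closeness" is taken to be the aggregated proportion itself.
    """
    aggregated = aggregate_by_label(theta, topic_labels)
    return max(aggregated, key=aggregated.get)


# Hypothetical example: four topics, two class labels.
topic_labels = ["med", "space", "med", "space"]
doc_theta = [0.10, 0.40, 0.15, 0.35]

aggregated = aggregate_by_label(doc_theta, topic_labels)
predicted = classify(doc_theta, topic_labels)
print(aggregated, predicted)
```

In this toy case the two "space" topics together carry more of the document's probability mass than the two "med" topics, so the document is labeled "space".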

Cited by 68 publications (55 citation statements)
References: 4 publications
“…If we compare pc-mac dataset and med-space dataset, then we can observe in Table 13 that H(MD) is higher for pc-mac dataset than that of med-space dataset and it also can be verified by observing topics inferred on both the datasets (Table 1 and 12). As per the quality of topics, we can say that pc-mac dataset is "harder" to classify than med-space dataset which is also evident from text classification performance using both TLC and SVM.…”
Section: Analysis Of Quality Of Topics (mentioning)
confidence: 67%
“…Hingmire et al [12] propose a text classification algorithm, ClassifyLDA, based on labeling of LDA topics. In ClassifyLDA algorithm, an annotator assigns a single class label to each topic.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…Hence, topics inferred by LDA may not correlate well with human judgements even though they better optimize perplexity on held-out documents (Chang et al, 2009). Given the growing importance of topic models like LDA in text mining techniques and applications (Hingmire et al, 2013;Lin and He, 2009;Pawar et al, 2016), it is crucial to ensure that the inferred topics are of as high quality as possible. As shown in (Aletras et al, 2017), computing topic coherence is also important for developing better topic representation methods for use in Information Retrieval.…”
Section: Introduction (mentioning)
confidence: 99%
“…Conventional methods often take advantage of co-occurrence word-based model (e.g., unigram, bigram and trigram) or syntax clues to represent the classification features, which suffers from data sparsity problem, to some extent. To address this problem, feature selection (e.g., frequency, mutual information [1], topic model [2,3]) have been received extensive discussions and comparisons. Nevertheless, the features selection applied in these methods are often time-consuming and computationally prohibit.…”
Section: Introduction (mentioning)
confidence: 99%