Document classification by topic labeling

Hingmire, Swapnil; Chougule, Sandeep; Palshikar, Girish Keshav; Chakraborti, Sutanu

doi:10.1145/2484028.2484140

Cited by 68 publications

(55 citation statements)

References 4 publications

“…If we compare pc-mac dataset and med-space dataset, then we can observe in Table 13 that H(MD) is higher for pc-mac dataset than that of med-space dataset and it also can be verified by observing topics inferred on both the datasets (Table 1 and 12). As per the quality of topics, we can say that pc-mac dataset is "harder" to classify than med-space dataset which is also evident from text classification performance using both TLC and SVM.…”

Section: Analysis Of Quality Of Topics (mentioning)

confidence: 67%

“…Hingmire et al [12] propose a text classification algorithm, ClassifyLDA, based on labeling of LDA topics. In ClassifyLDA algorithm, an annotator assigns a single class label to each topic.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…We determine the effectiveness of our algorithm in relation to two weakly supervised text classification algorithms: GE-FL [8] and ClassifyLDA [12]. We evaluate and compare our text classification algorithm by computing Macro averaged F1.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%
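
The Macro averaged F1 named in the citation statement above can be illustrated with a minimal sketch. It is not taken from the citing paper: the function name `macro_f1` and the toy labels are assumptions made for illustration, and the per-class F1 uses the standard definition 2TP / (2TP + FP + FN).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    # Every class that appears in the gold labels or the predictions.
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging weights every class equally, regardless of its size.
    return sum(f1_scores) / len(f1_scores)


# Toy example with hypothetical labels (not data from any cited paper).
gold = ["pc", "pc", "mac", "mac"]
pred = ["pc", "mac", "mac", "mac"]
score = macro_f1(gold, pred)
```

Because each class contributes equally, a classifier that ignores a small class is penalized more under macro averaging than under accuracy.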

Topic labeled text classification

Hingmire

Chakraborti

2014

Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval

Self Cite

Supervised text classifiers require extensive human expertise and labeling efforts. In this paper, we propose a weakly supervised text classification algorithm based on the labeling of Latent Dirichlet Allocation (LDA) topics. Our algorithm is based on the generative property of LDA. In our algorithm, we ask an annotator to assign one or more class labels to each topic, based on its most probable words. We classify a document based on its posterior topic proportions and the class labels of the topics. We also enhance our approach by incorporating domain knowledge in the form of labeled words. We evaluate our approach on four real world text classification datasets. The results show that our approach is more accurate in comparison to semi-supervised techniques from previous work. A central contribution of this work is an approach that delivers effectiveness comparable to the state-of-the-art supervised techniques in hard-to-classify domains, with very low overheads in terms of manual knowledge engineering.
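
The classification step described in the abstract (posterior topic proportions combined with annotator-assigned topic labels) might be sketched roughly as follows. This is not the authors' algorithm: the abstract does not say how a topic with several labels contributes to each class, so splitting its proportion evenly is an assumption, and the function name `classify_by_topic_labels` and the toy values are hypothetical.

```python
from collections import defaultdict


def classify_by_topic_labels(topic_proportions, topic_labels):
    """Score classes for one document and return (best_class, scores).

    topic_proportions: posterior topic proportions of the document,
        one float per topic (e.g. from a fitted LDA model).
    topic_labels: one set of class labels per topic, as assigned by
        an annotator after inspecting the topic's most probable words.
    """
    if len(topic_proportions) != len(topic_labels):
        raise ValueError("need exactly one label set per topic")
    scores = defaultdict(float)
    for proportion, labels in zip(topic_proportions, topic_labels):
        if not labels:
            continue  # unlabeled topics contribute to no class
        # ASSUMPTION: a multi-label topic splits its mass evenly.
        share = proportion / len(labels)
        for label in labels:
            scores[label] += share
    if not scores:
        raise ValueError("no topic carries a class label")
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Toy document with three topics; the third topic has two labels.
theta = [0.5, 0.3, 0.2]
labels = [{"pc"}, {"mac"}, {"pc", "mac"}]
predicted, class_scores = classify_by_topic_labels(theta, labels)
```

Here the document is assigned to the class with the largest accumulated topic mass; any other aggregation rule would slot into the same loop.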

“…Hence, topics inferred by LDA may not correlate well with human judgements even though they better optimize perplexity on held-out documents (Chang et al, 2009). Given the growing importance of topic models like LDA in text mining techniques and applications (Hingmire et al, 2013; Lin and He, 2009; Pawar et al, 2016), it is crucial to ensure that the inferred topics are of as high quality as possible. As shown in (Aletras et al, 2017), computing topic coherence is also important for developing better topic representation methods for use in Information Retrieval.…”

Section: Introduction (mentioning)

confidence: 99%

Measuring Topic Coherence through Optimal Word Buckets

Ramrakhiyani

Pawar

Hingmire

et al. 2017

Proceedings of the 15th Conference of the European Chapter of The Association for Computational Linguistics: Volume 2

Self Cite

Measuring topic quality is essential for scoring the learned topics and their subsequent use in Information Retrieval and Text classification. To measure quality of Latent Dirichlet Allocation (LDA) based topics learned from text, we propose a novel approach based on grouping of topic words into buckets (TBuckets). A single large bucket signifies a single coherent theme, in turn indicating high topic coherence. TBuckets uses word embeddings of topic words and employs singular value decomposition (SVD) and Integer Linear Programming based optimization to create coherent word buckets. TBuckets outperforms the state-of-the-art techniques when evaluated using 3 publicly available datasets and on another one proposed in this paper.

“…Conventional methods often take advantage of co-occurrence word-based model (e.g., unigram, bigram and trigram) or syntax clues to represent the classification features, which suffers from data sparsity problem, to some extent. To address this problem, feature selection (e.g., frequency, mutual information [1], topic model [2,3]) have been received extensive discussions and comparisons. Nevertheless, the features selection applied in these methods are often time-consuming and computationally prohibit.…”

Section: Introduction (mentioning)

confidence: 99%