2012
DOI: 10.1186/1687-6180-2012-233

LDA boost classification: boosting by topics

Abstract: AdaBoost is an efficacious classification algorithm, especially in text categorization (TC) tasks. Its methodology of setting up a classifier committee and voting on the documents to be classified can achieve high categorization precision. However, the traditional Vector Space Model easily leads to the curse of dimensionality and to feature sparsity, which seriously degrades classification performance. This article proposes a novel classification algorithm called LDABoost, based on the boosting ideology, which …

Cited by 10 publications (5 citation statements)
References 15 publications
“…Mallet is a library of Java code for machine learning applied to text, developed by Andrew McCallum. The number of topics for each dataset is obtained according to equation (16). For LDA estimation on the training set, the inputs are (number of iterations = 2000, α = 50/K, β = 0.1), and for LDA prediction, the number of resampling iterations given is 1000.…”
Section: Experiments Settings
confidence: 99%
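The Mallet settings quoted above (α = 50/K, β = 0.1, 2000 estimation iterations, 1000 sampling iterations at prediction time) map fairly directly onto Mallet's ParallelTopicModel API. Below is a minimal sketch, assuming the documents have already been imported into Mallet InstanceList files; the file names and the choice of K are placeholders, not values taken from the cited experiments.

```java
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class LdaFeatureExtraction {
    public static void main(String[] args) throws Exception {
        // Hypothetical, pre-built Mallet instance lists (token sequences); not from the cited paper.
        InstanceList training = InstanceList.load(new java.io.File("train.mallet"));
        InstanceList test = InstanceList.load(new java.io.File("test.mallet"));

        int numTopics = 50;       // K would come from the paper's equation (16); 50 is a placeholder
        double alphaSum = 50.0;   // per-topic alpha = 50/K, so the summed alpha is 50
        double beta = 0.1;

        ParallelTopicModel lda = new ParallelTopicModel(numTopics, alphaSum, beta);
        lda.addInstances(training);
        lda.setNumIterations(2000);   // estimation iterations quoted above
        lda.estimate();

        // At prediction time, sample a topic distribution for each unseen document.
        TopicInferencer inferencer = lda.getInferencer();
        for (Instance doc : test) {
            double[] topicDist = inferencer.getSampledDistribution(doc, 1000, 10, 100);
            // topicDist is the K-dimensional feature vector a topic-based classifier would consume
        }
    }
}
```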
“…A related study was conducted by Lei et al [16], using LDA as a feature representation method for TC based on the binary version of the AdaBoost algorithm, with Naive Bayes as the weak learner for AdaBoost. However, Naive Bayes works on feature frequencies, whereas representing documents as latent topics means that each document is represented by a small number of weighted and unique topics.…”
Section: Related Work
confidence: 99%
“…The results show that this improved random forest outperformed popular text classification methods such as Naive Bayes, SVM, KNN, and RF in terms of classification performance, giving an F-score of up to 91%. Lei, Qiao, Qimin & Qitao (2012) performed topic-based text categorization using the LDABoost ensemble learning method. The experimental results showed that LDABoost increased performance from 73.3% to 90%.…”
Section: Approaches Using Ensemble Learning Methods
confidence: 99%
“…A second classifier is then created to focus on the instances in the training data that the first classifier got wrong. The process continues to add classifiers until a limit on the number of models or on accuracy is reached (Lei, Qiao, Qimin & Qitao, 2012).…”
Section: Boosting
confidence: 99%
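For readers unfamiliar with the procedure this quote summarizes, the following is a minimal generic AdaBoost sketch: each new weak classifier is fit to instance weights that emphasize what the previous classifiers got wrong, and the resulting committee predicts by weighted vote. It uses one-dimensional decision stumps on a toy dataset as the weak learner and is illustrative only, not the LDABoost implementation from the cited paper.

```java
import java.util.Arrays;

public class AdaBoostSketch {

    /** Decision stump on a single feature: predicts +1 above the threshold, -1 below (or flipped). */
    static class Stump {
        int feature; double threshold; int polarity; double alpha;
        int predict(double[] x) { return polarity * (x[feature] > threshold ? 1 : -1); }
    }

    /** Fit the stump with the lowest weighted error on the current instance weights. */
    static Stump trainStump(double[][] X, int[] y, double[] w) {
        Stump best = new Stump(); double bestErr = Double.MAX_VALUE;
        for (int f = 0; f < X[0].length; f++) {
            for (double[] xi : X) {
                double t = xi[f];
                for (int pol : new int[]{1, -1}) {
                    double err = 0;
                    for (int i = 0; i < X.length; i++) {
                        int pred = pol * (X[i][f] > t ? 1 : -1);
                        if (pred != y[i]) err += w[i];
                    }
                    if (err < bestErr) {
                        bestErr = err;
                        best.feature = f; best.threshold = t; best.polarity = pol;
                    }
                }
            }
        }
        best.alpha = 0.5 * Math.log((1 - bestErr) / Math.max(bestErr, 1e-10));
        return best;
    }

    public static void main(String[] args) {
        // Tiny toy dataset (two features, labels in {-1, +1}); purely illustrative.
        double[][] X = {{1, 2}, {2, 1}, {3, 4}, {4, 3}, {5, 6}, {6, 5}};
        int[] y = {-1, -1, -1, 1, 1, 1};

        int rounds = 5;                                  // the "limit in the number of models"
        double[] w = new double[X.length];
        Arrays.fill(w, 1.0 / X.length);                  // start with uniform instance weights
        Stump[] committee = new Stump[rounds];

        for (int t = 0; t < rounds; t++) {
            Stump h = trainStump(X, y, w);               // weak learner fit on current weights
            committee[t] = h;
            double sum = 0;
            for (int i = 0; i < X.length; i++) {
                // Misclassified instances get heavier, so the next classifier focuses on them.
                w[i] *= Math.exp(-h.alpha * y[i] * h.predict(X[i]));
                sum += w[i];
            }
            for (int i = 0; i < X.length; i++) w[i] /= sum;
        }

        // Weighted majority vote of the committee on one query instance.
        double[] query = {2.5, 2.5};
        double score = 0;
        for (Stump h : committee) score += h.alpha * h.predict(query);
        System.out.println("Committee vote: " + (score >= 0 ? "+1" : "-1"));
    }
}
```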
“…The proposed method was based on boosting algorithms for multilabel multiclass text categorization, outperforming text classifiers based on TF-IDF [12] and naive Bayes. The use of LDA-based features in boosting algorithms was introduced by La et al [30]. The method, named LDABoost, uses latent topics extracted from one LDA model as text features.…”
Section: Related Work
confidence: 99%