Multi-label text classification is an increasingly important field, as large amounts of text data are available and extracting relevant information is important in many application contexts. Probabilistic generative models are the basis of a number of popular text mining methods such as Naive Bayes or Latent Dirichlet Allocation. However, Bayesian models for multi-label text classification often become overly complicated because they must account for label dependencies and skewed label frequencies while at the same time preventing overfitting. To address this problem, we employ the same technique that contributed to the success of deep learning in recent years: greedy layer-wise training. Applying this technique in the supervised setting prevents overfitting and leads to better classification accuracy. The intuition behind this approach is to learn the labels first and subsequently add a more abstract layer to represent dependencies among them. This allows us to use a relatively simple hierarchical topic model that can easily be adapted to the online setting. We show that our method successfully models dependencies online for large-scale multi-label datasets with many labels and improves over a baseline method that does not model dependencies. The same strategy, greedy layer-wise training, also makes the batch variant competitive with existing, more complex multi-label topic models.
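To make the layer-wise idea concrete, the following is a minimal sketch, not the paper's actual model or inference procedure: stage one learns a word distribution per label (in the spirit of Labeled LDA), and stage two freezes that layer and fits a plain LDA over the document-label matrix so that its topics capture label co-occurrence. The function names, the use of scikit-learn's LatentDirichletAllocation as the second layer, and the toy data are all our own assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_label_layer(X_counts, Y, smoothing=0.01):
    """Stage 1: learn one word distribution per label (a crude supervised topic)."""
    # X_counts: (n_docs, n_words) term counts; Y: (n_docs, n_labels) binary matrix.
    counts = Y.T @ X_counts + smoothing           # aggregate word counts per label
    return counts / counts.sum(axis=1, keepdims=True)

def fit_dependency_layer(Y, n_dependency_topics=10, seed=0):
    """Stage 2: with the label layer fixed, learn topics over *labels*,
    whose components capture which labels tend to co-occur."""
    lda = LatentDirichletAllocation(n_components=n_dependency_topics,
                                    random_state=seed)
    return lda.fit(Y)  # the document-label matrix plays the role of a corpus

# Toy usage with hypothetical data shapes.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 300))                  # term counts
Y = (rng.random((200, 12)) < 0.2).astype(float)          # binary label assignments
Y[np.arange(200), rng.integers(0, 12, size=200)] = 1.0   # ensure >= 1 label per doc
label_topics = fit_label_layer(X, Y)                     # first layer: the labels
dependency_model = fit_dependency_layer(Y)               # second layer: dependencies
```

Training the layers greedily, one at a time, is what keeps each stage simple: neither stage has to fit label dependencies and per-label word distributions jointly.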
Every day, an enormous amount of text data is produced. Sources of text data include news, social media, emails, text messages, medical reports, scientific publications, and fiction. To keep track of this data, categories, keywords, tags, or labels are assigned to each text. Automatically predicting such labels is the task of multi-label text classification. Often, however, we are interested in more than the classification itself: we would like to understand which parts of a text belong to a label, which words are important for it, and which labels tend to occur together. For this reason, topic models may be used for multi-label classification as interpretable models that are flexible and easily extensible. This survey demonstrates the manifold possibilities and flexibility of the topic model framework for the complex setting of multi-label text classification by categorizing different variants of models.
Two fundamental and prominent methods for multi-label classification, Binary Relevance (BR) and Classifier Chains (CC), are usually considered distinct methods without a direct relationship. However, BR and CC can be generalized into a single method: blockwise classifier chains (BCC), where labels within a block (i.e., a group of labels of fixed size) are predicted independently as in BR but then combined to predict the next block's labels as in CC. In other words, only the blocks are connected in a chain. BR is then a special case of BCC with a block size equal to the number of labels, and CC a special case with a block size equal to one. The rationale behind BCC is to limit the propagation of errors made by inaccurate classifiers early in the chain, a problem that should be alleviated by the expected block effect. A different generalization, based on the divide-and-conquer principle rather than on error propagation, fails to exhibit the desired block effect. Ensembles of BCC are also discussed, and experiments confirm that their performance is on par with ensembles of CC. Further experiments show the effect of the block size, in particular with respect to the performance of the two extremes, BR and CC. As it turns out, some regions of the block size parameter space lead to degraded performance, whereas others improve performance to a noticeable but modest extent.
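As a concrete illustration of how BCC interpolates between BR and CC, here is a minimal sketch assuming a scikit-learn-style setup with logistic regression as the base learner and labels chained in their given order; both of these choices, and the class name, are our own assumptions rather than the paper's specification.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class BlockwiseClassifierChain:
    """Sketch of BCC: labels inside a block are predicted independently
    (as in BR); blocks are connected in a chain (as in CC)."""

    def __init__(self, base_estimator=None, block_size=3):
        self.base_estimator = base_estimator or LogisticRegression(max_iter=1000)
        self.block_size = block_size

    def fit(self, X, Y):
        n_labels = Y.shape[1]
        self.blocks_ = [list(range(i, min(i + self.block_size, n_labels)))
                        for i in range(0, n_labels, self.block_size)]
        self.models_ = []
        X_aug = X
        for block in self.blocks_:
            # BR step: one independent classifier per label in the block.
            self.models_.append([clone(self.base_estimator).fit(X_aug, Y[:, j])
                                 for j in block])
            # CC step: append the block's labels as features for the next block.
            X_aug = np.hstack([X_aug, Y[:, block]])
        return self

    def predict(self, X):
        preds, X_aug = [], X
        for block_models in self.models_:
            block_pred = np.column_stack([m.predict(X_aug) for m in block_models])
            preds.append(block_pred)
            X_aug = np.hstack([X_aug, block_pred])
        return np.hstack(preds)
```

With block_size equal to the number of labels there is a single block and no chaining, which recovers BR; with block_size=1 every label is conditioned on all previously predicted labels, which recovers CC.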