Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR 2003
DOI: 10.1145/860454.860455
A scalability analysis of classifiers in text categorization

Abstract: Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least square fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation on the usage of differe…

Cited by 56 publications (80 citation statements)
References 5 publications
“…- Very good effectiveness, as shown in several text classification experiments [6][7][8][9]; this effectiveness is often due to their natural ability to deal with non-linearly separable classes; - The fact that they scale extremely well (better than SVMs) to very high numbers of classes [9]. In fact, computing the |Tr| distance scores and sorting them in descending order (as from Step 1) needs to be performed only once, irrespective of the number m of classes involved; this means that distance-weighted k-NN scales (wildly) sublinearly with the number of classes involved, while learning methods that generate linear classifiers scale linearly, since none of the computation needed for generating a single classifier Φ′ can be reused for the generation of another classifier Φ′′, even if the same training set Tr is involved.…”
Section: (Similarly to Equation 1) Identify the Set
confidence: 89%
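The scaling argument quoted above can be made concrete with a small sketch. The Python fragment below is illustrative only, not code from the cited papers; the dense NumPy vectors and the function name score_all_classes are assumptions. It shows why distance-weighted k-NN is cheap in the number of classes m: the |Tr| distances are computed and sorted once, and every class score is then read off the same k neighbours.

```python
# Minimal sketch (assumed names and data layout) of distance-weighted k-NN
# scoring over m classes: the expensive work depends only on |Tr|, not on m.
import numpy as np

def score_all_classes(test_vec, train_matrix, train_labels, num_classes, k=30):
    """Return one score per class for a single test document.

    train_matrix : (|Tr|, d) array of training-document vectors
    train_labels : list of sets; train_labels[i] holds the class ids of doc i
    """
    # Step 1: |Tr| distance scores, computed and sorted ONCE,
    # independently of the number m of classes.
    dists = np.linalg.norm(train_matrix - test_vec, axis=1)
    nearest = np.argsort(dists)[:k]

    # Step 2: per-class aggregation reuses the same k neighbours,
    # so the extra cost per class is only O(k).
    scores = np.zeros(num_classes)
    for i in nearest:
        w = 1.0 / (dists[i] + 1e-9)          # distance weighting
        for c in train_labels[i]:
            scores[c] += w
    return scores
```

By contrast, a one-vs-rest linear learner must run its full training procedure once per class, and none of that work can be reused across classes, which is the linear-in-m behaviour the statement refers to.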
“…In contrast to the datasets typically utilized in research, multilabel corpora in the real world can contain thousands or tens of thousands of labels, and the label frequencies in these datasets tend to have highly skewed frequency distributions with power-law statistics (Yang et al. 2003; Liu et al. 2005; Dekel and Shamir 2010). Figure 1 illustrates this point for three large real-world corpora, each containing thousands of unique labels, by plotting the number of labels within each corpus as a function of label frequency.…”
Section: Background and Motivation
confidence: 99%
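The skewed label-frequency behaviour described in this statement is easy to inspect on any multilabel corpus. Below is a minimal sketch, with an invented toy corpus and illustrative names, that counts how often each label appears and tabulates the frequency distribution; on real data one typically sees a few very frequent head labels and a long tail of rare ones.

```python
# Minimal sketch: measure label-frequency skew in a multilabel corpus.
# The corpus here is a toy example, not data from the cited papers.
from collections import Counter

def label_frequency_histogram(doc_labels):
    """doc_labels: iterable of label collections, one per document."""
    freq = Counter(lbl for labels in doc_labels for lbl in labels)
    # histogram over frequencies: how many labels occur exactly f times
    return Counter(freq.values()), freq

hist, freq = label_frequency_histogram([
    {"sports", "news"}, {"news"}, {"news", "politics"}, {"finance"},
])
print(freq.most_common(3))   # a handful of head labels dominate
print(sorted(hist.items()))  # many labels with tiny counts -> long tail
```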
“…For instance, the “shrinkage” method presented in McCallum et al. (1998) is aimed at improving parameter estimation for data-sparse leaf categories in a 1-of-n HTC system based on a naïve Bayesian method; the underlying intuitions are specific to naïve Bayesian methods, and do not easily carry over to other contexts. Incidentally, the naïve Bayesian approach seems to have been the most popular among HTC researchers, since several other HTC models are hierarchical variations of naïve Bayesian learning algorithms (Chakrabarti et al. 1998; Gaussier et al. 2002; Toutanova et al. 2001; Vinokourov and Girolami 2002); SVMs have also recently gained popularity in this respect (Cai and Hofmann 2004; Dumais and Chen 2000; Liu et al. 2005; Yang et al. 2003).…”
Section: Related Work
confidence: 99%
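As a rough illustration of the shrinkage idea summarised in this statement, the sketch below interpolates a sparse leaf category's naïve Bayes word-probability estimates with those of its ancestors and a uniform distribution. The fixed mixture weights are a simplification for brevity, since McCallum et al. (1998) estimate them with EM, and the function and variable names here are invented.

```python
# Minimal sketch of shrinkage for a data-sparse leaf category: smooth the
# leaf's word-probability MLE with ancestor MLEs plus a uniform fallback.
# Fixed weights are an assumption; the original method learns them with EM.
import numpy as np

def shrunk_leaf_estimate(path_counts, vocab_size, weights):
    """path_counts: word-count vectors for [leaf, parent, ..., root];
    weights: one mixture weight per node plus one for the uniform component."""
    mles = [c / c.sum() for c in path_counts]           # per-node MLEs
    mles.append(np.full(vocab_size, 1.0 / vocab_size))  # uniform fallback
    return sum(w * p for w, p in zip(weights, mles))    # interpolated estimate

leaf   = np.array([5.0, 0.0, 1.0, 0.0])    # sparse leaf counts
parent = np.array([40.0, 10.0, 30.0, 20.0])
theta  = shrunk_leaf_estimate([leaf, parent], 4, [0.5, 0.4, 0.1])
print(theta, theta.sum())                   # still a proper distribution
```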
“…Many of these intuitions have been used in close association with a specific learning algorithm; the most popular choices in this respect have been naïve Bayesian methods (Chakrabarti et al. 1998; Gaussier et al. 2002; Koller and Sahami 1997; McCallum et al. 1998; Toutanova et al. 2001; Vinokourov and Girolami 2002), neural networks (Ruiz and Srinivasan 2002; Weigend et al. 1999; Wiener et al. 1995), support vector machines (Cai and Hofmann 2004; Dumais and Chen 2000; Liu et al. 2005; Yang et al. 2003), and example-based classifiers (Yang et al. 2003).…”
confidence: 99%