Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2002
DOI: 10.1145/564376.564401
Unsupervised document classification using sequential information maximization

Abstract: We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, the sIB is found to be consi…
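The abstract describes a sequential procedure: each document is drawn out of its current cluster and re-inserted where doing so costs the least information. A minimal sketch of that loop is below; it is an illustrative reconstruction, not the authors' implementation — the cluster representation, the weighted Jensen-Shannon merge cost, and the sweep schedule are standard choices assumed here.

```python
import numpy as np

def weighted_js(p, q, wp, wq):
    """Weighted Jensen-Shannon divergence between two pmfs p and q."""
    pi_p, pi_q = wp / (wp + wq), wq / (wp + wq)
    m = pi_p * p + pi_q * q
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return pi_p * kl(p, m) + pi_q * kl(q, m)

def sib_sketch(counts, k, n_sweeps=10, seed=0):
    """Sequential clustering sketch: draw each document out of its
    cluster and re-insert it where the merge cost (the weighted JS
    divergence between the document's word distribution and the
    cluster's) is smallest."""
    rng = np.random.default_rng(seed)
    n = counts.shape[0]
    p = counts / counts.sum(axis=1, keepdims=True)   # p(y|x) per document
    w = counts.sum(axis=1) / counts.sum()            # document priors p(x)
    labels = rng.integers(0, k, size=n)
    for _ in range(n_sweeps):
        for i in rng.permutation(n):
            labels[i] = -1                           # draw document i out
            costs = np.empty(k)
            for t in range(k):
                members = labels == t
                if not members.any():                # empty cluster: zero cost
                    costs[t] = 0.0
                    continue
                wt = w[members].sum()
                pt = (w[members, None] * p[members]).sum(axis=0) / wt
                costs[t] = (w[i] + wt) * weighted_js(p[i], pt, w[i], wt)
            labels[i] = int(np.argmin(costs))        # re-insert
    return labels
```

Each re-insertion can only lower (or keep) the total merge cost, which is why the sequential scheme converges to a local optimum, unlike a one-shot agglomerative merge.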


Cited by 177 publications (129 citation statements)
References 13 publications
“…The work in [72] uses a partially supervised EM algorithm which iteratively assigns labels to the unlabeled documents and refines them over time as convergence is achieved. A number of similar methods in this spirit are proposed in [4,14,35,47,89] with varying levels of supervision in the clustering process. Partially supervised clustering methods are also used for feature transformation in classification, as discussed in [17,18,88].…”
Section: Semi-supervised Clustering
confidence: 99%
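The iterative label-assign-and-refine loop described above can be sketched as self-training EM over a multinomial naive Bayes model. This is a schematic reading of the cited approach, not the implementation from [72]; the model family, Laplace smoothing, and fixed iteration count are assumptions.

```python
import numpy as np

def em_self_training(X_lab, y_lab, X_unl, n_classes, n_iter=20, alpha=1.0):
    """Partially supervised EM (multinomial naive Bayes): fit on the
    labeled documents, then alternate between soft-labeling the
    unlabeled documents (E-step) and refitting on everything (M-step)."""
    K, V = n_classes, X_lab.shape[1]
    # class-conditional word counts and class mass from the labeled data
    C = np.zeros((K, V))
    m = np.zeros(K)
    for k in range(K):
        C[k] = X_lab[y_lab == k].sum(axis=0)
        m[k] = (y_lab == k).sum()

    def m_step(counts, mass):
        log_prior = np.log((mass + alpha) / (mass.sum() + K * alpha))
        word = counts + alpha                      # Laplace smoothing
        log_word = np.log(word / word.sum(axis=1, keepdims=True))
        return log_prior, log_word

    log_prior, log_word = m_step(C, m)
    for _ in range(n_iter):
        # E-step: class posteriors for the unlabeled documents
        logp = log_prior + X_unl @ log_word.T
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: labeled counts plus fractional counts from R
        log_prior, log_word = m_step(C + R.T @ X_unl, m + R.sum(axis=0))
    return R.argmax(axis=1)
```

The soft responsibilities R are what "refines them over time" refers to: each EM round sharpens the provisional labels on the unlabeled pool.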
“…As in training, the 3x3, 5x5 and 10x10 maps were all tested. To analyse the results of categorisation between the topographic maps we utilised techniques from conventional text-based categorisation analysis, including: precision [50], the Jaccard or JAC method [51], and the Fowlkes-Mallows or FM method [52]. Since classification is unsupervised, it is not possible to apply these evaluation methods directly as would be the case for supervised learning.…”
Section: Testing
confidence: 99%
“…For this reason, the labels (architects) identified from training are maintained so as to assign categories. The "micro-averaged" precision matrix method [50] was first used to evaluate each network and the well-established JAC and FM methods were then used to evaluate cluster quality; see [40] for further details of these evaluation methods.…”
Section: Testing
confidence: 99%
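The JAC and FM measures quoted above both compare two labelings by counting agreements over document pairs. A minimal pair-counting sketch of these two indices (standard definitions, not taken from [51] or [52] directly):

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Over all document pairs: a = same cluster in both labelings,
    b = together in true only, c = together in predicted only."""
    a = b = c = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_t = labels_true[i] == labels_true[j]
        same_p = labels_pred[i] == labels_pred[j]
        if same_t and same_p:
            a += 1
        elif same_t:
            b += 1
        elif same_p:
            c += 1
    return a, b, c

def jaccard(labels_true, labels_pred):
    """Jaccard (JAC) index: agreeing pairs over all co-clustered pairs."""
    a, b, c = pair_counts(labels_true, labels_pred)
    return a / (a + b + c)

def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows (FM) index: geometric mean of pairwise
    precision and recall."""
    a, b, c = pair_counts(labels_true, labels_pred)
    return (a / (a + b)) ** 0.5 * (a / (a + c)) ** 0.5
```

Because both indices only look at whether pairs are co-clustered, they are invariant to permuting cluster labels — exactly what makes them usable when, as the quote notes, unsupervised cluster identities carry no fixed meaning.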
“…Each sequence X^n is generated by some unknown random process P_{X|Y}, uniquely determined by its label Y. Assuming that each element of X^n lies in a discrete and finite set X, its empirical distribution (or type [11]) is defined as the pmf $\hat{P}_{X^n}(x) = n^{-1}\sum_{i=1}^{n} \mathbf{1}(X_i = x)$, i.e., it results from counting the number of occurrences of each symbol x of X in X^n, and is an approximation of the true process. We now have the following Problem Formulation: Given L = |Y|, we want to find a partition A1, .…”
Section: Preliminaries and Problem Formulation
confidence: 99%
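The empirical distribution (type) defined in the quote is just per-symbol frequency counting. A small sketch, with the alphabet passed explicitly so that symbols absent from the sequence still get probability zero:

```python
from collections import Counter

def empirical_type(seq, alphabet):
    """Empirical distribution (the 'type') of a sequence: the fraction
    of positions at which each symbol of the alphabet occurs."""
    counts = Counter(seq)
    n = len(seq)
    return {x: counts[x] / n for x in alphabet}
```

By the law of large numbers, this pmf converges to the true per-symbol distribution as n grows, which is the sense in which it "approximates the true process."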
“…Experimental results from a benchmark task of document categorization from the "20 Newsgroups" corpus [10] show that ISPDTs, combined with Jensen-Rényi divergences and "strapping", are competitive with, and in most cases outperform, the sequential information bottleneck procedure [3], which is considered the state-of-the-art in unsupervised document categorization. The paper is organized as follows.…”
Section: Introduction
confidence: 99%