Document Clustering with K-tree

Vries, Christopher M. De; Geva, Shlomo

doi:10.1007/978-3-642-03761-0_43

Cited by 7 publications

(8 citation statements)

References 13 publications

“…A novel contribution of this paper is our investigation of Okapi BM25 (BM25) feature weighting. Only recently has BM25 been seriously considered in document clustering (de Vries and Geva 2008; Bashier and Rauber 2009; Whissell et al 2009; D'hondt et al 2010; Kutty et al 2010); with works that do use BM25 still being a small minority. Bashier and Rauber (2009) investigate relevance feedback using clustering.…”

Section: Introduction (mentioning)

confidence: 99%

“…Bashier and Rauber (2009) investigate relevance feedback using clustering. de Vries and Geva (2008) use BM25 weighting when clustering XML documents, but offer no comparison to tf-idf weighting using their clustering method. Kutty et al (2010) also cluster XML documents using BM25 weighting, with the authors showing an improvement over tf-idf.…”

Section: Introduction (mentioning)

confidence: 99%

Improving document clustering using Okapi BM25 feature weighting

Whissell

Clarke

2011

Inf Retrieval

We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally, we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.

Keywords: Document clustering · Feature weighting · Okapi BM25
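As a rough illustration of the "term saturation" idea mentioned in the abstract, the sketch below uses the standard BM25 term-frequency component with length normalization switched off (b = 0). The function names, the default k1 = 1.2, and the plain log(N/df) idf are assumptions made for this example; the exact formulation used by Whissell and Clarke is not given on this page.

```python
import math


def bm25_saturation(tf, k1=1.2):
    """Standard BM25 tf component with b = 0 (no length normalization).

    Assumption: this is the textbook form tf * (k1 + 1) / (tf + k1),
    not necessarily the exact variant used in the cited paper.
    """
    return tf * (k1 + 1.0) / (tf + k1)


def bm25_saturation_idf(tf, df, n_docs, k1=1.2):
    """Saturated tf multiplied by a plain log(N / df) idf (an assumed idf variant)."""
    return bm25_saturation(tf, k1) * math.log(n_docs / df)


if __name__ == "__main__":
    # A term seen once always gets weight 1; repeated terms approach k1 + 1.
    for k1 in (1.2, 20.0):
        weights = [round(bm25_saturation(tf, k1), 3) for tf in (1, 2, 5, 10, 50)]
        print(f"k1={k1}: {weights}")
```

With a small k1 the weight flattens out after a few occurrences, whereas a larger k1 keeps the weight closer to raw term frequency over a wider range, which is the parameter the abstract reports needing to be higher for clustering.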

“…The standard measures of Purity, Entropy, NMI and F1 are used to determine the quality of clusters with regard to the categories. Negentropy [5] is also used. It measures the same system property as Entropy but it is normalized and inverted so scores fall between 0 and 1 where 0 is the worst and 1 is the best.…”

Section: Clustering Evaluation Measures (mentioning)

confidence: 99%

“…Micro-purity of the clustering solution ω is obtained as a weighted sum of individual cluster purity. Macro-purity is the unweighted arithmetic mean based on the total number of categories [5].…”

Section: Clustering Evaluation Measures (mentioning)

confidence: 99%
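The micro- and macro-purity described in the statement above can be sketched as follows. This is one reading of the quoted sentences, not the track's official evaluation code: micro-purity weights each cluster's purity by its size, and macro-purity sums the per-cluster purities and divides by the number of distinct categories. The function names and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purities(cluster_ids, category_ids):
    """Return {cluster: (purity, size)}; purity is the majority-category share."""
    members = defaultdict(list)
    for cluster, category in zip(cluster_ids, category_ids):
        members[cluster].append(category)
    return {
        cluster: (Counter(cats).most_common(1)[0][1] / len(cats), len(cats))
        for cluster, cats in members.items()
    }


def micro_purity(cluster_ids, category_ids):
    """Size-weighted sum of individual cluster purities."""
    stats = cluster_purities(cluster_ids, category_ids)
    total = sum(size for _, size in stats.values())
    return sum(purity * size for purity, size in stats.values()) / total


def macro_purity(cluster_ids, category_ids):
    """Unweighted sum of cluster purities over the number of categories.

    Assumption: the denominator is the count of distinct categories, so the
    value can exceed 1 when there are more clusters than categories.
    """
    stats = cluster_purities(cluster_ids, category_ids)
    return sum(purity for purity, _ in stats.values()) / len(set(category_ids))


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1]
    categories = ["a", "a", "b", "b", "b"]
    print(micro_purity(clusters, categories))
    print(macro_purity(clusters, categories))
```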

Overview of the INEX 2010 XML Mining Track: Clustering and Classification of XML Documents

Vries

Nayak

Kutty

et al. 2011

Lecture Notes in Computer Science

Self Cite

“…In another work, Zhang et al [13] make use of the hyperlink structure between XML documents through an extension of a machine learning method based on the Self Organizing Maps for graphs. De Vries et al [9] use K-Trees to cluster XML documents so that they can obtain clusters in good quality with a low complexity method. Lastly, Tran et al [12] construct a latent semantic kernel to measure the similarity between content of the XML documents.…”

Section: Introduction (mentioning)

confidence: 99%

Exploiting Index Pruning Methods for Clustering XML Collections

Altıngövde

Atilgan

Ulusoy

2010

Focused Retrieval and Evaluation

Abstract. In this paper, we first employ the well known Cover-Coefficient Based Clustering Methodology (C³M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, underlying document vectors) and still generate a clustering structure that yields the same quality with that of the original collection, in terms of a set of evaluation metrics.
