1990
DOI: 10.1145/99935.99938

Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases

Abstract: A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is used also to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorit…

Cited by 104 publications (102 citation statements)
References 21 publications
“…Originally, Yao's formula determines the number of disk pages to be accessed to retrieve the related records of a query under the assumption that database records are randomly distributed among the same size pages. Later Can and Ozkarahan [6] adapted the formula for environments for pages (clusters) with different sizes. For using Yao's formula in our problem we treat the individual clusters of $C_s$ as queries and determine how their members (like the related documents of a query) are distributed in the clustering structure $C_t$.…”
Section: The Methods (mentioning)
confidence: 99%
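The quoted passage describes Yao's formula and its adaptation to clusters of unequal size without stating either one. The following is a minimal Python sketch, assuming the standard combinatorial form of Yao's formula and assuming that the adaptation replaces the common page size with each cluster's own size (summing, per cluster, the probability that at least one selected record falls in it). The function names `yao_expected_blocks` and `expected_clusters_hit` are hypothetical and do not come from the cited papers.

```python
from math import comb


def yao_expected_blocks(n, m, k):
    """Expected number of equal-size pages touched when k of n records
    are selected at random without replacement (Yao's formula).

    Assumes the n records are spread evenly over m pages.
    """
    if n % m != 0:
        raise ValueError("n must be a multiple of m for equal-size pages")
    page_size = n // m
    # Probability that a given page holds none of the k selected records.
    p_miss = comb(n - page_size, k) / comb(n, k)
    return m * (1 - p_miss)


def expected_clusters_hit(cluster_sizes, k):
    """Same expectation for pages (clusters) of different sizes.

    Assumed form of the adaptation: each cluster contributes the
    probability that at least one of the k selected records lies in it.
    """
    n = sum(cluster_sizes)
    total = comb(n, k)
    return sum(1 - comb(n - size, k) / total for size in cluster_sizes)


if __name__ == "__main__":
    # With equal cluster sizes the two functions agree.
    print(yao_expected_blocks(4, 2, 2))
    print(expected_clusters_hit([2, 2], 2))
```

In the setting of the quote, each cluster of $C_s$ would play the role of the query, with `k` set to that cluster's size and `cluster_sizes` taken from $C_t$.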
“…We refer to the entity $n_t$ as the Translation Relationship Index (TRI) and check the merit of the index by comparing it with the value of $n_{tr}$. The existence of $n_{tr}$, which can be directly computed by the modified Yao's formula [6], gives TRI the attribute of a measurement criterion, since $n_{tr}$ provides a benchmark or a reference point. If the observed TRI value indicates that the relationship is different from random (i.e., if $n_t$ is smaller than $n_{tr}$), we obtain the baseline distribution for $n_{tr}$ using the Monte Carlo approach to decide if the difference is significant.…”
Section: The Methods (mentioning)
confidence: 99%
“…It is a seed oriented, partitioning, single-pass, linear-time clustering algorithm introduced in [3]. The main goal of $C^3M$ is to convey the relationships among documents using a two-stage probability experiment.…”
Section: Clustering (mentioning)
confidence: 99%
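The two-stage probability experiment mentioned in the statement above can be illustrated with a short Python sketch. It assumes a binary document-by-term matrix `D` and the usual reading of the experiment: first pick one of a document's terms uniformly at random, then pick one of the documents containing that term uniformly at random. The entry `C[i][j]` is then the probability of reaching document `j` from document `i`, and the sum of the diagonal is taken as the estimated number of clusters. The function names and the example matrix are hypothetical, and seed selection and cluster formation are not covered.

```python
def cover_coefficients(D):
    """Return the cover-coefficient matrix C for a document-by-term matrix D.

    C[i][j] = (1 / row_sum_i) * sum_k D[i][k] * D[j][k] / col_sum_k
    (assumed form of the two-stage experiment).
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every document needs a term and every term a document")
    return [
        [
            sum(D[i][k] * D[j][k] / col_sums[k] for k in range(n)) / row_sums[i]
            for j in range(m)
        ]
        for i in range(m)
    ]


def estimated_cluster_count(C):
    """Sum of the diagonal entries, read as the estimated number of clusters."""
    return sum(C[i][i] for i in range(len(C)))


# Hypothetical example: two identical documents plus one unrelated document.
D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
C = cover_coefficients(D)
print(estimated_cluster_count(C))  # 2.0
```

Each row of `C` sums to 1 because it is a probability distribution over documents, which gives a quick sanity check on any input matrix.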
“…If none of the seeds covers the non-seed document, then, it is directly added to the Others cluster. Detailed information about $C^3M$ can be found in [3]. Modified sequential k-means algorithm.…”
Section: Clustering (mentioning)
confidence: 99%
“…More specifically, we use a partitioning type clustering algorithm, so-called Cover-Coefficient Based Clustering Methodology ($C^3M$) [7], along with some index pruning techniques for clustering XML documents.…”
Section: Introduction (mentioning)
confidence: 99%