A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is used also to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. In the expe; ments, two document databases are used: TODS214 and INSPEC. The latter is a common database with 12,684 documents.
In this article, two partitioning type clustering algorithms are presented. Both algorithms use the same method for selecting cluster seeds; however, assignment of documents to the seeds is different. The first algorithm uses a new concept called “cover coefficient,” and it is a single‐pass algorithm. The second one uses a conventional measure for document assignment to the cluster seeds and is a multipass algorithm. The concept of clustering, a model for seed oriented partitioning, the new centroid generation approach, and an illustration for both algorithms are also presented in the article.
An associative processor called RAP has been designed to provide hardware support for the use and manipulation of databases. RAP is particularly suited for supporting relational databases. In this paper, the relational operations provided by the RAP hardware are described, and a representative approach to providing the same relational operations with conventional software and hardware is devised. Analytic models are constructed for RAP and the conventional system. The execution times of several of the operations are shown to be vastly improved with RAP for large relations.
Indexing in information retrieval (IR) is used to obtain a suitable vocabulary of index terms and optimum assignment of these terms to documents for increasing the effectiveness and efficiency of an IR system. The concept of term discrimination value (TDV) is one of the criteria used for index-term selection. In this article a new concept called the cover coefficient (CC) will be used in computing TDVs. After a brief introduction to the theory of indexing and the CC concept, an efficient way of computing TDVs by use of the CC concept, index-term selection , and weight modification are discussed. It is also shown that the computational cost of the CC approach in the calculation of TDVs is favorably comparable to the cost of a different approach that uses similarity coefficients. Furthermore, the TDVs obtained by the CC approach are consistent with those of the latter approach.
In this paper, a new clustering algorithm has been described. The algorithm proposed determines both the number of clusters in a collection, and the number of elements in each cluster before beginning the final clustering process. The complexity assessment of the algorithm and the implementation issues are also emphasized.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.