ABSTRACT. Several graph-theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters. Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar. Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner. For threshold levels T > 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.
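The thresholding step and cluster definition (1) can be illustrated with a minimal sketch. The 5-term similarity matrix `SIM` and the function names below are hypothetical and chosen only for illustration; the abstract does not give the similarity measure or the data, and the sketch assumes that entries at or above T are kept and that the diagonal is ignored.

```python
from typing import List

def threshold_matrix(sim: List[List[float]], t: float) -> List[List[int]]:
    """Binary threshold matrix: 1 where similarity >= t (diagonal ignored)."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= t else 0 for j in range(n)]
            for i in range(n)]

def connected_components(binary: List[List[int]]) -> List[List[int]]:
    """Connected components of the graph whose adjacency matrix is `binary`."""
    n = len(binary)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search from each term not yet assigned to a cluster.
        stack = [start]
        seen.add(start)
        comp = []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in range(n):
                if binary[v][w] and w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(comp))
    return components

# Hypothetical symmetric term-term similarity matrix for five terms.
SIM = [
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.3, 0.2, 0.0],
    [0.7, 0.3, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.1, 0.9, 1.0],
]

for t in (0.2, 0.6, 0.75):
    print(t, connected_components(threshold_matrix(SIM, t)))
```

With this toy matrix, lowering T merges all terms into one component, while raising it splits off weakly related terms. A dense matrix is used here for brevity; a collection of several thousand terms would likely call for a sparse representation.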
We discuss how alternative methods of automatic term clustering may provide insight into how terms are related within a corpus. The work reported uses a corpus of 2267 documents that contain 3950 index terms. A similarity matrix is developed from the document–term matrix. A threshold level T is applied to the similarity matrix: entries that are greater than or equal to the threshold level are set to one, and the remaining entries are set to zero. Three definitions are applied to the corresponding graph of each threshold matrix to develop clusters: (1) the connected components of the graph, (2) the maximal complete subgraphs of the graph, and (3) the combined maximal complete subgraphs of the graph as described by Gotlieb and Kumar. Two examples are described that show how insight may be gained into the term relations by varying the threshold levels and the cluster definitions.
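To see how definition (2) can differ from definition (1), the following sketch lists the maximal complete subgraphs of a small graph. It uses the Bron-Kerbosch procedure with pivoting as a stand-in; it is not the Bierstone or Bonner algorithm, whose details are not given here. The adjacency data `ADJ` and the function name are hypothetical.

```python
from typing import Dict, List, Set

def maximal_cliques(adj: Dict[int, Set[int]]) -> List[List[int]]:
    """Enumerate maximal complete subgraphs with Bron-Kerbosch (pivoting)."""
    cliques: List[List[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidates to extend it, x: already explored.
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

# Hypothetical thresholded graph: terms 0, 1, 2 are mutually related,
# term 3 is related only to term 2, and term 4 is isolated.
ADJ = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
    4: set(),
}

print(maximal_cliques(ADJ))
```

In this example terms 0 to 3 form a single connected component, but the maximal complete subgraphs separate the tightly related triple from the pair that shares only term 2, which is the kind of contrast between cluster definitions that the text describes.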