Document clustering has not been well received as an information retrieval tool.
A simple method for categorizing texts into pre-determined text genre categories using the standard statistical technique of discriminant analysis is demonstrated with application to the Brown corpus. Discriminant analysis makes it possible to use a large number of parameters that may be specific to a certain corpus or information stream, and to combine them into a small number of functions, with the parameters weighted on the basis of how useful they are for discriminating text genres. An application to information retrieval is discussed.

Text Types

There are different types of text. Texts "about" the same thing may be in differing genres, of different types, and of varying quality. Texts vary along several parameters, all relevant for the general information retrieval problem of matching reader needs and texts. Given this variation, in a text retrieval context the problems are (i) identifying genres, and (ii) choosing criteria to cluster texts of the same genre, with predictable precision and recall. This should not be confused with the issue of identifying topics and choosing criteria that discriminate one topic from another. Although not orthogonal to genre-dependent variation, the variation that relates directly to content and topic is along other dimensions. Naturally, there is covariance. Texts about certain topics may only occur in certain genres, and texts in certain genres may only treat certain topics; most topics do, however, occur in several genres, which is what interests us here. Douglas Biber has studied text variation along several parameters, and found that texts can be considered to vary along five dimensions. In his study, he clusters features according to covariance to find underlying dimensions (1989).
We wish to find a method for identifying easily computable parameters that rapidly classify previously unseen texts in general classes and along a small set of dimensions (smaller than Biber's five), such that they can be explained in intuitively simple terms to the user of an information retrieval application. Our aim is to take a set of texts that has been selected by some sort of crude semantic analysis, such as is typically performed by an information retrieval system, and partition it further by genre or text type, and to display this variation as simply as possible in one or two dimensions.

Method

We start by using features similar to those first investigated by Biber, but we concentrate on those that are easy to compute assuming we have a part-of-speech tagger (Church, 1988), such as third person pronoun occurrence rate as opposed to 'general hedges' (Biber, 1989). More and more of Biber's features will be available with the advent of more proficient analysis programs, for instance if complete surface syntactic parsing were performed before categorization (Voutilainen & Tapanainen, 1993).

We then use discriminant analysis, a technique from descriptive statistics. Discriminant analysis takes a set of precategorized individuals and data on their variation on a number of parameters, ...
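The idea of weighting parameters by how well they separate precategorized texts can be illustrated with a minimal two-class Fisher linear discriminant. This is only a sketch under stated assumptions, not the procedure used in the study: the two features (third person pronoun rate per 1000 words and mean sentence length), the genre labels "fiction" and "press", and all numeric values are invented for illustration.

```python
from statistics import mean


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def centroid(rows):
    """Per-feature mean of a list of 2-feature vectors."""
    return [mean(r[i] for r in rows) for i in range(2)]


def scatter_matrix(rows, mu):
    """2x2 sum of outer products of deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit(class_a, class_b):
    """Return (weights, threshold) of a two-class linear discriminant."""
    mu_a, mu_b = centroid(class_a), centroid(class_b)
    sa, sb = scatter_matrix(class_a, mu_a), scatter_matrix(class_b, mu_b)
    # Pooled within-class scatter.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # Feature weights: inverse within-class scatter times mean difference.
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    # Decision threshold halfway between the projected class means.
    threshold = 0.5 * (dot(w, mu_a) + dot(w, mu_b))
    return w, threshold


def classify(x, w, threshold, labels=("fiction", "press")):
    """Assign the first label when the discriminant score exceeds the threshold."""
    return labels[0] if dot(w, x) > threshold else labels[1]


# Hypothetical training data: (pronoun rate per 1000 words, mean sentence length).
fiction = [(42.0, 14.0), (38.0, 16.0), (45.0, 13.0), (40.0, 15.0)]
press = [(12.0, 22.0), (15.0, 24.0), (10.0, 21.0), (14.0, 23.0)]

w, threshold = fit(fiction, press)
print(classify((35.0, 17.0), w, threshold))
print(classify((11.0, 25.0), w, threshold))
```

The fitted weights play the role of the "small number of functions" mentioned in the abstract: each text is reduced to a single score, and the sign and size of each weight show how much a feature contributes to telling the two classes apart.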
The Scatter/Gather document browsing method uses fast document clustering to produce table-of-contents-like outlines of large document collections. Previous work [1] developed linear-time document clustering algorithms to establish the feasibility of this method over moderately large collections. However, even linear-time algorithms are too slow to support interactive browsing of very large collections such as Tipster, the DARPA standard text retrieval evaluation collection. We present a scheme that supports constant interaction-time Scatter/Gather of arbitrarily large collections after near-linear time preprocessing. This involves the construction of a cluster hierarchy. A modification of Scatter/Gather employing this scheme, and an example of its use over the Tipster collection, are presented.
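One way to see why a precomputed cluster hierarchy can bound interaction time is the following sketch. It is an illustration under assumptions, not the scheme from the paper: the `Node` type, the `expand` function, the `max_nodes` bound, and the toy tree are all hypothetical. The point is only that the work per interaction depends on the bound, not on the number of documents under the focus.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove targets the exact node
class Node:
    label: str
    size: int  # number of documents under this node
    children: list = field(default_factory=list)


def expand(focus, max_nodes):
    """Replace the largest expandable node by its children until the
    frontier would exceed max_nodes. The cost is bounded by max_nodes
    and the fan-out, independent of the collection size."""
    frontier = list(focus)
    while True:
        expandable = [n for n in frontier if n.children]
        if not expandable:
            break
        biggest = max(expandable, key=lambda n: n.size)
        if len(frontier) - 1 + len(biggest.children) > max_nodes:
            break
        frontier.remove(biggest)
        frontier.extend(biggest.children)
    return frontier


# Hypothetical precomputed hierarchy (built offline in a real system).
root = Node("all", 8, [
    Node("news", 5, [Node("politics", 3), Node("sports", 2)]),
    Node("science", 3, [Node("physics", 2), Node("biology", 1)]),
])

outline = expand([root], max_nodes=3)
print([(n.label, n.size) for n in outline])
```

Here the outline shown to the user is read off the stored hierarchy rather than computed by clustering the underlying documents at interaction time.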
For free-text search over rapidly evolving corpora, dy-
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms.
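The browsing loop with clustering as its primary operation can be sketched as follows. This is a minimal illustration, assuming a bag-of-words representation and a single-pass "assign each document to its most similar seed" step as a stand-in for the paper's clustering algorithms, which are not described in the abstract. The function names, the choice of the first k documents as seeds, and the sample documents are all hypothetical.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    shared = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return shared / (na * nb) if na and nb else 0.0


def scatter(docs, k):
    """Partition docs into at most k groups in one pass: O(k * n)
    similarity computations. Seeds are simply the first k documents."""
    seeds = [vectorize(d) for d in docs[:k]]
    clusters = [[] for _ in seeds]
    for d in docs:
        v = vectorize(d)
        best = max(range(len(seeds)), key=lambda i: cosine(v, seeds[i]))
        clusters[best].append(d)
    return [c for c in clusters if c]


def summarize(cluster, n_terms=3):
    """Most frequent terms in a cluster, shown to the user as its outline entry."""
    counts = Counter()
    for d in cluster:
        counts.update(vectorize(d))
    return [t for t, _ in counts.most_common(n_terms)]


def gather(clusters, chosen):
    """Union of the clusters the user selected, forming the next focus set."""
    return [d for i in chosen for d in clusters[i]]


docs = [
    "oil prices rise as markets react",
    "team wins the final match",
    "oil exports fall while prices climb",
    "coach praises team after match",
    "markets slide on oil supply fears",
    "striker scores twice in final",
]

clusters = scatter(docs, k=2)          # scatter the whole collection
for c in clusters:
    print(summarize(c), len(c))
focus = gather(clusters, chosen=[0])   # user picks the first group
subclusters = scatter(focus, k=2)      # scatter again at finer grain
```

Each round narrows the focus set and re-clusters it, so the user navigates by choosing groups rather than by formulating a query.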