This paper reports on a novel technique for literature indexing and searching in a mechanized library system. The notion of relevance is taken as the key concept in the theory of information retrieval and a comparative concept of relevance is explicated in terms of the theory of probability. The resulting technique, called “Probabilistic Indexing,” allows a computing machine, given a request for information, to make a statistical inference and derive a number (called the “relevance number”) for each document, which is a measure of the probability that the document will satisfy the given request. The result of a search is an ordered list of those documents which satisfy the request, ranked according to their probable relevance. The paper goes on to show that whereas in a conventional library system the cross-referencing (“see” and “see also”) is based solely on the “semantical closeness” between index terms, statistical measures of closeness between index terms can be defined and computed. Thus, given an arbitrary request consisting of one (or many) index term(s), a machine can elaborate on it to increase the probability of selecting relevant documents that would not otherwise have been selected. Finally, the paper suggests an interpretation of the whole library problem as one where the request is considered as a clue on the basis of which the library system makes a concatenated statistical inference in order to provide as an output an ordered list of those documents which most probably satisfy the information needs of the user.
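The ranking idea in the abstract above can be illustrated with a minimal sketch. The abstract does not state the formula, so the scoring rule here is an assumption: each document's relevance number is taken to be proportional to a prior probability for the document multiplied by the weight with which the requested index term was assigned to it. The function name `rank_by_relevance`, the document ids, and all weights and priors are hypothetical illustration values, not data from the paper.

```python
from typing import Dict, List, Tuple


def rank_by_relevance(
    term: str,
    index_weights: Dict[str, Dict[str, float]],
    priors: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Return (doc_id, relevance_number) pairs, highest first.

    Assumed scoring rule (not taken from the abstract):
    score(doc) = prior(doc) * weight(term, doc), normalized so that the
    scores of all documents indexed under the term sum to 1.
    """
    scores = []
    for doc_id, term_weights in index_weights.items():
        weight = term_weights.get(term, 0.0)
        if weight > 0.0:  # only documents indexed under the term are candidates
            scores.append((doc_id, priors.get(doc_id, 0.0) * weight))

    total = sum(score for _, score in scores)
    if total == 0.0:
        return []  # no document can be ranked for this term

    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score / total) for doc_id, score in ranked]


# Hypothetical index: each document carries weighted index terms.
INDEX_WEIGHTS = {
    "d1": {"radar": 0.9},
    "d2": {"radar": 0.3, "sonar": 0.8},
    "d3": {"sonar": 0.5},
}
# Hypothetical prior probabilities, e.g. estimated from past usage.
PRIORS = {"d1": 0.2, "d2": 0.5, "d3": 0.3}

for doc_id, relevance_number in rank_by_relevance("radar", INDEX_WEIGHTS, PRIORS):
    print(doc_id, round(relevance_number, 3))
```

The output is an ordered list rather than an unordered match set, which is the behavior the abstract describes; a request with several terms, or one expanded through statistically close terms, would need a combination rule that this sketch does not attempt.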
An evaluation of a large, operational full-text document-retrieval system (containing roughly 350,000 pages of text) shows the system to be retrieving less than 20 percent of the documents relevant to a particular search. The findings are discussed in terms of the theory and practice of full-text document retrieval. March 1985, Volume 28, Number 3, Communications of the ACM.
This inquiry examines a technique for automatically classifying (indexing) documents according to their subject content. The task, in essence, is to have a computing machine read a document and on the basis of the occurrence of selected clue words decide to which of many subject categories the document in question belongs. This paper describes the design, execution and evaluation of a modest experimental study aimed at testing empirically one statistical technique for automatic indexing.
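As a rough illustration of classifying a document from the occurrence of selected clue words, the sketch below uses a naive Bayes classifier with add-one smoothing. This is an assumed stand-in: the abstract does not say which statistical technique the study tested. The function names `train` and `classify`, the clue words, the categories, and the training documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def train(
    labeled_docs: Iterable[Tuple[List[str], str]],
    clue_words: Set[str],
) -> Tuple[Counter, Dict[str, Counter]]:
    """Count documents per category and clue-word occurrences per category."""
    category_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    for words, category in labeled_docs:
        category_counts[category] += 1
        for word in words:
            if word in clue_words:  # non-clue words are ignored
                word_counts[category][word] += 1
    return category_counts, word_counts


def classify(
    words: List[str],
    category_counts: Counter,
    word_counts: Dict[str, Counter],
    clue_words: Set[str],
) -> str:
    """Return the category with the highest log-posterior score."""
    total_docs = sum(category_counts.values())
    best_category, best_score = "", -math.inf
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        # Add-one smoothing over the clue-word vocabulary.
        denominator = sum(word_counts[category].values()) + len(clue_words)
        for word in words:
            if word in clue_words:
                score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical clue words and training documents.
CLUE_WORDS = {"transistor", "circuit", "voltage", "program", "compiler", "memory"}
TRAINING = [
    (["transistor", "circuit"], "electronics"),
    (["circuit", "voltage"], "electronics"),
    (["program", "compiler"], "computing"),
    (["program", "memory"], "computing"),
]

counts, words_by_category = train(TRAINING, CLUE_WORDS)
print(classify(["the", "circuit", "voltage"], counts, words_by_category, CLUE_WORDS))
```

Restricting the vocabulary to a fixed clue-word set mirrors the "selected clue words" of the abstract; how those words are selected is outside the scope of this sketch.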
The primary objective of this paper is to examine the concept of about as it is used in its information retrieval sense when, for example, an indexer judges that a document is (or is not) about some given subject. The problem with about is that it is a very complex notion and we are unable to say precisely what it is we do when we make a judgment of aboutness. Since about is at the heart of indexing, how are we to formulate any proper theory of indexing if we cannot explicate precisely the key concept of about? In this paper we look at this concept of about and offer a solution to the problem mentioned; it consists of an operational definition of about which interprets about in terms of search behavior. A second objective of this paper is to show that about is, in fact, not the central concept in a theory of document retrieval. A document retrieval system ought to provide a ranked output (in response to a search query) in which documents are ordered not according to the degree that they are about the topic sought by the inquiring patron, but rather according to the probability that they will satisfy that person's information need. This paper shows how aboutness is related to probability of satisfaction.
One of the most perplexing problems of information retrieval has been the establishment of rational criteria for deciding what index terms or descriptors to assign to a unit of stored information for purposes of later retrieval. Both probabilistic and utility-theoretic criteria have in the past been proposed for this purpose. The present paper derives explicit decision rules of both kinds from a common conceptual and mathematical foundation. The result is a unified theory of indexing.
KEY WORDS AND PHRASES: indexing, cataloging, classification, index terms, descriptors, information retrieval, document retrieval, reference retrieval, utility-theoretic indexing, probabilistic indexing
CR CATEGORIES: 3.70, 3.71, 3.72, 3.75
The question of how to index documents is widely regarded as a main issue, if indeed not the central theoretical problem, of the subfield of information retrieval known as document or reference retrieval. The problem setting is as follows. There exists a large document collection on the one hand, and on the other a population of individuals (potential retrieval system patrons) each of whom needs or wants information he thinks might be supplied by documents in the collection. The indexing problem is: How should the documents in the collection be identified ("indexed," "cataloged," etc.) so that the collection can be searched to the maximal collective benefit of the patrons?
In 1960 one of us (Maron), in collaboration with J.L. Kuhns, addressed this question and developed a theory of indexing known as Probabilistic Indexing ([14]; cf. [11,12]). The theory interpreted the indexing operation in such a way that a document retrieval system could use index information to compute and rank output documents according to the probability that each would satisfy the inquiring patron.
More recently, Cooper developed another theory of indexing which might appropriately be called Utility-Theoretic Indexing since it is based on the precepts of utility theory, including the rudiments of decision theory [5]. Utility-Theoretic Indexing is predicated on the assumption that index terms should be assigned to documents in such a way as to reflect the utility (or value) that the document in question would be expected to provide to the patron searching under the term in question. Related approaches have been explored by Bookstein and Swanson [1], Harter [8], Kraft [10], Kochen [9], and others. The purpose of this paper is to explain and clarify the conceptual foundations common to both Probabilistic and Utility-Theoretic Indexing, and to show how the two theories complement one another.