Techniques of exploratory data analysis are used to study the weight of evidence that the occurrence of a query term provides in support of the hypothesis that a document is relevant to an information need. In particular, the relationship between document frequency and weight of evidence is investigated. A correlation between document frequency normalized by collection size and the mutual information between relevance and term occurrence is uncovered. This correlation is found to be robust across a variety of query sets and document collections. Based on this relationship, a theoretical explanation of the efficacy of inverse document frequency for term weighting is developed which differs in both style and content from theories previously put forth. The theory predicts that a "flattening" of idf at both low and high document frequency should result in improved retrieval performance. This altered idf formulation is tested on all TREC query sets, and the retrieval results corroborate the prediction of improved performance. In conclusion, we argue that exploratory data analysis can be a valuable tool for research whose goal is the development of an explanatory theory of information retrieval.

Introduction

In 1972, Sparck Jones demonstrated that document frequency can be used effectively for the weighting of query terms [23]. Ever since, formulations of inverse document frequency have played a key role in information retrieval research. In this paper a theory of why inverse document frequency has been so effective is developed. Both the approach taken and the conclusions drawn differ from theories previously put forth. Employing techniques of exploratory data analysis (EDA), the weight of evidence (WOE) in favor of relevance offered by query term occurrence is studied. The result is an explanatory theory of inverse document frequency (idf), derived from observed statistical regularities of extensive retrieval data. The work reported here is the first phase of a larger research project whose goal is the development of a retrieval formula that: 1) is explanatory, in that each component of the formula has a direct interpretation in terms of measurable statistical characteristics of identifiable retrieval objects (query terms, documents, etc.); 2) is supported by the careful observation and study of empirical data; and 3) yields retrieval performance comparable, if not superior, to current state-of-the-art retrieval systems. The goal of this work is not primarily the production of an improved retrieval technique...
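The predicted "flattening" can be pictured as clamping the usual idf curve outside a middle band of normalized document frequencies, so that very rare and very common terms stop gaining or losing weight. A minimal sketch in Python; the cutoff fractions low_frac and high_frac are illustrative assumptions, not the values estimated in the paper:

```python
import math

def flattened_idf(df, N, low_frac=1e-4, high_frac=0.2):
    """idf clamped ("flattened") at both document-frequency extremes.

    df        -- number of documents containing the term
    N         -- total number of documents in the collection
    low_frac  -- hypothetical cutoff: terms rarer than this fraction of the
                 collection all receive the same maximum weight
    high_frac -- hypothetical cutoff: terms more common than this fraction
                 all receive the same minimum weight
    """
    frac = min(max(df / N, low_frac), high_frac)  # flatten both tails
    return math.log(1.0 / frac)                   # ordinary idf in the middle band

# A term in 5 of 1,000,000 documents gets the same weight as one in 100,
# because both fall below low_frac.
```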
We describe an algorithm for choosing term weights to maximize average precision. The algorithm performs successive exhaustive searches through single directions in weight space, using a novel technique for considering all possible values of average precision that arise when searching for a maximum in a given direction. We apply the algorithm and compare it to a maximum entropy approach.
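The search strategy described here is a form of coordinate ascent. A simplified sketch, assuming a toy single-query setup with dictionary term vectors; the grid over candidate values stands in for the paper's technique of enumerating every average-precision value attainable along a direction:

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list of doc ids given the relevant set."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / max(len(relevant), 1)

def score(doc_vec, weights):
    """Weighted sum of a document's term features."""
    return sum(w * doc_vec.get(t, 0) for t, w in weights.items())

def coordinate_search(weights, docs, relevant, grid, sweeps=3):
    """Repeatedly sweep the weight vector, exhaustively trying each
    candidate value for one weight at a time while holding the others
    fixed (search along single directions in weight space)."""
    def ap(w):
        ranking = sorted(docs, key=lambda d: -score(docs[d], w))
        return average_precision(ranking, relevant)

    best = ap(weights)
    for _ in range(sweeps):
        for term in list(weights):
            for v in grid:                    # exhaustive in one direction
                trial = dict(weights)
                trial[term] = v
                s = ap(trial)
                if s > best:
                    best, weights = s, trial
    return weights, best

# Hypothetical usage:
# docs = {"d1": {"idf": 2, "tf": 1}, "d2": {"tf": 3}, "d3": {"idf": 1}}
# w, best = coordinate_search({"idf": 1.0, "tf": 1.0}, docs, {"d1", "d3"},
#                             grid=[0.0, 0.5, 1.0, 2.0])
```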
This paper takes a fresh look at modeling approaches to information retrieval that have been the basis of much of the probabilistically motivated IR research over the last 20 years. We adopt a subjectivist Bayesian view of probabilities and argue that classical work on probabilistic retrieval is best understood from this perspective. The main focus of the paper is the ranking formulas corresponding to the Binary Independence Model (BIM), presented originally by Robertson and Sparck Jones [1977], and the Combination Match Model (CMM), developed shortly thereafter by Croft and Harper [1979]. We show how these same ranking formulas can result from a probabilistic methodology commonly known as Maximum Entropy (MAXENT).
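For reference, the BIM ranking formula in question is the standard Robertson/Sparck Jones relevance weight: a document is ranked by the sum, over the query terms it contains, of

\[
w_i \;=\; \log \frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}
\]

where \(p_i\) is the probability that term \(i\) occurs in a relevant document and \(q_i\) the probability that it occurs in a non-relevant one.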
Selectional preferences have a long history in both generative and computational linguistics. However, since the publication of Resnik's dissertation in 1993, a new approach has surfaced in the computational linguistics community. This new line of research combines knowledge represented in a pre-defined semantic class hierarchy with statistical tools including information theory, statistical modeling, and Bayesian inference. These tools are used to learn selectional preferences from examples in a corpus. Instead of simple sets of semantic classes, selectional preferences are viewed as probability distributions over various entities. We survey research that extends Resnik's initial work, discuss the strengths and weaknesses of each approach, and show how they together form a cohesive line of research.
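As a concrete illustration of "selectional preferences as probability distributions," here is a toy sketch of Resnik-style selectional preference strength: the KL divergence between the class distribution conditioned on a predicate and the prior class distribution. The class hierarchy and smoothing of the original work are omitted, and the input format is a hypothetical list of observed (predicate, class) pairs:

```python
import math
from collections import Counter

def selectional_preference_strength(pairs):
    """KL divergence between P(class | predicate) and the prior P(class),
    estimated from raw (predicate, class) co-occurrence counts."""
    prior = Counter(c for _, c in pairs)
    total = sum(prior.values())
    strengths = {}
    for v in {p for p, _ in pairs}:
        cond = Counter(c for p, c in pairs if p == v)
        n = sum(cond.values())
        strengths[v] = sum(
            (k / n) * math.log((k / n) / (prior[c] / total))
            for c, k in cond.items()
        )
    return strengths

# A predicate whose arguments concentrate in few classes (e.g. "drink"
# with BEVERAGE) scores high; one that accepts anything scores near zero.
```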