We present a probabilistic relational algebra (PRA) which is a generalization of standard relational algebra. In PRA, tuples are assigned probabilistic weights giving the probability that a tuple belongs to a relation. Based on intensional semantics, the tuple weights of the result of a PRA expression always conform to the underlying probabilistic model. We also show for which expressions extensional semantics yields the same results. Furthermore, we discuss complexity issues and indicate possibilities for optimization. With regard to databases, the approach allows for representing imprecise attribute values, whereas for information retrieval, probabilistic document indexing and probabilistic search term weighting can be modeled. We introduce the concept of vague predicates which yield probabilistic weights instead of Boolean values, thus allowing for queries with vague selection conditions. With these features, PRA implements uncertainty and vagueness in combination with the relational model.
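The abstract's core ideas (probabilistic tuple weights, extensional vs. intensional semantics, vague predicates) can be illustrated with a small sketch. This is a toy under extensional semantics, i.e. it assumes the events behind distinct tuples are independent; intensional semantics would instead carry an event expression per tuple, and the paper characterizes when the two coincide. The representation (relations as dicts from tuples to weights) and all names are illustrative assumptions, not the paper's notation.

```python
def vague_select(rel, pred):
    """Selection with a vague predicate: pred returns a weight in [0, 1]
    rather than a Boolean, and tuple weights are scaled accordingly."""
    result = {}
    for t, w in rel.items():
        p = w * pred(t)
        if p > 0:
            result[t] = p
    return result

def product(r, s):
    """Cartesian product: weights multiply, which is only valid under
    the extensional independence assumption."""
    return {t1 + t2: w1 * w2 for t1, w1 in r.items() for t2, w2 in s.items()}

# Example: tuples carry membership probabilities.
docs = {("d1", "IR"): 0.8, ("d2", "DB"): 0.5}
about_ir = vague_select(docs, lambda t: 0.9 if t[1] == "IR" else 0.1)
```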
This paper gives an introduction to and survey of probabilistic information retrieval (IR). First, the basic concepts of this approach are described: the probability-ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR along with the corresponding event space clarifies the interpretation of the probabilistic parameters involved. For the estimation of these parameters, three different learning strategies are distinguished, namely query-related, document-related and description-related learning. As a representative of each of these strategies, a specific model is described. A new approach regards IR as uncertain inference; here, imaging is used as a new technique for estimating the probabilistic parameters, and probabilistic inference networks support more complex forms of inference. Finally, the more general problems of parameter estimation, query expansion and the development of models for advanced document representations are discussed.
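As a concrete illustration of the probability-ranking principle, the sketch below ranks documents by a retrieval status value that is rank-equivalent to the probability of relevance, using a binary-independence-style term weight. Here p_t = P(term t occurs | relevant) and q_t = P(term t occurs | nonrelevant); in a query-related learning setting, both would be estimated from relevance feedback. All function and parameter names are illustrative assumptions.

```python
import math

def term_weight(p_t, q_t):
    """Binary-independence log-odds weight for a query term."""
    return math.log(p_t * (1 - q_t) / (q_t * (1 - p_t)))

def rsv(doc_terms, query_weights):
    """Retrieval status value: sum of weights of query terms in the document."""
    return sum(w for t, w in query_weights.items() if t in doc_terms)

def rank(docs, query_weights):
    """docs: dict doc_id -> set of terms. Highest RSV first (PRP ordering)."""
    return sorted(docs, key=lambda d: rsv(docs[d], query_weights), reverse=True)
```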
We describe a method for probabilistic document indexing using relevance feedback data that has been collected from a set of queries. Our approach is based on three new concepts: (1) Abstraction from specific terms and documents, which overcomes the restriction of limited relevance information for parameter estimation. (2) Flexibility of the representation, which allows the integration of new text analysis and knowledge-based methods in our approach as well as the consideration of document structures or different types of terms. (3) Probabilistic learning or classification methods for the estimation of the indexing weights, making better use of the available relevance information. Our approach can be applied under restrictions that hold for real applications. We give experimental results for five test collections which show improvements over other methods.
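A minimal sketch of the abstraction idea: (term, document) pairs are mapped to feature vectors ("relevance descriptions"), and the indexing weight is estimated as P(relevant | feature vector) from relevance feedback pooled across all queries and documents. The relative-frequency estimator and the example features are assumptions for illustration; the paper's actual classification methods are more general.

```python
from collections import defaultdict

def estimate_indexing_weights(feedback):
    """feedback: iterable of (features, is_relevant) pairs, where `features`
    is a hashable description of a (term, document) pair, e.g. a tuple
    (within-doc frequency bucket, occurs_in_title). Returns a dict mapping
    each description to the estimated P(relevant | features)."""
    counts = defaultdict(lambda: [0, 0])  # features -> [relevant, total]
    for features, is_relevant in feedback:
        counts[features][0] += int(is_relevant)
        counts[features][1] += 1
    return {f: rel / tot for f, (rel, tot) in counts.items()}
```

Because the estimates are attached to descriptions rather than to specific terms or documents, even sparse feedback yields usable weights for unseen (term, document) pairs.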
In networked IR, a client submits a query to a broker, which is in contact with a large number of databases. In order to yield a maximum number of documents at minimum cost, the broker has to estimate the retrieval cost of each database, decide for each database whether or not to use it for the current query, and, if so, how many documents to retrieve from it. For this purpose, we develop a general decision-theoretic model and discuss different cost structures. Besides the cost of retrieving relevant versus nonrelevant documents, we consider the following parameters for each database: expected retrieval quality, expected number of relevant documents in the database, and cost factors for query processing and document delivery. For computing the overall optimum, a divide-and-conquer algorithm is given. If there are several brokers knowing different databases, a preselection of brokers can only be performed heuristically, but the computation of the optimum can be done similarly to the single-broker case. In addition, we derive a formula which estimates the number of relevant documents in a database based on dictionary information.
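To make the allocation problem concrete, here is a greedy sketch: always take the next document from the database with the lowest expected marginal cost, charging the per-query processing cost to a database's first document. This greedy scheme is an illustrative stand-in for the paper's divide-and-conquer optimizer and is only optimal when marginal costs are nondecreasing; all parameter names are assumptions.

```python
import heapq

def allocate(databases, budget):
    """databases: dict name -> (marginal_cost_fn, per_query_cost), where
    marginal_cost_fn(k) is the expected cost of the k-th document retrieved
    from that database (delivery cost plus the expected cost of it being
    nonrelevant). Returns dict name -> number of documents to request."""
    alloc = {name: 0 for name in databases}
    # First document from each database also incurs the query-processing cost.
    heap = [(mc(1) + fixed, name) for name, (mc, fixed) in databases.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        cost, name = heapq.heappop(heap)
        alloc[name] += 1
        mc, _ = databases[name]
        heapq.heappush(heap, (mc(alloc[name] + 1), name))
    return alloc
```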
In the logical approach to information retrieval (IR), retrieval is considered as uncertain inference. Whereas classical IR models are based on propositional logic, we combine Datalog (function-free Horn clause predicate logic) with probability theory. Here, probabilistic weights may be attached to both facts and rules. The underlying semantics extends the well-founded semantics of modularly stratified Datalog to a possible worlds semantics. By using default independence assumptions with explicit specification of disjoint events, the inference process always yields point probabilities. We describe an evaluation method and present an implementation. This approach allows for easy formulation of specific retrieval models for arbitrary applications, and classical probabilistic IR models can be implemented by specifying the appropriate rules. In comparison to other approaches, the possibility of recursive rules allows for more powerful inferences, and predicate logic gives the expressiveness required for multimedia retrieval. Furthermore, probabilistic Datalog can be used as a query language for integrated information retrieval and database systems.
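The default independence assumption can be shown with a toy computation: a conjunctive rule body multiplies the probabilities of its (assumed independent) ground facts, and alternative derivations of the same head combine by the complement rule P(A or B) = 1 - (1 - P(A))(1 - P(B)). Real probabilistic Datalog tracks event expressions so that a fact shared by several derivations is not double-counted; this sketch ignores that and is only correct when derivations use disjoint fact sets.

```python
def body_prob(facts, body):
    """P(conjunction) under independence; body is a list of fact keys,
    facts maps each key to its probability."""
    p = 1.0
    for atom in body:
        p *= facts[atom]
    return p

def head_prob(facts, derivations):
    """Combine alternative derivations (each a list of fact keys) of one head."""
    p_none = 1.0
    for body in derivations:
        p_none *= 1.0 - body_prob(facts, body)
    return 1.0 - p_none

# Example: q :- a, b.  q :- c.  with P(a)=0.5, P(b)=0.4, P(c)=0.3
facts = {"a": 0.5, "b": 0.4, "c": 0.3}
print(head_prob(facts, [["a", "b"], ["c"]]))  # 1 - (1-0.2)(1-0.3) = 0.44
```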