With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.
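As a rough illustration of the 1-D setting, the sketch below keeps substring counts above a pruning threshold and estimates the selectivity of a query from the longest retained substrings that cover it, conditioning each one on its overlap with the previous one. This is only one reading of the maximal-overlap idea, not the paper's algorithm as published. The assumptions are: a plain dictionary stands in for the suffix tree, counts are presence counts (number of strings containing the substring), and the names `build_pruned_counts`, `maximal_pieces`, `mo_estimate`, and the `unseen_count` fallback are all hypothetical.

```python
from collections import Counter


def build_pruned_counts(strings, min_count):
    """Presence counts of substrings, keeping those found in >= min_count strings."""
    counts = Counter()
    for s in strings:
        # Count each distinct substring once per string (presence count).
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        counts.update(subs)
    # Pruning by count keeps the set substring-closed: a substring is
    # contained in at least as many strings as any string that extends it.
    return {sub: c for sub, c in counts.items() if c >= min_count}


def maximal_pieces(query, pruned):
    """Return (start, end) spans of the maximal substrings of query found in pruned."""
    pieces = []
    for start in range(len(query)):
        end = start
        while end < len(query) and query[start:end + 1] in pruned:
            end += 1
        if end == start:
            end = start + 1  # character was pruned away: keep it as its own piece
        # A span is maximal only if it reaches past every earlier span.
        if not pieces or end > pieces[-1][1]:
            pieces.append((start, end))
    return pieces


def mo_estimate(query, pruned, n, unseen_count=1):
    """Estimate the fraction of the n strings that contain query.

    unseen_count is an assumed stand-in count for pieces that were pruned.
    """
    prob = 1.0
    prev_end = 0
    for start, end in maximal_pieces(query, pruned):
        piece_count = pruned.get(query[start:end], unseen_count)
        overlap = query[start:prev_end]  # empty when the pieces do not overlap
        overlap_count = pruned[overlap] if overlap else n
        # Probability of this piece given the part already accounted for.
        prob *= piece_count / overlap_count
        prev_end = end
    return prob
```

For the strings `["ab", "ab", "bc", "bc", "abc"]` with a threshold of 2, the substring `abc` is pruned, and the query `abc` is estimated from `ab` and `bc` conditioned on their overlap `b`, that is (3/5) × (3/5). Treating the two pieces as independent would instead give (3/5) × (3/5) over the root count, ignoring that both share `b`; the overlap term is what distinguishes the two when the shared part is rarer than the whole collection.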
The Distributed Information Search COmponent (DISCO) is a prototype heterogeneous distributed database that accesses underlying data sources. The DISCO prototype currently focuses on three central research problems in the context of these systems. First, since the capabilities of each data source are different, transforming queries into subqueries on data sources is difficult. We call this problem the weak data source problem. Second, since each data source performs operations in a generally unique way, the cost for performing an operation may vary radically from one wrapper to another. We call this problem the radical cost problem. Finally, existing systems behave rudely when attempting to access an unavailable data source. We call this problem the ungraceful failure problem. DISCO copes with these problems. For the weak data source problem, the database implementor defines precisely the capabilities of each data source. For the radical cost problem, the database implementor (optionally) defines cost information for some of the operations of a data source. The mediator uses this cost information to improve its cost model. To deal with ungraceful failures, queries return partial answers. A partial answer contains the part of the final answer to the query that was produced by the available data sources. The current working prototype of DISCO contains implementations of these solutions and operations over a collection of wrappers that access information both in files and on the World Wide Web.
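The partial-answer idea can be sketched as follows. This is a minimal illustration, not DISCO's implementation: the `PartialAnswer` class, the `run_query` function, and the convention that an unavailable source raises `ConnectionError` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PartialAnswer:
    """Rows produced by reachable sources, plus the sources that did not respond."""
    rows: List[dict] = field(default_factory=list)
    unavailable: List[str] = field(default_factory=list)

    @property
    def complete(self) -> bool:
        # The answer is final only when every source contributed.
        return not self.unavailable


def run_query(wrappers: Dict[str, Callable[[], List[dict]]]) -> PartialAnswer:
    """Call each (hypothetical) wrapper; record failures instead of aborting the query."""
    answer = PartialAnswer()
    for name, fetch in wrappers.items():
        try:
            answer.rows.extend(fetch())
        except ConnectionError:
            # Assumed failure signal for an unavailable data source.
            answer.unavailable.append(name)
    return answer
```

A caller can use the rows immediately and retry only the sources listed in `unavailable`, rather than re-running the whole query.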
LDAP (Lightweight Directory Access Protocol) directories have recently proliferated with the growth of the Internet, and are being used in a wide variety of network-based applications to store data such as personal profiles, address books, and network and service policies. These systems provide a means for managing heterogeneity in a way far superior to what conventional relational or object-oriented databases can offer. To achieve fast performance for declarative query answering, it is desirable to use client caching based on semantic information (instead of individual directory entries). We formally consider the problem of reusing cached LDAP directory entries for answering declarative LDAP queries. A semantic LDAP directory cache contains directory entries, which are semantically described by a set of query templates. We show that, for conjunctive queries and LDAP directory caches with positive templates, the complexity of cache-answerability is NP-complete in the size of the query. For this case, we design a sound and complete algorithm for cache-answerability based on a suite of query transformations that capture the semantics of LDAP queries. We demonstrate the practicality of this algorithm for real applications with a performance evaluation, based on sample queries from a directory enabled application at AT&T Labs. When the query templates in the cache contain negation, we show that the complexity of cache-answerability of conjunctive LDAP queries is co-NP complete in the size of the schema and query templates in the semantic description of the cache. Finally, we identify natural restrictions on the nature of the semantic descriptions for polynomial-time cache-answerability.
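To make the notion of cache-answerability concrete, the sketch below handles a deliberately narrow case: queries and templates are conjunctions of attribute-equality tests, and a query is answered from the cache only when it is contained in a single positive template. It is a sufficient check under these assumptions, not the sound and complete algorithm the abstract refers to, and it ignores LDAP base and scope. The names `satisfies` and `answer_from_cache` are hypothetical, and each template is assumed to be stored with every directory entry that matches it.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# A conjunction of (attribute, value) equality tests.
Query = FrozenSet[Tuple[str, str]]
# An entry maps each attribute to its set of values (attributes may be multi-valued).
Entry = Dict[str, Set[str]]


def satisfies(entry: Entry, query: Query) -> bool:
    """True if the entry passes every equality test in the conjunction."""
    return all(value in entry.get(attr, set()) for attr, value in query)


def answer_from_cache(
    query: Query, cache: List[Tuple[Query, List[Entry]]]
) -> Optional[List[Entry]]:
    """Answer the query from cached entries, or return None if no template covers it."""
    for template, entries in cache:
        # If every test of the template also appears in the query, each entry
        # matching the query matches the template and so is already cached.
        if template <= query:
            return [e for e in entries if satisfies(e, query)]
    return None  # not answerable by this simple check; ask the directory server
```

For example, a cache holding all entries with `dept=research` can answer the query `dept=research AND loc=nj` by filtering locally, but cannot answer `loc=nj` alone, since matching entries outside the research department were never cached.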