The distributed information search component (Disco) and the World Wide Web

Tomasic, Anthony; Amouroux, Rémy; Bonnet, Philippe; Kapitskaia, Olga; Naacke, Hubert; Raschid, Louiqa

doi:10.1145/253260.253402

Cited by 38 publications

(25 citation statements)

References 7 publications


“…In its basic motivation, our work is inspired by previous work in the integration of heterogeneous data sources, such as data sources on the Web [Levy et al 1996b; Arens et al 1996; Garcia-Molina et al 1995; Atzeni et al 1997; Tomasic et al 1997; Bayardo et al 1997]. None of these previous systems, however, include a "fuzzy" matching procedure for names; instead they construct global domains using hand-crafted domain-specific normalization schemes, or domain-specific matching algorithms [Fang et al 1994].…”

Section: Related Work (mentioning)

confidence: 99%

“…Integration of distributed, heterogeneous databases, sometimes known as data integration, is an active area of research in the database community [Duschka and Genesereth 1997b; Levy et al 1996b; Arens et al 1996; Garcia-Molina et al 1995; Tomasic et al 1997; Bayardo et al 1997]. Largely inspired by the proliferation of database-like sources on the World Wide Web, previous researchers have addressed a diverse set of problems, ranging from access to "semi-structured" information sources [Suciu 1996; Abiteboul and Vianu 1997; Suciu 1997] to combining databases with differing schemata [Levy et al 1996a; Duschka and Genesereth 1997a].…”

Section: Introduction (mentioning)

confidence: 99%


Cohen (2000). Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst.

The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Here we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language "names" for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a "soft" database management system which supports "similarity joins," based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.


“…The TSIMMIS project at Stanford addresses the problem of accessing non-standard data, notably semi-structured data, and proposes a flexible mediator-based approach [3,8]. At INRIA, the Distributed Information Search Component (DISCO) has been developed [21,22]. However, all of these prototypes focus on heterogeneous query optimization and flexible data source integration using their proprietary middleware system.…”

Section: Related Work (mentioning)

confidence: 99%

Röhm, Böhm (1999). Working together in Harmony: an implementation of the CORBA object query service and its evaluation. Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

“…Regardless of the number of cost dimensions, a centralized optimizer cannot accurately estimate the costs of operations at many autonomous sites. Garlic [23,40] and other middleware systems [24,46] address this problem by involving site-specific wrappers in the optimization process, but they do not consider the cost of communicating with these wrappers. This cost is not significant in these systems because the wrappers typically reside in the same address space as the optimizer.…”

Section: Decoupling of Cost Estimation (mentioning)

confidence: 99%

“…The query optimization work goes back as far as the early distributed database systems (R*, SDD-1, Distributed Ingres [22,14,7]), and most recently has been focused on linking data sources of various capabilities and cost models [23,30,46]. However, query optimization in the broad federated environment presents peculiarities that change the trade-offs in the optimization process quite significantly.…”

Section: Introduction (mentioning)

confidence: 99%