2004
DOI: 10.1016/j.datak.2003.08.004

Efficient similarity-based operations for data integration

Abstract: Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Although SQL operations such as grouping and join seem to be a viable approach, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper si…

Cited by 43 publications (33 citation statements)
References 18 publications
“…Common techniques are euclidian or cosinus distance in vector space [17] or the editing distance for text [9, 17, 18, 19]. In the context of this work the techniques to distances in ontologies and tree structures are of significance.…”
Section: Feature Extraction From Ontologies (mentioning)
confidence: 99%
“…They have shown that searches with one or no error perform several times better than agrep, but as they do not apply any filtering techniques, agrep outperforms their implementation for larger k. Schallehn et al [15] describe a prefix trie based index for similarity search, joins and group operations for Oracle DB. The authors introduce operators, all based on depth-first traversal, for duplicate detection in heterogenous integration scenarios that outperform non-indexed similarity operators.…”
Section: Related Work (mentioning)
confidence: 99%
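
The citation statement above mentions a prefix-trie index searched by depth-first traversal. As a rough illustration of that general idea only, the following Python sketch walks a trie depth-first while carrying one row of the edit-distance table per node, and abandons a branch once no entry in the row is within the error bound `k`. It is not the Oracle-based implementation described by the cited authors; the names `TrieNode`, `build_trie`, and `similarity_search` are hypothetical and chosen for this example.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """Hypothetical trie node; `word` is set where a stored string ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> TrieNode:
    """Insert every word character by character into a prefix trie."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def similarity_search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (word, distance) for stored words within edit distance k."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance from "" to each prefix of the query.
    first_row = list(range(len(query) + 1))

    def dfs(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Extend the edit-distance table by one row for the character `ch`.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,          # skip a query character
                           prev_row[i] + 1,         # skip a trie character
                           prev_row[i - 1] + cost))  # match or substitute
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: descend only if some alignment can still stay within k.
        if min(row) <= k:
            for next_ch, child in node.children.items():
                dfs(child, next_ch, row)

    # The empty string, if stored, sits on the root itself.
    if root.word is not None and len(query) <= k:
        results.append((root.word, len(query)))
    for ch, child in root.children.items():
        dfs(child, ch, first_row)
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["smith", "smyth", "smithe", "jones"])
    print(similarity_search(trie, "smith", 1))
```

Because strings that share a prefix share the table rows computed for that prefix, the traversal avoids recomputing them for each stored string, which is one reason an indexed search can beat a scan that compares the query against every value separately.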
“…Applications arise in duplicate detection [15], error correction [13] and data cleansing [5], to name only a few. They are also of uttermost importance in the Life Sciences.…”
Section: Introduction (mentioning)
confidence: 99%
“…More precisely, tool-links are mapped to a particular conditional join, the similarity join, in which data are joined if and only if they are very similar [17]. We considered several similarity functions based on those used by tools (Blast etc.).…”
Section: Towards a Meaning For Source-entities Paths (mentioning)
confidence: 99%
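
To make the notion of a similarity join in the statement above concrete, here is a minimal Python sketch that pairs two string lists whenever a normalized edit-distance similarity reaches a threshold. The similarity function, the threshold value, and the names `edit_distance`, `similarity`, and `similarity_join` are assumptions made for illustration; the citing paper reports using tool-specific similarity functions instead, and this nested-loop form has no index support.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(cur[j - 1] + 1,             # insertion
                           prev[j] + 1,                # deletion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1.0 for identical strings, falling toward 0.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def similarity_join(left: List[str], right: List[str],
                    threshold: float) -> List[Tuple[str, str]]:
    """Nested-loop join that keeps pairs whose similarity meets the threshold."""
    return [(l, r) for l in left for r in right if similarity(l, r) >= threshold]


if __name__ == "__main__":
    left = ["Jon Smith", "Ann Lee"]
    right = ["John Smith", "Anne Li", "Bob Ray"]
    print(similarity_join(left, right, 0.8))
```

An equality join on the same inputs would return no pairs at all, which is the failure mode the abstract attributes to plain SQL grouping and join.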