Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data 2003
DOI: 10.1145/872757.872796

Robust and efficient fuzzy match for online data cleaning

Cited by 343 publications (287 citation statements)
References 12 publications
“…An alternative to standard blocking is the sorted neighbourhood [24] approach, where records are sorted according to the values of the blocking variable, then a sliding window is moved over the sorted records, and comparisons are performed between the records within the window. Newer experimental approaches based on approximate q-gram indices [2,7] or high-dimensional clustering [29] are current research topics. The effects of blocking upon the quality and complexity of the data linkage process are discussed in more details in Section 5.…”
Section: Data Linkage Process
Citation type: mentioning
confidence: 99%
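The sorted neighbourhood approach described in the statement above can be sketched as follows. This is a minimal illustration, not code from the cited paper or from the citing work: the function name `sorted_neighbourhood_pairs`, the `key` callable standing in for the blocking variable, and the default window size are all assumptions made for the example.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Return candidate record pairs that share a sliding window.

    records: any iterable of records (hypothetical input).
    key:     function extracting the blocking-variable value from a record.
    window:  number of consecutive sorted records compared with each other.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to form any pair")

    # Step 1: sort records by the value of the blocking variable.
    ordered = sorted(records, key=key)

    # Step 2: slide a window over the sorted list; each record is paired
    # only with the (window - 1) records that follow it in sort order.
    pairs = []
    for i, record in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.append((record, other))
    return pairs


if __name__ == "__main__":
    surnames = ["smith", "smyth", "jones", "johns", "adams"]
    for left, right in sorted_neighbourhood_pairs(surnames, key=lambda s: s, window=2):
        print(left, right)
```

With n records and window size w, this produces at most n * (w - 1) candidate pairs instead of the n * (n - 1) / 2 pairs of an exhaustive comparison, which is the trade-off between linkage quality and complexity that the statement refers to.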
“…Their results on various data sets show that learned edit distance resulted in improved precision and recall results. Very similar approaches are presented in [7,30,46,47], with [30] using support vector machines for the binary classification task of record pairs. As shown in [12], combining different learned string comparison methods can result in improved linkage classification.…”
Section: Modern Approaches
Citation type: mentioning
confidence: 99%
“…Recognizing the importance of this problem, a variety of commercial products (see, e.g., [1]) and research prototypes (see, e.g., [21]) target the space of data cleaning, offering an array of techniques to identify and correct data quality problems. At a high level, data cleaning solutions can be classified into two broad categories: (a) those that operate on top of an RDBMS using an SQL interface to express and realize data cleaning tasks [16,17,18,7] and (b) those that extract the relevant data out of a database and operate on them using proprietary techniques and interfaces [1].…”
Section: Permission To Copy Without Fee All or Part Of This Materials
Citation type: mentioning
confidence: 99%
“…which abound in customer related databases. These include techniques for indexed retrieval of strings based on notions of approximate string match [7], correlating string attributes using string similarity predicates (e.g., cosine similarity, edit distance and variants thereof) [17,18,6,5] and deploying algorithms and/or rule engines for automatically correcting/transforming strings into canonical forms [1, 3,22]. These techniques will successfully match a query string (or a collection of strings) approximately (for suitably defined notions of approximate match) against the values of an attribute in a relation R. For a given query string, such techniques can tag each matching attribute value with a score quantifying the degree of similarity (closeness) of the query string to the attribute value string.…”
Section: Permission To Copy Without Fee All or Part Of This Materials
Citation type: mentioning
confidence: 99%
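The idea of tagging each matching attribute value with a similarity score can be illustrated with a small sketch. It uses Jaccard similarity over padded q-grams as a stand-in measure. This is an assumption chosen for brevity, not the similarity function or index of the cited paper, and the names `qgrams`, `jaccard`, and `score_matches` are hypothetical.

```python
def qgrams(text, q=3):
    """Return the set of q-grams of text, padded so edges are represented."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i : i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=3):
    """Jaccard similarity of the q-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # two empty gram sets are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def score_matches(query, values, threshold=0.3, q=3):
    """Score every attribute value against the query string.

    Returns (value, score) pairs whose score is at least threshold,
    ordered from most to least similar.
    """
    scored = [(value, jaccard(query, value, q)) for value in values]
    kept = [(value, score) for value, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    attribute_values = ["boeing", "beoing", "zzzz"]
    for value, score in score_matches("boeing", attribute_values, threshold=0.1):
        print(f"{value}: {score:.2f}")
```

This version scans every attribute value linearly. The indexed retrieval that the statement attributes to [7] exists precisely to avoid such a scan on large relations.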