2011
DOI: 10.1145/2000824.2000825
Efficient similarity joins for near-duplicate detection

Abstract: With the increasing amount of data and the need to integrate data from multiple sources, one challenging issue is to identify near-duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records whose similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token orde…
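The prefix filtering principle the abstract refers to can be illustrated with a small sketch (this is an illustrative Python implementation, not the authors' code): for a Jaccard threshold t, two records can only be similar enough if they share at least one token in their sorted prefixes, so indexing only the prefixes prunes most pairs.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_join(records, t):
    # Count token frequencies so tokens can be ordered rare-first;
    # rare prefix tokens make the filter more selective.
    freq = defaultdict(int)
    for r in records:
        for tok in set(r):
            freq[tok] += 1
    canon = [sorted(set(r), key=lambda tok: (freq[tok], tok)) for r in records]
    index = defaultdict(list)  # prefix token -> ids of records indexed under it
    results = []
    for i, toks in enumerate(canon):
        # Prefix length for Jaccard threshold t: if two records share no
        # token in these prefixes, their overlap cannot reach similarity t.
        plen = len(toks) - math.ceil(t * len(toks)) + 1
        cands = set()
        for tok in toks[:plen]:
            cands.update(index[tok])
            index[tok].append(i)
        for j in sorted(cands):
            if jaccard(canon[i], canon[j]) >= t:
                results.append((j, i))
    return results
```

For example, `prefix_join([["a","b","c","d"], ["a","b","c","e"], ["x","y"]], 0.5)` returns only the pair `(0, 1)`; record 2 is never compared against the others because its prefix tokens appear nowhere else.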

Cited by 355 publications (420 citation statements)
References 46 publications
“…In order to solve this problem, some duplicate detection systems used active learning techniques to automatically locate such ambiguous pairs. ALIAS [25] is a learning-based duplicate detection system that uses the idea of a "reject region" to significantly reduce the size of the training set. The work in [26] used a similar strategy, employing decision trees to learn rules for matching records with multiple fields.…”
Section: Active-Learning-Based Approaches
confidence: 99%
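The "reject region" idea this statement describes can be sketched schematically (this is not the ALIAS system itself; the band thresholds are hypothetical): pairs the classifier scores confidently are labeled automatically, and only the ambiguous middle band is sent to a human, which is what shrinks the training set.

```python
def select_ambiguous(pairs, score, low=0.3, high=0.7):
    # Reject-region selection (schematic): scores outside [low, high]
    # are treated as confident match/non-match decisions and auto-labeled;
    # scores inside the band are "rejected" to a human annotator.
    auto, ask_human = [], []
    for p in pairs:
        s = score(p)
        (ask_human if low <= s <= high else auto).append(p)
    return auto, ask_human
```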
“…Until now, there has been a great deal of research work [6][7][8][9][10][11] on duplicate detection. These works try to map duplicates between two sources, which results in C(n,2) = n(n-1)/2 instantiations of duplicate detectors across n total sources.…”
Section: Introduction
confidence: 99%
“…study the problem of how to efficiently extract the K pairs of records that are most similar to each other. In [3,28,34,36,37,39], the focus is on how to efficiently extract all records with record scores greater than a pre-specified threshold. -Method 2: Pre-specify a threshold for each individual attribute, such that each record whose attribute score over the corresponding attribute is not less than the pre-specified threshold is regarded as referring to the same entity as the search query.…”
Section: Related Work
confidence: 99%
“…In essence, the objective of these methods is to identify similar strings while scanning as few records as possible. In [1,5,13,21,23,24,34], the main approaches are based on inverted indices and a variety of effective filtering techniques. In [1,5,24], the focus is on how to skip as many strings as possible during the merging of inverted lists.…”
Section: Approximate String Search
confidence: 99%
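The inverted-list merging that these approximate string search works build on can be sketched as follows (a minimal illustration, assuming q-gram tokens and a simple count filter; the cited papers use tighter, derived bounds rather than a fixed `min_overlap`):

```python
from collections import defaultdict

def build_index(strings, q=2):
    # Inverted index: q-gram -> ids of strings containing that gram.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i+q]].add(sid)
    return index

def candidates(query, index, q=2, min_overlap=2):
    # Merge the query's q-gram lists, counting how many grams each
    # string shares with the query; strings below the count threshold
    # are skipped without any edit-distance computation.
    counts = defaultdict(int)
    for i in range(len(query) - q + 1):
        for sid in index[query[i:i+q]]:
            counts[sid] += 1
    return {sid for sid, c in counts.items() if c >= min_overlap}
```

For example, with the index built over `["hello", "help", "world"]`, the query `"hell"` yields candidates `{0, 1}`, and `"world"` is filtered out before any exact similarity check.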
“…Properly defining the similarity of each record is necessary. In [23,24,31,33-35], attribute values of the same record are concatenated into a single string, and the similarity of each record is defined using a given similarity function. …”
Section: Introduction
confidence: 99%
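The concatenate-then-compare scheme described above can be sketched in a few lines (Jaccard over whitespace tokens is used here as one common choice of similarity function; the cited works may use others, such as cosine or edit similarity):

```python
def record_similarity(rec_a, rec_b):
    # Concatenate each record's attribute values into a single string,
    # tokenize on whitespace, and compare the token sets with Jaccard.
    tokens_a = set(" ".join(rec_a).lower().split())
    tokens_b = set(" ".join(rec_b).lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

For example, `record_similarity(["John Smith", "New York"], ["John Smith", "Boston"])` gives 2 shared tokens out of 5 distinct ones, i.e. 0.4.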