2013
DOI: 10.1145/2505420.2505424

An automatic blocking strategy for XML duplicate detection

Abstract: Duplicate detection consists of finding objects that, although having different representations in a database, correspond to the same real-world entity. This is typically achieved by comparing all objects to each other, which can be infeasible for large datasets. Blocking strategies have been devised to reduce the number of objects to compare, at the cost of losing some duplicates. However, these strategies typically rely on user knowledge to discover a set of parameters that optimize the comparisons, while m…
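The blocking idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's automatic strategy; it is a generic hand-tuned blocking scheme, with a hypothetical blocking key (the first three lowercased characters of a `name` field) chosen purely for illustration. Records sharing a key land in the same block, and only within-block pairs are compared, cutting the quadratic all-pairs cost at the risk of missing duplicates that fall into different blocks.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical key: first three lowercased characters of the name.
    return record["name"].lower()[:3]

def candidate_pairs(records):
    """Group records by blocking key; yield comparison pairs only within blocks."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jon Smith"},
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "Mari Garcia"},
]

pairs = list(candidate_pairs(records))
# Blocking yields 2 candidate pairs instead of the 6 all-pairs comparisons.
print(len(pairs))  # → 2
```

The trade-off the abstract mentions is visible here: a key that is too coarse keeps the comparison count high, while one that is too fine splits true duplicates into separate blocks, which is why parameter choice (and the paper's automation of it) matters.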

Cited by 4 publications (3 citation statements) · References 15 publications
“…The focus of the research community in recent years has been mainly oriented to Entity Resolution (ER), the task of developing techniques for detecting and merging entities. A number of "integration functions" that discover and match the different structures representing the same real-world entity have been proposed [46,64,32,3,52,49,20,69,35]. Among these, rule-based and machine learning (ML) techniques are the most common.…”
Section: Data Integration and Entity Resolution
confidence: 99%
“…Thus, the data needs cleaning before it can be used for any analytics. The state of the art in duplicate detection in semi-structured data has improved significantly due to recent studies [75], [71], [73], [72]. However, the Big Data phenomenon brings a novel set of challenges for detecting duplicates in semi-structured data, such as: (1) dealing with multi-sourced heterogeneous data sets in a timely manner, and (2) enabling duplicate detection over large data sets with high throughput.…”
Section: Introduction
confidence: 99%
“…In order to optimize the duplicate detection process, some studies exploit the hierarchical structure of semi-structured objects [71], [73], [72]. A detailed overview of duplicate detection work is provided in [98].…”
Section: Introduction
confidence: 99%