2007
DOI: 10.1007/978-3-540-44918-8_6

Quality and Complexity Measures for Data Linkage and Deduplication

Abstract: Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or …

citations
Cited by 151 publications
(130 citation statements)
references
References 33 publications
“…Performance measures in this case are often defined as the functions of the number of true positives, false positives, etc [12,13]. For example, in addition to precision, recall and F-measures, the performance of a record linkage algorithm can also be measured using Accuracy = (TP + TN) / (TP + FP + TN + FN), among others [13].…”
Section: Related Work (mentioning)
confidence: 99%
“…Performance measures in this case are often defined as the functions of the number of true positives, false positives, etc [12,13]. For example, in addition to precision, recall and F-measures, the performance of a record linkage algorithm can also be measured using Accuracy = (TP + TN) / (TP + FP + TN + FN), among others [13]. In these proposals, performance scores are obtained for particular applications of a record linkage algorithm on actual data sets, and are mainly used as a mechanism to tune the parameters (e.g., matching threshold) of the algorithm.…”
Section: Related Work (mentioning)
confidence: 99%
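The measures named in the statement above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited paper: the `LinkageCounts` class, the function names, and the example counts are all hypothetical, and the counts are assumed to be tallied over compared record pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageCounts:
    """Confusion-matrix counts over compared record pairs (hypothetical helper)."""
    tp: int  # true matches classified as matches
    fp: int  # non-matches classified as matches
    tn: int  # non-matches classified as non-matches
    fn: int  # true matches classified as non-matches


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def precision(c: LinkageCounts) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: LinkageCounts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def f_measure(c: LinkageCounts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


def accuracy(c: LinkageCounts) -> float:
    # Accuracy = (TP + TN) / (TP + FP + TN + FN), as quoted above.
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


# Made-up counts for illustration only.
counts = LinkageCounts(tp=80, fp=20, tn=9880, fn=20)
for name, fn in [("precision", precision), ("recall", recall),
                 ("f_measure", f_measure), ("accuracy", accuracy)]:
    print(f"{name}: {fn(counts):.3f}")
```

With these made-up counts, precision and recall are both 0.8 while accuracy is 0.996, because the true negatives dominate the denominator.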
“…Data matching is the process of linking and aggregating records that refer to the same entity from one or more databases [2,3]. A variety of techniques for data matching have been developed in different fields in the past, and while computer scientists speak of data or record matching, or entity resolution, statisticians and health researchers refer to data or record linkage, and the database and business oriented IT communities call this process data cleaning or cleansing, ETL (extraction, transformation and loading), object identification, or merge/purge processing.…”
Section: Data Matching (mentioning)
confidence: 99%
“…Thus, the total number of potential comparisons is of quadratic complexity. On the other hand, most of these comparisons correspond to non-matches, because the maximum number of matches can only be in the order of the number of records in the smaller of the two databases to be matched (assuming these databases do not contain duplicate records) [2]. So, while the computational efforts potentially increase quadratically with the size of the databases to be matched, the maximum number of true matches only increases linearly.…”
Section: Blocking or Indexing (mentioning)
confidence: 99%
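The arithmetic behind the statement above can be checked with a short sketch. It assumes, as the quoted text does, that neither database contains duplicate records, and that every record of one database is compared with every record of the other; the function name `comparison_space` is hypothetical.

```python
def comparison_space(size_a: int, size_b: int) -> tuple[int, int]:
    """Return (potential comparisons, upper bound on true matches).

    Assumes two duplicate-free databases with size_a and size_b records.
    """
    if size_a < 0 or size_b < 0:
        raise ValueError("database sizes must be non-negative")
    total_pairs = size_a * size_b      # every record against every record
    max_matches = min(size_a, size_b)  # each record matches at most once
    return total_pairs, max_matches


for n in (1_000, 10_000, 100_000):
    pairs, matches = comparison_space(n, n)
    print(f"{n:>7,} records each: {pairs:>14,} pairs, "
          f"at most {matches:>7,} true matches")
```

Multiplying both database sizes by ten multiplies the number of candidate pairs by one hundred but the bound on true matches only by ten, which is the quadratic versus linear growth described in the quoted statement.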