Duplicate detection is the problem of detecting different entries in a data source that represent the same real-world entity. While research abounds for duplicate detection in relational data, there is still little work on duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection that divides the problem into three components: candidate definition, defining which objects are to be compared; duplicate definition, defining when two duplicate candidates are in fact duplicates; and duplicate detection, specifying how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.

XML DUPLICATE DETECTION

Duplicate detection is the problem of determining that different representations of entities in a data source actually represent the same real-world entity. The most prominent application area for duplicate detection is customer relationship management (CRM), where multiple entries for the same customer can result in multiple mailings to the same person, incorrect aggregation of sales to a customer, etc. Other application areas include bioinformatics, catalog integration, and in general any domain where independently collected data is integrated.

The problem has been addressed extensively for relational data stored in tables. However, more and more of today's data is represented in non-relational form. In particular, XML is increasingly popular, especially for data published on the Web and data exchanged between organizations. Conventional methods do not trivially adapt, so there is a need for methods that detect duplicate objects in nested XML data. XML data is semi-structured and organized hierarchically, which complicates the object identification task compared to relational data that is flat and usually well-structured. We face two problems: object definition and structural diversity.

Object definition refers to the problem of defining which data values actually describe an object, i.e., which values to consider when comparing two objects. Methods for relational duplicate detection assume that each tuple represents an object and all attribute values describe that object. Sufficiently similar data values of two tuples ...
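To make the three components concrete, the following Python sketch applies them to toy XML data. It is only an illustration: the element names, the object-description heuristic (an element's own text plus its descendants' text), the pairwise candidate enumeration, and the similarity threshold are assumptions made for this sketch, not DogmatiX's actual heuristics or similarity measure.

    # Minimal sketch of the three framework components on toy XML data.
    import xml.etree.ElementTree as ET
    from difflib import SequenceMatcher
    from itertools import combinations

    XML = """
    <movies>
      <movie><title>Troy</title><year>2004</year></movie>
      <movie><title>Troy (Director's Cut)</title><year>2004</year></movie>
      <movie><title>Alien</title><year>1979</year></movie>
    </movies>
    """

    def candidates(root, tag="movie"):
        # Candidate definition: which objects are compared (here: all pairs of <movie> elements).
        return combinations(root.iter(tag), 2)

    def description(elem):
        # Object description heuristic (assumed): the element's own text plus its descendants' text.
        parts = [elem.text or ""] + [(d.text or "") for d in elem.iter() if d is not elem]
        return " ".join(p.strip() for p in parts if p.strip())

    def is_duplicate(e1, e2, threshold=0.5):
        # Duplicate definition: string similarity of the descriptions above an assumed threshold.
        return SequenceMatcher(None, description(e1), description(e2)).ratio() >= threshold

    def detect(root):
        # Duplicate detection: naive pairwise comparison of all candidates.
        return [(a, b) for a, b in candidates(root) if is_duplicate(a, b)]

    root = ET.fromstring(XML)
    for a, b in detect(root):
        print(a.find("title").text, "<->", b.find("title").text)

Running this flags the two "Troy" entries as duplicates and leaves "Alien" alone; the paper's contribution lies in replacing the ad-hoc description heuristic, threshold, and exhaustive pairwise comparison with principled heuristics, an XML-specific similarity measure, and an efficient detection strategy.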
Fuzzy duplicate detection aims at identifying multiple representations of real-world objects stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with equal schema). Algorithms for fuzzy duplicate detection in more complex structures, e.g., hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the duplicate status of direct neighbors, e.g., children in hierarchical data, to improve duplicate detection effectiveness. In this paper, we propose a novel method for fuzzy duplicate detection in hierarchical and semi-structured XML data. Unlike previous approaches, it considers not only the duplicate status of children, but rather the probability of descendants being duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show that the proposed algorithm maintains high precision and recall, even when dealing with data containing a large amount of errors and missing information. Our proposal also outperforms a state-of-the-art duplicate detection system on three different XML databases.
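The following simplified Python recursion illustrates the underlying idea of propagating descendant duplicate probabilities upward. It is not the paper's Bayesian network: the value-similarity function, the greedy child pairing, and the equal weighting of value and descendant evidence are assumptions made purely for illustration.

    # Simplified recursion (not the actual Bayesian network) that propagates
    # duplicate probabilities of descendants up to their ancestors.
    import xml.etree.ElementTree as ET
    from difflib import SequenceMatcher

    def value_prob(e1, e2):
        # Assumed evidence: string similarity of the nodes' own text values.
        return SequenceMatcher(None, (e1.text or "").strip(), (e2.text or "").strip()).ratio()

    def duplicate_prob(e1, e2):
        # P(e1 and e2 are duplicates), combining value evidence with descendant evidence.
        p_values = value_prob(e1, e2)
        children1, children2 = list(e1), list(e2)
        if not children1 or not children2:
            return p_values
        # Greedy pairing: each child of e1 contributes its best match among e2's children;
        # the children are then treated as independent evidence (averaged).
        p_children = sum(
            max(duplicate_prob(c1, c2) for c2 in children2) for c1 in children1
        ) / len(children1)
        return 0.5 * p_values + 0.5 * p_children  # assumed equal weighting

    a = ET.fromstring("<movie><title>Troy</title><year>2004</year></movie>")
    b = ET.fromstring("<movie><title>Troy!</title><year>2004</year></movie>")
    print(round(duplicate_prob(a, b), 2))

In the actual approach, a Bayesian network encodes how the probability of two nodes being duplicates depends on the probabilities of their values and of their descendants being duplicates, and these probabilities are computed efficiently over the network rather than by an ad-hoc recursion like the one above.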