Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a dif cult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar eld entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the ef ciency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.
We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classi cation hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classi cation hierarchy. A brief evaluation is performed, and some initial conclusions are made.
Large repositories of data contain sensitive information which must be protected against unauthorized access. The protection of the confidentiality of tills information has been a long-term goal for the database security research community and the government statistical agencies. Recent advances, in data mining and machine learning algorithms, have increased the disclosure risks one may encounter when releasing data to outside parties.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.