Database de-anonymization typically involves matching an anonymized database with correlated publicly available data. Existing research focuses either on practical aspects without requiring knowledge of the data distribution yet provides limited guarantees, or on theoretical aspects assuming known distributions. This paper aims to bridge these two approaches, offering theoretical guarantees for database de-anonymization under synchronization errors and obfuscation without prior knowledge of data distribution. Using a modified replica detection algorithm and a new seeded deletion detection algorithm, we establish sufficient conditions on the database growth rate for successful matching, demonstrating a double-logarithmic seed size relative to row size is sufficient for detecting deletions in the database. Importantly, our findings indicate that these sufficient de-anonymization conditions are tight and are the same as in the distribution-aware setting, avoiding asymptotic performance loss due to unknown distributions. Finally, we evaluate the performance of our proposed algorithms through simulations, confirming their effectiveness in more practical, non-asymptotic, scenarios.
The re-identification or de-anonymization of users from anonymized data through matching with publiclyavailable correlated user data has raised privacy concerns, leading to the complementary measure of obfuscation in addition to anonymization. Recent research provides a fundamental understanding of the conditions under which privacy attacks, in the form of database matching, are successful in the presence of obfuscation. Motivated by synchronization errors stemming from the sampling of time-indexed databases, this paper presents a unified framework considering both obfuscation and synchronization errors and investigates the matching of databases under noisy entry repetitions. By investigating different structures for the repetition pattern, replica detection and seeded deletion detection algorithms are devised and sufficient and necessary conditions for successful matching are derived. Finally, the impacts of some variations of the underlying assumptions, such as adversarial deletion model, seedless database matching and zero-rate regime, on the results are discussed. Overall, our results provide insights into the privacy-preserving publication of anonymized and obfuscated time-indexed data as well as the closely-related problem of the capacity of synchronization channels.
Motivated by synchronization errors in the sampling of time-indexed databases, matching of random databases under random column repetitions (including deletions) is investigated. Column histograms are used as a permutation-invariant feature to detect the repetition pattern, whose asymptotic-uniqueness is proved using information-theoretic tools. Repetition detection is followed by a row matching scheme. Considering this overall scheme, sufficient conditions for successful database matching in terms of the database growth rate are derived. A modified version of Fano's inequality leads to a tight necessary condition for successful matching, establishing the matching capacity under column repetitions. This capacity is equal to the erasure bound, which assumes the repetition locations are known a-priori.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.