Open tools for quantitative anonymization of tabular phenotype data: literature review

Haber, Anna C; Sax, Ulrich; Praßer, Fabian

doi:10.1093/bib/bbac440

Cited by 10 publications

(11 citation statements)

References 49 publications


“…While an increasing number of examples of real-world applications of anonymization algorithms are published [ 12 , 15 , 16 ], we did not come across any investigations that measured the reproducibility (eg, by 95% CI overlap) of descriptive real-world analyses except for prior work on the GCKD study. However, several studies focusing on preserving the utility of anonymized data for descriptive real-world analyses without explicitly introducing use case–specific measures have been published.…”

Section: Discussion (mentioning)

confidence: 99%

“…Anonymization can be performed using various transformation mechanisms, such as suppression, randomization, or generalization. Software-enabled solutions have been developed with implementations of published algorithms to support this process [ 12 ]. Yet, there is an inherent trade-off between the reduction of privacy risks and the utility of the data that can be shared [ 13 ].…”

Section: Introduction (mentioning)

confidence: 99%

“…This challenge has been studied extensively in theory [ 14 ], and the evidence for utility-preserving anonymization is growing [ 12 , 15 - 18 ]. However, anonymization has not been broadly adopted in clinical practice.…”

Section: Introduction (mentioning)

confidence: 99%


The Costs of Anonymization: Case Study Using Clinical Data

Pilgram, Meurers, Malin et al. 2024

J Med Internet Res

Self Cite


Background: Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed through the application of anonymization algorithms, whereby data are altered so that it is no longer reasonably related to a person. Yet, such alterations have the potential to influence the data set’s statistical properties, such that the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not broadly been adopted in clinical practice.

Objective: The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study.

Methods: The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case–specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case–specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results.

Results: Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics. For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap over 50%. The use case–specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy.

Conclusions: Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case–specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data.

Trial Registration: German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971

International Registered Report Identifier (IRRID): RR2-10.1093/ndt/gfr456
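The abstract does not spell out how the 95% CI overlap is computed. As a rough illustration only, the sketch below uses one common definition: the length of the intersection of the two intervals, averaged relative to the length of each interval. The function name `ci_overlap` and this particular formula are assumptions for illustration, not the study's published method.

```python
def ci_overlap(original, anonymized):
    """Overlap of two confidence intervals as a fraction between 0 and 1.

    Assumed definition (not taken from the study): the length of the
    intersection, averaged relative to the length of each interval.
    Each interval is a (lower, upper) tuple.
    """
    lo_o, hi_o = original
    lo_a, hi_a = anonymized
    # Length of the intersection; zero or negative means no shared range.
    intersection = min(hi_o, hi_a) - max(lo_o, lo_a)
    if intersection <= 0:
        return 0.0
    return 0.5 * (intersection / (hi_o - lo_o) + intersection / (hi_a - lo_a))


if __name__ == "__main__":
    # Hypothetical intervals for an estimate before and after anonymization.
    print(ci_overlap((0.0, 10.0), (5.0, 15.0)))
```

Under this definition, identical intervals give 1.0 and disjoint intervals give 0.0, which matches the abstract's distinction between overlapping and nonoverlapping estimates.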


Section: Discussion (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%


“…Furthermore, one should reduce the amount of detail when it comes to meta data [205]. For example, instead of reporting a table with the exact ages of participants, a range can be reported instead. Finally, many departments and universities employ data stewards or data protection managers that can advise researchers on how to comply with local and national data sharing policies and implement FAIR data sharing principles.…”

Section: Data and Code Availability (mentioning)

confidence: 99%

“…We further recommend avoiding sharing data that is not essential for the research question or follow-up analyses but has a high disclosure risk (e.g., an unusual finding). Furthermore, one should reduce the amount of detail when it comes to meta data [205]. For example, instead of reporting a table with the exact ages of participants, a range can be reported instead.…”

Section: Step-by-step fNIRS Study Design (mentioning)

confidence: 99%
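The recommendation in the two statements above, reporting an age range instead of exact ages, is a simple form of generalization. A minimal sketch follows, assuming fixed-width bins; the function name `generalize_age` and the default width of 10 years are illustrative choices, not taken from the cited papers.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a fixed-width range label.

    The bin width is an assumption for illustration; wider bins reduce
    disclosure risk further but also reduce the detail available.
    """
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


if __name__ == "__main__":
    # Hypothetical participant ages.
    ages = [23, 37, 41, 68]
    print([generalize_age(a) for a in ages])
```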

Using preregistration as a tool for transparent fNIRS study design

et al. 2023


Significance: The expansion of functional near-infrared spectroscopy (fNIRS) methodology and analysis tools gives rise to various design and analytical decisions that researchers have to make. Several recent efforts have developed guidelines for preprocessing, analyzing, and reporting practices. For the planning stage of fNIRS studies, similar guidance is desirable. Study preregistration helps researchers to transparently document study protocols before conducting the study, including materials, methods, and analyses, and thus, others to verify, understand, and reproduce a study. Preregistration can thus serve as a useful tool for transparent, careful, and comprehensive fNIRS study design.

Aim: We aim to create a guide on the design and analysis steps involved in fNIRS studies and to provide a preregistration template specified for fNIRS studies.

Approach: The presented preregistration guide has a strong focus on fNIRS specific requirements, and the associated template provides examples based on continuous-wave (CW) fNIRS studies conducted in humans. These can, however, be extended to other types of fNIRS studies.

Results: On a step-by-step basis, we walk the fNIRS user through key methodological and analysis-related aspects central to a comprehensive fNIRS study design. These include items specific to the design of CW, task-based fNIRS studies, but also sections that are of general importance, including an in-depth elaboration on sample size planning.

Conclusions: Our guide introduces these open science tools to the fNIRS community, providing researchers with an overview of key design aspects and specification recommendations for comprehensive study planning. As such it can be used as a template to preregister fNIRS studies or merely as a tool for transparent fNIRS study design.


Künstliche Intelligenz und sichere Gesundheitsdatennutzung im Projekt KI-FDZ: Anonymisierung, Synthetisierung und sichere Verarbeitung für Real-World-Daten

Prasser, Riedel, Wolter et al. 2024

Bundesgesundheitsbl


Abstract: The increasing digitalization of the healthcare system is associated with a steadily growing volume of data, which through secondary use can yield valuable insights into diagnostics, treatment processes, and quality of care. The Health Research Data Center (Forschungsdatenzentrum Gesundheit, FDZ) is intended to provide an infrastructure for this purpose. Both the protection of patients' privacy and optimal analysis capabilities are of central importance. Artificial intelligence (AI) offers twofold potential here. On the one hand, machine learning methods enable the processing of large amounts of data and the analysis of complex relationships. On the other hand, synthetic (that is, artificial) data generated with the help of AI can protect privacy.

This article presents the KI-FDZ project, which investigates innovative technologies that can ensure the secure provision of secondary data for research purposes. A multilayered approach is examined in which measures at the data level can be combined in different ways with processing in secure environments. To this end, anonymization and synthesization methods, among others, are evaluated using 2 concrete application examples. In addition, the project examines how the creation of machine learning pipelines and the execution of AI algorithms in secure environments can be designed. Preliminary results indicate that this approach can achieve a high level of protection while maintaining high data validity. The approach investigated in the project can be an important building block for the secure secondary use of health data.

