Background: Data anonymization and sharing have become popular topics for individuals, organizations, and countries worldwide. Open-access sharing of anonymized data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions and limitations.

Objective: This study aimed to highlight the requirements and possible solutions for sharing health surveillance event history data. The challenges lie in the anonymization of multiple event dates and time-varying variables.

Methods: A sequential approach that adds noise to event dates is proposed. This approach maintains the event order and preserves the average time between events. In addition, a nosy neighbor distance-based matching approach to estimate the risk is proposed. Regarding the key variables that change over time, such as educational level or occupation, we make 2 proposals: one based on limiting the intermediate statuses of the individual and the other to achieve k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga health and demographic surveillance system (HDSS) core residency data set, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 events with time-varying socioeconomic variables and demographic information.

Results: An anonymized version of the event history data, including longitudinal information on individuals over time, with high data utility, was created.

Conclusions: The proposed anonymization of event history data comprising static and time-varying variables, applied to HDSS data, led to an acceptable disclosure risk and preserved utility, so that the data can be shared as public use data. High utility was achieved even with the highest level of noise added to the core event dates. Attention to detail is important to ensure consistency and credibility.
Importantly, the sequential noise addition approach presented in this study not only maintains the event order recorded in the original data but also maintains the time between events. We proposed an approach that preserves the data utility well but limits the number of response categories for the time-varying variables. Furthermore, using distance-based neighborhood matching, we simulated an attack under a nosy neighbor situation and under a worst-case scenario in which attackers have full information on the original data. We showed that the disclosure risk is very low, even when assuming that the attacker’s database and information are optimal. The HDSS and medical science research communities in low- and middle-income country settings will be the primary beneficiaries of the results and methods presented in this paper; however, the results will be useful for anyone working on anonymizing longitudinal event history data with time-varying variables for the purposes of sharing.
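The sequential noise idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `add_sequential_noise`, the `max_noise_days` parameter, and the choice of uniform integer noise are assumptions made here for illustration only. The sketch shifts the first event date, then perturbs each gap between consecutive events with symmetric noise that is bounded so a positive gap stays positive. This keeps the event order, and because the noise is symmetric around zero, the gaps are preserved in expectation.

```python
import random
from datetime import date, timedelta


def add_sequential_noise(event_dates, max_noise_days=30, rng=None):
    """Perturb one person's event dates while keeping their order.

    Hypothetical helper for illustration; the noise distribution and
    the bound are assumptions, not the method used in the paper.
    """
    rng = rng or random.Random()
    if not event_dates:
        return []
    dates = sorted(event_dates)

    # Shift the first event by symmetric integer noise.
    first_shift = rng.randint(-max_noise_days, max_noise_days)
    noisy = [dates[0] + timedelta(days=first_shift)]

    # Perturb each gap instead of each date, so order cannot flip.
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days
        # Symmetric bound that keeps a positive gap at least 1 day long;
        # same-day events (gap 0) are left on the same day.
        bound = max(0, min(max_noise_days, gap - 1))
        noisy_gap = gap + rng.randint(-bound, bound)
        noisy.append(noisy[-1] + timedelta(days=noisy_gap))
    return noisy


if __name__ == "__main__":
    history = [date(1998, 3, 1), date(2001, 7, 15), date(2010, 1, 2)]
    print(add_sequential_noise(history, max_noise_days=30, rng=random.Random(1)))
```

Perturbing gaps rather than individual dates is the design choice that makes order preservation automatic; adding independent noise to each date would require a separate repair step whenever two events swapped places.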
BACKGROUND: Sharing and anonymizing data have become hot topics for individuals, organizations, and countries around the world. Open-access sharing of anonymized data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions and limitations.

OBJECTIVE: The goal of this paper is to highlight solutions and requirements for sharing longitudinal health and surveillance event history data in the form of open-access data. The challenges lie in the anonymization of multiple event dates and the time-varying variables. A sequential approach that adds noise to the event dates is proposed. This approach maintains the event order and preserves the average time between events. Additionally, a nosy neighbor distance-based matching approach to estimate the risk is proposed. To deal with the key variables that change over time, such as educational level or occupation, we make two proposals: one based on limiting the intermediate statuses of a person (e.g., for education), and the other to achieve k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga Health and Demographic Surveillance System (HDSS) core dataset, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 event records with time-varying socio-economic variables and demographic information on individuals. The proposed anonymization strategy lowers the risk of disclosure to acceptable levels, thus allowing the data to be shared.

METHODS: Statistical disclosure control, k-anonymity, noise addition, disclosure risk measurement, event history data anonymization, longitudinal data anonymization, and assessment of data utility by visual comparisons.

RESULTS: An anonymized version of the event history data, including longitudinal information on individuals over time, with high data utility.
CONCLUSIONS: The proposed anonymization of study participants in event history data including static and time-varying status variables, specifically applied to longitudinal health and demographic surveillance system data, led to an anonymized data set with very low disclosure risk and high data utility, ready to be shared with the public in the form of an open-access data set. Different levels of noise for the event history dates were evaluated for disclosure risk and data utility. High utility was achieved even with the highest level of noise. Details matter to ensure consistency and credibility. Most importantly, the sequential noise approach presented in this paper maintains the event order. It has been shown that not only is the event order preserved, but the time between events is also well maintained in comparison to the original data. We also proposed an anonymization strategy to handle the time-varying educational and occupational level of a person, year of death, year of birth, and number of events of a person. We proposed an approach that preserves the data utility well but limits the number of educational and occupational levels of a person. Using distance-based neighborhood matching, we simulated an attack under a nosy neighbor situation and under a worst-case scenario in which attackers have full information on the original data. The disclosure risk was shown to be very low, even when assuming that the attacker’s database and information are optimal. The HDSS and medical science research communities in low- and middle-income country (LMIC) settings will be the primary beneficiaries of the results and methods presented in this article, but the results will be useful for anyone working on anonymizing longitudinal datasets, which may also include time-varying information and event history data, for purposes of sharing.
In other words, the proposed approaches can be applied to almost any event history data, including event history data with static variables and/or status variables whose values change over time.
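The worst-case attack scenario mentioned in the conclusions can be sketched in a few lines of Python. This is an illustrative assumption rather than the authors' procedure: the function name `matching_risk`, the use of Euclidean distance, and the representation of each record as a tuple of numbers (for example, event dates expressed as day offsets) are all choices made here. The attacker is assumed to hold the complete original data, links each anonymized record to its nearest original record, and the risk is taken as the share of links that hit the true source.

```python
def matching_risk(original, anonymized):
    """Share of anonymized records whose nearest original record is the
    true source. Record i in `anonymized` is assumed to derive from
    record i in `original`; both are sequences of numeric tuples.

    Hypothetical helper for illustration only.
    """
    def distance(a, b):
        # Euclidean distance between two numeric records.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    if not anonymized:
        return 0.0

    hits = 0
    for i, anon in enumerate(anonymized):
        # Index of the attacker's closest original record.
        nearest = min(range(len(original)),
                      key=lambda j: distance(anon, original[j]))
        if nearest == i:
            hits += 1
    return hits / len(anonymized)


if __name__ == "__main__":
    original = [(0, 10), (100, 200), (50, 60)]
    lightly_perturbed = [(1, 11), (99, 205), (52, 58)]
    print(matching_risk(original, lightly_perturbed))
```

In this toy input the perturbation is small relative to the spacing between records, so every record is re-identified; with dense real data and larger noise, many anonymized records would sit closer to someone else's original record, which lowers the estimated risk.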