Secure Record Linkage of Large Health Data Sets: Evaluation of a Hybrid Cloud Model

Brown, Adrian; Randall, Sean

doi:10.2196/18920

Cited by 4 publications

(3 citation statements)

References 31 publications

“…As Soundex is vulnerable to errors that happen at the prefix of the encoded text, the proposed protocol deploys an optimization to the algorithm by encoding the reverse of the original text with the second phonetic algorithm. Brown et al [22] presented a new hybrid cloud model for PPRL. They used containers to distribute the record linkage workload across multiple nodes.…”

Section: Related Work (mentioning)

confidence: 99%
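The reverse-encoding idea in the statement above can be illustrated with a short sketch. The excerpt does not name the "second phonetic algorithm", so this example assumes plain American Soundex for both the forward and the reversed key; the function names `soundex` and `phonetic_keys` are hypothetical and do not come from the cited protocol.

```python
def soundex(word: str) -> str:
    """American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]


def phonetic_keys(name: str) -> tuple:
    """Assumed pairing: Soundex of the name and Soundex of the reversed name."""
    return soundex(name), soundex(name[::-1])


forward_k, reverse_k = phonetic_keys("Katherine")
forward_c, reverse_c = phonetic_keys("Catherine")
# The forward keys differ because the first letter differs;
# the reversed keys agree because the variation sits at the prefix.
print(forward_k, forward_c, reverse_k, reverse_c)
```

Because Soundex keeps the first letter verbatim, a variation at the start of a name changes the forward key, while the key built from the reversed string is unaffected by it.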


An Effective Entity Resolution Approach for Big Data

El-Ghafar

El-Bastawissy

Nasr

et al. 2021

IJITEE


Entity Resolution (ER) is defined as the process of identifying records or objects that correspond to the same real-world objects or entities. To define a good ER approach, the schema of the data should be well known. In addition, schema alignment of multiple datasets is not an easy task and may require either a domain expert or an ML algorithm to select which attributes to match. Schema-agnostic meta-blocking tries to solve this problem by considering each token as a blocking key regardless of the attributes it appears in. It may also be coupled with meta-blocking to reduce the number of false negatives. However, it requires an exact match of tokens, which rarely occurs in real datasets, and it results in very low precision. To overcome these issues, we propose a novel and efficient ER approach for big data implemented in Apache Spark. The proposed approach avoids schema alignment because it treats the attributes as a bag of words and generates a set of n-grams, which is transformed into vectors. The generated vectors are compared using a chosen similarity measure. The proposed approach is generic, as it can accept all types of datasets.

It consists of five consecutive sub-modules: 1) Dataset acquisition, 2) Dataset pre-processing, 3) Setting selection criteria, where all settings of the proposed approach are selected, such as the blocking key, the significant attributes, the NLP techniques, the ER threshold, and the ER scenario, 4) ER pipeline construction, and 5) Clustering, where similar records are grouped into the same cluster. The ER pipeline can accept two types of attributes: Weighted Attributes (WA) or Compound Attributes (CA). In addition, it accepts all the settings selected in the third module. The pipeline consists of five phases. Phase 1) Generating the tokens composing the attributes. Phase 2) Generating n-grams of length n. Phase 3) Applying hashing Term Frequency (TF) to convert the n-grams to a fixed-length feature vector. Phase 4) Applying Locality Sensitive Hashing (LSH), which maps similar input items to the same buckets with a higher probability than dissimilar input items. Phase 5) Classifying the objects as duplicates or non-duplicates according to the calculated similarity between them. We introduced seven different scenarios as input to the ER pipeline. To minimize the number of comparisons, we proposed the length filter, which greatly improves the effectiveness of the proposed approach, as it achieves the highest F-measure with the existing computational resources and scales well with the available working nodes.

Three results have been revealed: 1) Using the CA in the different scenarios achieves better results than the single WA in terms of efficiency and effectiveness. 2) Scenarios 3 and 4 achieve the best running time because using Soundex and stemming reduces the running time of the proposed approach. 3) Scenario 7 achieves the highest F-measure because, by utilizing the length filter, we only compare records whose string lengths are within a pre-determined percentage of each other. LSH is used to map similar input items to the same buckets with a higher probability than dissimilar ones. It takes numHashTables as a parameter. Increasing the number of candidate pairs with the same numHashTables will reduce the accuracy of the model. Utilizing the length filter helps to minimize the number of candidates, which in turn increases the accuracy of the approach.
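The five pipeline phases and the length filter described in this abstract can be sketched in plain Python. This is not the authors' Spark implementation: the feature-vector size, the MinHash scheme used for the LSH step, Jaccard as the "chosen similarity measure", the 30% length tolerance, and all function names are assumptions made for illustration only.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

NUM_FEATURES = 1 << 12   # assumed length of the hashed feature vector
PRIME = 2_147_483_647    # modulus for the assumed MinHash functions


def ngrams(text, n=3):
    """Phases 1-2: split into tokens, then emit character n-grams per token."""
    grams = []
    for token in text.lower().split():
        if len(token) < n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def hashing_tf(grams):
    """Phase 3: sparse term-frequency vector indexed by a hash of each n-gram."""
    vec = {}
    for gram in grams:
        idx = zlib.crc32(gram.encode("utf-8")) % NUM_FEATURES
        vec[idx] = vec.get(idx, 0) + 1
    return vec


def minhash_signature(vec, coeffs):
    """Phase 4: one min-hash value per hash table over the non-zero indices."""
    if not vec:
        return []
    return [min((a * idx + b) % PRIME for idx in vec) for a, b in coeffs]


def jaccard(u, v):
    """Assumed similarity measure: Jaccard over the non-zero indices."""
    keys_u, keys_v = set(u), set(v)
    union = keys_u | keys_v
    return len(keys_u & keys_v) / len(union) if union else 0.0


def length_filter(s, t, tolerance=0.3):
    """Keep a pair only if the lengths differ by at most `tolerance` of the longer."""
    longer = max(len(s), len(t))
    return longer > 0 and abs(len(s) - len(t)) <= tolerance * longer


def resolve(records, threshold=0.5, num_hash_tables=5, seed=7):
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hash_tables)]
    vectors = [hashing_tf(ngrams(r)) for r in records]
    # Records that share a min-hash value in any table land in the same bucket.
    buckets = defaultdict(list)
    for rid, vec in enumerate(vectors):
        for table, value in enumerate(minhash_signature(vec, coeffs)):
            buckets[(table, value)].append(rid)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(ids, 2))
    # Phase 5: classify candidate pairs after applying the length filter.
    return sorted(
        (i, j) for i, j in candidates
        if length_filter(records[i], records[j])
        and jaccard(vectors[i], vectors[j]) >= threshold
    )


records = ["Elizabeth Smith 1980", "Elisabeth Smith 1980", "John Doe 1975"]
print(resolve(records))
```

In this sketch the length filter runs on the LSH candidates before the similarity is computed, so pairs whose lengths differ too much are never compared. Raising `num_hash_tables` makes it more likely that similar records share at least one bucket, at the cost of more candidate pairs.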



“…AtyImo is implemented over Apache Spark. No blocking or pruning techniques are implemented in [21], [22], [23] except for the last one as Different predicts have been analysed for blocking selection. Chen et al [24] examine the use of Spark-SQL for efficient parallel entity resolution.…”

Section: Related Work (mentioning)

confidence: 99%

An Effective Entity Resolution Approach for Big Data

El-Ghafar

El-Bastawissy

Nasr

et al. 2021

IJITEE


“…When dealing with health care data, in particular, the lack of direct identifiers often means that a privacy-preserving record linkage (PPRL) is required to link the databases [12,13]; this method ensures that no personal data are revealed in the process of combining the datasets. Due to the potential errors and variation in indirect identifiers (e.g., a patient's name which could match as "Elizabeth", "Elisabeth", or "Liz"), probabilistic privacy-preserving linkages, often using Bloom filter encoding [14,15], have shown great success in health care datasets [13,16-19]. Deterministic PPRLs, or combinations between probabilistic and deterministic algorithms, have also become more common and have had demonstrated success using healthcare data [20-22].…”

Section: Introduction (mentioning)

confidence: 99%
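The Bloom filter encoding mentioned in the statement above can be illustrated with a minimal sketch. It assumes padded character bigrams, HMAC-SHA256 as the keyed hash, a 256-bit filter with four hash functions, and the Dice coefficient for comparison; none of these parameters or function names come from the cited papers.

```python
import hashlib
import hmac


def bigrams(name):
    """Character bigrams of the padded, lower-cased name."""
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, secret, size=256, num_hashes=4):
    """Return the set of bit positions switched on for this name."""
    bits = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            # Keyed hash, so the encoding cannot be rebuilt without the secret.
            digest = hmac.new(secret, f"{k}:{gram}".encode("utf-8"),
                              hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-linkage-key"  # hypothetical key agreed between data custodians
base = bloom_encode("Elizabeth", secret)
for variant in ("Elizabeth", "Elisabeth", "Liz"):
    print(variant, round(dice(base, bloom_encode(variant, secret)), 2))
```

Names that share many bigrams switch on many of the same bit positions, so spelling variants tend to score well above unrelated names even though the linkage party only ever sees the encoded bits.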

Validating a novel deterministic privacy-preserving record linkage between administrative & clinical data: applications in stroke research

Southwell

Bronskill

Gee

et al. 2022

IJPDS


Introduction: Research data combined with administrative data provides a robust resource capable of answering unique research questions. However, in cases where personal health data are encrypted, due to ethics requirements or institutional restrictions, traditional methods of deterministic and probabilistic record linkages are not feasible. Instead, privacy-preserving record linkages must be used to protect patients' personal data during data linkage.

Objectives: To determine the feasibility and validity of a deterministic privacy preserving data linkage protocol using homomorphically encrypted data.

Methods: Feasibility was measured by the number of records that successfully matched via direct identifiers. Validity was measured by the number of records that matched with multiple indirect identifiers. The threshold for feasibility and validity were both set at 95%. The datasets shared a single, direct identifier (health card number) and multiple indirect identifiers (sex and date of birth). Direct identifiers were encrypted in both datasets and then transferred to a third-party server capable of linking the encrypted identifiers without decrypting individual records. Once linked, the study team used indirect identifiers to verify the accuracy of the linkage in the final dataset.

Results: With a combination of manual and automated data transfer in a sample of 8,128 individuals, the privacy-preserving data linkage took 36 days to match to a population sample of over 3.2 million records. 99.9% of the records were successfully matched with direct identifiers, and 99.8% successfully matched with multiple indirect identifiers. We deemed the linkage both feasible and valid.

Conclusions: As combining administrative and research data becomes increasingly common, it is imperative to understand options for linking data when direct linkage is not feasible. The current linkage process ensured the privacy and security of patient data and improved data quality. While the initial implementations required significant computational and human resources, increased automation keeps the requirements within feasible bounds.


A Review of Similarity Matching Over Encrypted Data

Shelake

Pare

2022

2022 5th International Conference on Advances in Science and Technology (ICAST)

