Population Data BC (PopData) was established as a multi-university data and education resource to support training and education, data linkage, and access to individual-level, de-identified data for research in a wide variety of areas, including human and community development and well-being. A combination of deterministic and probabilistic linkage is conducted, depending on the quality and availability of identifiers for data linkage. PopData uses a harmonized data request and approval process for data stewards and researchers to increase efficiency and ease of access to linked data. Researchers access linked data through a secure research environment (SRE) that is equipped with a wide variety of tools for analysis. The SRE also allows ongoing management and control of data. PopData continues to expand its data holdings and to evolve its services as well as its governance and data access processes.
Objectives: Currently, our organization performs probabilistic linkage, with final linkage classification established by users with expert knowledge who apply rules referencing weight and comparison outcome sets. This classification results in a perceived comprehensive linkage. Acknowledged weaknesses are variation in expert knowledge and its application; consideration of expert rules is also often time-consuming. We piloted a new approach involving a file-level summary of "positive predictive value" for weights and outcome sets. We contrast the new approach with the previous one and identify strengths and weaknesses. Approach: We resolve linkages using two approaches: the existing method, in which expert users apply rules to weight and comparison outcomes, and a recent one using positive predictive values (PPV) that reduces resolution subjectivity. The new method produces summary true positive, false positive and positive predictive values for each weight and outcome set within a file of candidate pairs above the cutoff. The top-weighted pair increments the true positive (TP) count for the weight and outcome set on the record. All other candidate pairs increment false positive (FP) counts. At the end of the file, PPV is calculated for each weight and outcome set as TP/(TP+FP). Additionally, a 0.9 PPV weight threshold is established from the summary, excluding weights with fewer than N occurrences. The approach presumes successful link [...]. Accepted links include top-ranked records whose weight is greater than the 0.9 PPV threshold and top candidates with an outcome summary PPV >= 0.9, excluding outcome sets with fewer than N occurrences. Results: In the pilot, the new method produced linkage results near par with the previously employed methods. Importantly, it establishes accepted links by a consistent methodology, allowing for increased standardization. 
The method eases identification of a threshold weight, and referencing the summary comparison outcome PPVs identifies additional confident links in larger population-based linkages that a single weight threshold may exclude. Components require minimal tuning to data characteristics, error tolerance, and result expectations. The new method cannot identify links that the legacy method established through additional processing or manual review.
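The PPV summary described in the Approach above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the tuple layout, the function names, the integer weights, and the example data are all hypothetical. The abstract does not say exactly how the 0.9 PPV weight threshold is derived from the summary; the sketch assumes it is the highest qualifying weight whose PPV falls below 0.9, so that accepted weights are strictly greater than it.

```python
from collections import defaultdict

# Hypothetical candidate pairs above the cutoff:
# (record_id, candidate_id, weight, comparison_outcome_set)
EXAMPLE_PAIRS = [
    ("r1", "a", 20, "YYY"), ("r1", "b", 8, "YNN"),
    ("r2", "c", 20, "YYY"),
    ("r3", "d", 8, "YNN"), ("r3", "e", 5, "NNY"),
    ("r4", "f", 12, "YYN"), ("r4", "g", 8, "YNN"),
    ("r5", "h", 12, "YYN"),
    ("r6", "i", 5, "NNY"),
    ("r7", "j", 8, "NYY"),
    ("r8", "k", 8, "NYY"),
]


def ppv(tp, fp):
    """Positive predictive value, TP/(TP+FP)."""
    return tp / (tp + fp)


def summarize(pairs):
    """Count TP/FP per weight and per outcome set; return the top pair per record."""
    by_record = defaultdict(list)
    for pair in pairs:
        by_record[pair[0]].append(pair)

    weight_counts = defaultdict(lambda: [0, 0])   # weight -> [TP, FP]
    outcome_counts = defaultdict(lambda: [0, 0])  # outcome set -> [TP, FP]
    top_pairs = []
    for candidates in by_record.values():
        # The top-weighted pair counts as a true positive (first one wins ties).
        top_i = max(range(len(candidates)), key=lambda i: candidates[i][2])
        top_pairs.append(candidates[top_i])
        for i, (_, _, weight, outcome) in enumerate(candidates):
            slot = 0 if i == top_i else 1  # every other candidate is a false positive
            weight_counts[weight][slot] += 1
            outcome_counts[outcome][slot] += 1
    return top_pairs, weight_counts, outcome_counts


def accept_links(pairs, min_ppv=0.9, min_n=2):
    """Accept top pairs above the weight threshold or with a high-PPV outcome set."""
    top_pairs, weight_counts, outcome_counts = summarize(pairs)

    # ASSUMPTION: threshold = highest weight (with >= min_n occurrences)
    # whose PPV is below min_ppv; weights strictly above it are accepted.
    low_weights = [w for w, (tp, fp) in weight_counts.items()
                   if tp + fp >= min_n and ppv(tp, fp) < min_ppv]
    threshold = max(low_weights, default=float("-inf"))

    # Outcome sets with enough occurrences and PPV >= min_ppv.
    good_outcomes = {o for o, (tp, fp) in outcome_counts.items()
                     if tp + fp >= min_n and ppv(tp, fp) >= min_ppv}

    return [p for p in top_pairs if p[2] > threshold or p[3] in good_outcomes]


accepted = accept_links(EXAMPLE_PAIRS, min_n=2)
accepted_ids = sorted(p[0] for p in accepted)
```

In this toy file, weight 8 has three true positives and two false positives (PPV 0.6), so the assumed threshold is 8. Records r7 and r8 sit at that weight and would be excluded by the weight rule alone, but their outcome set has a PPV of 1.0 and they are accepted through the outcome rule, which is the behaviour the abstract attributes to referencing outcome PPVs.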
Introduction: Privacy-preserving Record Linkage (PPRL) is a record linkage technique that can increase the security of personal information. PPRL uses techniques of either hashing identifiers (where exact matches are required) or Blooming identifiers (encoding them in Bloom filters, where partial matches are of interest) before they are provided for linkage. Objectives and Approach: We use LinXmart software to evaluate the performance of PPRL linkage compared to linkage using clear-text identifiers. The test linkage dataset is one that is routinely linked (N=2,672,257) at our linkage centre. The population spine (N=8,440,442) includes a record for every person who has resided in British Columbia, Canada over the past 30 years. Weights were determined using LinXmart’s implementation of the Expectation Maximization (EM) algorithm. For both linkages, accepted links were the highest-weighted candidate link with a weight above the threshold suggested by EM estimation. We compare linkage rates and quality, and differences in weight and threshold estimations, between clear-text and PPRL linkages. Results: Clear-text and PPRL methods resulted in 97% and 90% linkage rates, respectively. Approximately 67% of records in the linked datasets contained a nominally unique ID. Records with a unique ID linked at higher rates (>99% for both clear-text and PPRL), while the linkage rate for records missing the ID differed substantially (92% and 70% for clear-text and PPRL, respectively). Comparing PPRL linkage to the clear-text linkage, we obtain F-measures of 0.99 and 0.80 for records with and without the unique ID, respectively. Conclusion / Implications: Differences in linkage performance may be attributable to differences in comparison operators between the two methods. Bloomed fields compared with the Dice coefficient allow for partial matching but may not be as sensitive as clear-text string comparisons. Numerical comparisons in PPRL are exact matches, while clear-text comparisons allow for more sophisticated matching. 
Further refinements in PPRL are being explored to improve these results.
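To make the two encodings mentioned above concrete, here is a minimal sketch of keyed hashing for exact matching, and of Bloom-filter encoding compared with the Dice coefficient for partial matching. It is not LinXmart's implementation. The bigram tokenization, the filter length, the number of hash functions, the shared key, and all function names are assumptions chosen for illustration.

```python
import hashlib
import hmac


def normalize(value):
    """Trim and lower-case an identifier before encoding."""
    return value.strip().lower()


def hash_exact(value, secret="demo-key"):
    """Keyed hash of a whole identifier: only exact matches compare equal."""
    return hmac.new(secret.encode("utf-8"),
                    normalize(value).encode("utf-8"),
                    hashlib.sha256).hexdigest()


def bigrams(value):
    """Character bigrams of the padded value, e.g. '_s', 'sm', ..., 'h_'."""
    padded = f"_{normalize(value)}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, num_bits=1024, num_hashes=2, secret="demo-key"):
    """Return the set of bit positions switched on in the Bloom filter.

    ASSUMPTION: num_bits, num_hashes and the keyed-hash scheme are
    illustrative choices, not parameters taken from the source.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(secret.encode("utf-8"),
                              f"{k}|{gram}".encode("utf-8"),
                              hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def dice(bits_a, bits_b):
    """Dice coefficient of two bit-position sets: 2|A∩B| / (|A| + |B|)."""
    total = len(bits_a) + len(bits_b)
    if total == 0:
        return 0.0
    return 2 * len(bits_a & bits_b) / total


same = dice(bloom_encode("Smith"), bloom_encode("smith"))
close = dice(bloom_encode("smith"), bloom_encode("smyth"))
far = dice(bloom_encode("smith"), bloom_encode("jones"))
```

Because "smith" and "smyth" share most of their bigrams, their filters share most of their set bits and `close` lands well above `far`, whereas `hash_exact` treats the two spellings as entirely different values. This is the trade-off the abstract describes between exact and partial matching.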
Introduction: Ligo is an open source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process. Objectives and Approach: The linking application has two primary functions: identifying common entities within datasets [de-duplication] and identifying common entities between datasets [linking]. The application is being built from the ground up in a partnership between the Province of British Columbia’s Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner. Results: Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after. Conclusion/Implications: Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
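The distinction between the two functions named above, de-duplication within a dataset and linking between datasets, can be illustrated with a generic deterministic (exact-key) sketch. This is not Ligo's API or code. The record layout, field names, and function names are hypothetical, and matching on a normalized tuple of fields stands in for whatever deterministic rules an analyst would actually configure.

```python
from collections import defaultdict


def match_key(record, fields):
    """Normalized tuple of the chosen fields, used as an exact-match key."""
    return tuple(str(record[f]).strip().lower() for f in fields)


def deduplicate(records, fields):
    """Group record ids within ONE dataset that share the same key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record, fields)].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def link(left, right, fields):
    """Pair record ids BETWEEN two datasets that share the same key."""
    index = defaultdict(list)
    for record in right:
        index[match_key(record, fields)].append(record["id"])
    return [(record["id"], right_id)
            for record in left
            for right_id in index.get(match_key(record, fields), [])]


# Hypothetical example data.
left = [
    {"id": "a1", "name": "Ann Lee", "dob": "1980-01-02"},
    {"id": "a2", "name": "ann lee ", "dob": "1980-01-02"},
    {"id": "a3", "name": "Bob Ray", "dob": "1975-05-05"},
]
right = [
    {"id": "b1", "name": "ANN LEE", "dob": "1980-01-02"},
    {"id": "b2", "name": "Cy Dow", "dob": "1990-09-09"},
]

duplicates = deduplicate(left, ["name", "dob"])
links = link(left, right, ["name", "dob"])
```

Here `duplicates` groups a1 and a2 as the same entity within `left`, and `links` pairs both of them with b1 in `right`. Probabilistic and machine learning methods would replace the exact key comparison with weighted or learned comparisons.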
Objectives: Population Data BC (PopData) is an agency in British Columbia, Canada, that routinely performs linkages of various administrative and researcher-collected data to a population spine. We developed a linkage report template to increase the transparency of the linkage process and its outcomes for end users and data providers. Approach: PopData performs probabilistic and deterministic data linkage using in-house software. A literature review identified existing guidelines and examples of linkage reporting. A survey collected input from a wide range of end users about their interest in receiving linkage reports and the specific information that is important to their work. A draft template was developed by PopData’s linkage experts and data scientists and was then reviewed by PopData staff and external partners. Privacy requirements, mode of delivery, readability for the intended audience and operational feasibility were carefully considered. Results: The resulting template built on our existing internal linkage summaries. The report follows a framework suggested in the literature with three key components: 1) information on the data source and linkage fields, 2) data pre-processing and linkage methodology, and 3) linkage results, presented in tables and figures, including overall linkage rates, detail on matched fields, and the distribution of linkage weights of linked and unlinked pairs. In addition, an appendix describes the linkage methods and population spine in detail, and supplementary notes comment on unique issues related to the data where applicable. Educational materials to aid understanding of linkage methodologies and reporting are also under development. Conclusion: Linked data are increasingly used in research, making it important to provide information on linkage process and performance to the research community. 
Rigorous and standardized linkage reports produced by data centres can facilitate evaluation of the impact of linkage performance on research findings and enable transparent reporting in peer-reviewed research.