Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health, where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes, similarity metrics, and cutoff values, both to decide whether a given pair of records matches and to assess the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the difficulty of manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and to scale up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
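To make the blocking, similarity scoring, and cutoff steps described above more concrete, here is a minimal Python sketch. It is not AtyImo's implementation: the bigram Dice coefficient, the field names (`name`, `birth`, `municipality`), the 0.7/0.3 weights, and the 0.85 cutoff are all assumptions chosen for illustration, and the records are invented.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of character bigrams of a lower-cased, stripped string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Sorensen-Dice coefficient between the bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(source_a, source_b, cutoff=0.85):
    """Block on municipality, score each candidate pair, keep pairs >= cutoff.

    The weights and the cutoff are illustrative assumptions, not values
    taken from the paper.
    """
    # Blocking: only records sharing a municipality are ever compared.
    blocks = defaultdict(list)
    for rec in source_b:
        blocks[rec["municipality"]].append(rec)

    matches = []
    for a in source_a:
        for b in blocks.get(a["municipality"], []):
            score = 0.7 * dice(a["name"], b["name"]) + 0.3 * dice(a["birth"], b["birth"])
            if score >= cutoff:
                matches.append((a["id"], b["id"], round(score, 3)))
    return matches


# Invented records, for illustration only.
cohort = [
    {"id": "a1", "name": "maria silva", "birth": "1990-05-12", "municipality": "salvador"},
]
registry = [
    {"id": "b1", "name": "maria silv", "birth": "1990-05-12", "municipality": "salvador"},
    {"id": "b2", "name": "joao santos", "birth": "1985-01-30", "municipality": "salvador"},
    {"id": "b3", "name": "maria silva", "birth": "1990-05-12", "municipality": "recife"},
]

matches = link(cohort, registry)
print(matches)
```

In this toy run only the pair (`a1`, `b1`) clears the cutoff. Record `b3` is identical on name and birth date but sits in a different block, so it is never compared, which illustrates the trade-off that blocking makes between speed and missed matches.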
Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, an efficient linkage tool with reasonable accuracy and scalability is needed. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight core) computation settings. Results: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools.
In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures.
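The accuracy measures reported above (sensitivity, specificity, positive predictive value) and the idea of choosing a cut-off from a ROC curve can be sketched in a few lines of Python. The abstract does not say which criterion was used to pick the cut-off; the sketch below assumes Youden's J statistic (sensitivity + specificity - 1), which is one common choice. The scores and gold-standard labels are invented.

```python
def metrics(scores, truth, cutoff):
    """Sensitivity, specificity and PPV when pairs scoring >= cutoff are called links."""
    tp = fp = fn = tn = 0
    for score, is_match in zip(scores, truth):
        predicted = score >= cutoff
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }


def best_cutoff(scores, truth):
    """Return the observed score that maximizes Youden's J (an assumed criterion)."""
    def youden(cutoff):
        m = metrics(scores, truth, cutoff)
        return m["sensitivity"] + m["specificity"] - 1

    return max(sorted(set(scores)), key=youden)


# Invented similarity scores with gold-standard labels (True = same person).
scores = [0.99, 0.95, 0.91, 0.88, 0.80, 0.70, 0.60, 0.40]
truth = [True, True, True, True, False, True, False, False]

cutoff = best_cutoff(scores, truth)
print(cutoff, metrics(scores, truth, cutoff))
```

Each distinct observed score is tried as a candidate cut-off, which corresponds to walking along the points of the empirical ROC curve. With real data the candidate list can be very long, so a sorted single pass would be preferable to this quadratic loop.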
Background: Brazil has made great progress in reducing child mortality over the past decades, and part of this achievement has been credited to the Bolsa Família program (BFP). We examined the association between being a BFP beneficiary and child mortality (1–4 years of age), also examining how this association differs by maternal race/skin color, gestational age at birth (term versus preterm), municipality income level, and index of quality of BFP management. Methods and findings: This is a cross-sectional analysis nested within the 100 Million Brazilian Cohort, a population-based cohort primarily built from Brazil’s Unified Registry for Social Programs (Cadastro Único). We analyzed data from 6,309,366 children under 5 years of age whose families enrolled between 2006 and 2015. Through deterministic linkage with the BFP payroll datasets, and similarity linkage with the Brazilian Mortality Information System, 4,858,253 children were identified as beneficiaries (77%) and 1,451,113 (23%) were not. Our analysis consisted of a combination of kernel matching and weighted logistic regressions. After kernel matching, 5,308,989 (84.1%) children were included in the final weighted logistic analysis, with 4,107,920 (77.4%) of those being beneficiaries and 1,201,069 (22.6%) not, with a total of 14,897 linked deaths. Overall, BFP participation was associated with a reduction in child mortality (weighted odds ratio [OR] = 0.83; 95% CI: 0.79 to 0.88; p < 0.001). This association was stronger for preterm children (weighted OR = 0.78; 95% CI: 0.68 to 0.90; p < 0.001), children of Black mothers (weighted OR = 0.74; 95% CI: 0.57 to 0.97; p < 0.001), children living in municipalities in the lowest income quintile (first quintile of municipal income: weighted OR = 0.72; 95% CI: 0.62 to 0.82; p < 0.001), and municipalities with better index of BFP management (5th quintile of the Decentralized Management Index: weighted OR = 0.76; 95% CI: 0.66 to 0.88; p < 0.001).
The main limitation of our methodology is that our propensity score approach does not account for possible unmeasured confounders. Furthermore, sensitivity analysis showed that loss of nameless death records before linkage may have resulted in overestimation of the associations between BFP participation and mortality, with loss of statistical significance in municipalities with greater losses of data and change in the direction of the association in municipalities with no losses. Conclusions: In this study, we observed a significant association between BFP participation and child mortality in children aged 1–4 years and found that this association was stronger for children living in municipalities in the lowest quintile of wealth, in municipalities with better index of program management, and also in preterm children and children of Black mothers. These findings reinforce the evidence that programs like BFP, already proven effective in poverty reduction, have a great potential to improve child health and survival. Subgroup analysis revealed heterogeneous results, useful for policy improvement and better targeting of BFP.
Background: Research using linked routine population-based data collected for non-research purposes has increased in recent years because such data are a rich and detailed source of information. The objective of this study is to present an approach to prepare and link data from administrative sources in a middle-income country, to estimate its quality and to identify potential sources of bias by comparing linked and non-linked individuals. Methods: We linked two Brazilian administrative datasets covering the period 2001 to 2015: the live birth information system and the 100 Million Brazilian Cohort (created using administrative records from over 114 million individuals whose families applied for social assistance via the Unified Register for Social Programmes). The linkage used maternal attributes (name, age, date of birth, and municipality of residence) and an in-house linkage tool, CIDACS-RL. We then estimated the proportion of highly probable links and examined the characteristics of missed matches to identify any potential source of bias. Results: A total of 27,699,891 live births were submitted to linkage with maternal information recorded in the baseline of the 100 Million Brazilian Cohort dataset; of those, 16,447,414 (59.4%) children were found registered in the 100 Million Brazilian Cohort dataset. The proportion of highly probable links ranged from 39.3% in 2001 to 82.1% in 2014. A substantial improvement in the linkage was observed after the introduction of the maternal date of birth attribute in 2011. Our analyses indicated a slightly higher proportion of missing data among missed matches and a higher proportion of people living in an urban area and self-declared as Caucasian among linked pairs when compared with non-linked sets.
Discussion: We demonstrated that CIDACS-RL is capable of performing high quality linkage even with a limited number of common attributes, using indexation as a blocking strategy in large routine databases from a middle-income country. However, residual records occurred more often among people living under worse conditions. The results presented in this study reinforce the need to evaluate linkage quality and, when necessary, to take linkage error into account in the analyses of any generated dataset.
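The kind of linkage-quality check described above, an overall linked proportion followed by a comparison of characteristics between linked and non-linked records, might be sketched as follows. The two counts come from the abstract; the record layout (`linked`, `area`) and the six toy records are hypothetical and only illustrate the comparison, not the study's actual variables or results.

```python
def linked_percentage(linked, total):
    """Percentage of submitted records that were linked."""
    return 100 * linked / total


def share_by_status(records, attribute, value):
    """Share of records with attribute == value among linked and non-linked records."""
    result = {}
    for status, label in ((True, "linked"), (False, "non_linked")):
        group = [r for r in records if r["linked"] is status]
        hits = sum(1 for r in group if r.get(attribute) == value)
        result[label] = hits / len(group) if group else 0.0
    return result


# Counts reported in the abstract: linked live births out of all submitted.
overall = linked_percentage(16_447_414, 27_699_891)
print(f"{overall:.1f}% of live births linked")

# Hypothetical records, for illustration only.
records = [
    {"linked": True, "area": "urban"},
    {"linked": True, "area": "urban"},
    {"linked": True, "area": "rural"},
    {"linked": True, "area": "urban"},
    {"linked": False, "area": "rural"},
    {"linked": False, "area": "urban"},
]
shares = share_by_status(records, "area", "urban")
print(shares)
```

A large gap between the two shares for a given characteristic suggests that linkage error is not random with respect to that characteristic, which is the situation in which the abstract recommends accounting for linkage error in downstream analyses.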