Record linkage is the process of identifying records from different data sources that refer to the same entity. While most research efforts are concerned with linking individual records, new approaches have recently been proposed to link groups of records across databases. Group record linkage aims to determine whether two groups of records in two databases refer to the same entity. One application where group record linkage is of high importance is the linking of census data that contain household information across time. In this paper we propose a novel method for group record linkage based on multiple instance learning. Our method treats group links as bags and individual record links as instances. We extend multiple instance learning from bag to instance classification to reconstruct bags from candidate instances. The classified bag and instance samples lead to a significant reduction in multiple group links, thereby improving the overall quality of linked data. We evaluate our method with both synthetic data and real historical census data.
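The bag-and-instance framing can be illustrated with a minimal sketch. It assumes the standard multiple instance learning rule (a bag is positive if at least one of its instances is positive), which is not necessarily the classifier used in the paper. The names `RecordLink`, `bag_score`, and `classify_bags`, the 0.8 threshold, and all records and similarity values are hypothetical toy choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordLink:
    """One instance: a candidate link between two individual records."""
    record_a: str
    record_b: str
    similarity: float  # hypothetical attribute similarity in [0, 1]


def bag_score(instances):
    """Standard MIL assumption: a bag scores as high as its best instance."""
    return max((link.similarity for link in instances), default=0.0)


def classify_bags(bags, threshold=0.8):
    """Label each group link (bag) as a match when its score meets the threshold."""
    return {pair: bag_score(links) >= threshold for pair, links in bags.items()}


# Toy data (invented): two candidate household links, each a bag of record links.
bags = {
    ("H1_1851", "H7_1861"): [
        RecordLink("john_smith_1851", "john_smith_1861", 0.93),
        RecordLink("mary_smith_1851", "mary_smyth_1861", 0.85),
    ],
    ("H1_1851", "H9_1861"): [
        RecordLink("john_smith_1851", "john_smith_b_1861", 0.61),
    ],
}
labels = classify_bags(bags)
```

Here household H1 has two competing group links, and only the bag with a strong instance is kept, which is the kind of reduction in multiple group links the abstract refers to.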
Abstract. Linking historical census data across time is a challenging task for various reasons, including data quality, limited individual information, and changes to households over time. Although most census data linking methods link records that correspond to individual household members, recent advances show that linking households as a whole provides more accurate results and fewer multiple household links. In this paper, we introduce a graph-based method to link households, which takes the structural relationship between household members into consideration. Based on individual record linking results, our method builds a graph for each household, so that the matches are determined by both attribute-level and record-relationship similarity. Our experimental results on both synthetic and real historical census data have validated the effectiveness of this method. The proposed method achieves an F-measure of 0.937 on data extracted from real UK census datasets, outperforming all alternative methods compared.
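One way to picture a match score that combines attribute-level and record-relationship similarity is sketched below. This is an illustrative assumption, not the paper's algorithm: household members are vertices, the age gap between two members stands in for the relationship stored on an edge, and the function `household_similarity`, the weight `alpha`, and all ages and similarity values are hypothetical.

```python
from itertools import combinations


def household_similarity(ages_a, ages_b, record_links, alpha=0.5):
    """Combine vertex (attribute) and edge (relationship) similarity.

    ages_a, ages_b: member name -> age for each household.
    record_links: (member_a, member_b, attribute_similarity) triples taken
    from a prior individual record linking step.
    """
    if not record_links:
        return 0.0
    # Vertex similarity: mean attribute similarity of the linked records.
    vertex_sim = sum(sim for _, _, sim in record_links) / len(record_links)
    # Edge similarity: how well age gaps between members are preserved.
    edge_sims = []
    for (a1, b1, _), (a2, b2, _) in combinations(record_links, 2):
        gap_a = ages_a[a1] - ages_a[a2]
        gap_b = ages_b[b1] - ages_b[b2]
        edge_sims.append(1.0 / (1.0 + abs(gap_a - gap_b)))
    # A single linked record has no edges, so its edge similarity is 0.
    edge_sim = sum(edge_sims) / len(edge_sims) if edge_sims else 0.0
    return alpha * vertex_sim + (1 - alpha) * edge_sim


# Toy data (invented): the same family ten years apart, and an unrelated one.
ages_1851 = {"john": 30, "mary": 28, "ann": 2}
ages_1861 = {"john": 40, "mary": 38, "ann": 12}
ages_other = {"john": 40, "mary": 20, "ann": 30}
links = [("john", "john", 0.9), ("mary", "mary", 0.8), ("ann", "ann", 1.0)]

score_same = household_similarity(ages_1851, ages_1861, links)
score_other = household_similarity(ages_1851, ages_other, links)
```

With identical attribute similarities, the household whose age gaps are preserved scores higher, so the structural term separates candidates that record-level matching alone cannot.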
Linking historical census data is an important task for the study of the social, economic, and demographic aspects of families and society in the past. Although various (semi-)automatic linking methods have been proposed, state-of-the-art methods have only been targeted at linking records that correspond to individuals. In this paper, we introduce an automatic method aimed at linking both individuals and households across several historical census datasets. The proposed method contains several steps, including data quality analysis and enhancement, household identity detection, and individual and household record linking. We have applied this method to a set of six census datasets collected from the district of Rawtenstall in North-East Lancashire in the United Kingdom between 1851 and 1901. Experimental results show that the proposed method can greatly reduce the ambiguity arising from individual record linkage and facilitate the accurate matching of households across several decades.