There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic: they are based on how likely it is that a linking decision corresponds to a correct match, that is, one that brings together the records of the same individual in all sources. High-quality linking ensures that this probability is high; when it is not, analysis of the linked data should take account of this additional source of error. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modeling, including the important special case of estimation of subpopulation and small area means. In doing so, we consider both the robustness and the efficiency of the resulting linked data inferences.

This article is categorized under:
Algorithms and Computational Methods > Maximum Likelihood Methods
Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods
Statistical and Graphical Methods of Data Analysis > Multivariate Analysis

KEYWORDS: efficiency, exchangeable linkage error, finite population inference, linked data, regression, robust estimation
INTRODUCTION

Data linkage is now an inextricable part of how data are obtained for analysis in modern science and public administration. The classical paradigm of first identifying a well-defined target population that can provide the data of interest and then measuring the values of the relevant variables for the individuals making up this population, or from a sample taken from it, is now often replaced by a data integration approach. This first links the records for the same individuals that are stored in the many population registers that are now available and then treats the resulting linked