Abstract-National databases of standard medical procedures are commonly used to assess the quality of hospital care. Widely known examples are national databases of births. If unique personal identification numbers are available (as in Scandinavian countries), the construction of such databases is trivial from a computational point of view. However, due to privacy legislation, such identifiers are not available in all countries. Given such constraints, the construction of a national perinatal database has to rely on other patient identifiers, such as names and dates of birth. These kinds of identifiers are prone to errors. Furthermore, some jurisdictions require the encryption of personal identifiers. The resulting problem is therefore an example of Privacy Preserving Record Linkage (PPRL). This contribution describes the design considerations for a national perinatal database using data on about 600,000 births in about 1,000 hospitals. Based on simulations, recommendations for parameter settings of Bloom-filter-based PPRL are given for this real-world application.
I. BACKGROUND

For the medical assessment of German hospitals, a federal institution (GBA; for details, see www.english.g-ba.de) is obliged by law to link administrative records of more than 600,000 births yearly. The records are scattered across about 1,000 independent perinatal and neonatal units. The linked data are used for monitoring hospital performance and for epidemiological analyses such as the spatial prevalence of very low birth weights. Due to privacy regulations, the patient databases of hospitals are not linked by an electronic network. The hospitals use different electronic medical record systems but have to use the same data exchange format. All details of the data exchange are part of a mandatory regulation. Because the German health insurance system has no common unique personal identification number, other patient identifiers have to be used. Since the current regulations do not allow names in any form, encrypted or not, current linkage is based on different combinations of health insurance numbers, birth weight and hospital identifiers. Given the described constraints, only about 80% of the records can be linked [1].

From a statistical point of view, non-linked records might cause a missing data problem [2]. If the chance that a true link is missed depends on variables of interest, this is referred to as differential linkage error [3], [4]. This might result in biased estimates of causal effects and population parameters [5], [6]. Concerning our field of application, evidence of bias caused by differences between linked and non-linked maternal data sets has been published [7], [8]. The easiest way to reduce differential linkage bias is to improve the linkage rate. Therefore, using additional identifiers has been proposed to the regulatory authority [9]. As previous research has shown [10] [30]. In this paper, we will describe the design of a national perinatal database using Bloom filters for PPRL.

2 We are not aware of any published attack against encryp...