Guidance

Efficient procedures for linking datasets for the purpose of fitting statistical models

Updated 16 July 2021

Harvey Goldstein

University College London Institute of Child Health and University of Bristol

Contact: Katie Harron

1. Introduction

Traditional approaches to data linkage concentrate on producing a single ‘linked’ file based upon the matching of individual identifiers such as name and date of birth. In real datasets, and particularly administrative datasets, such identifiers may contain errors due to miscoding or mistyping. This leads to some records matching only partially, and for these, procedures have been devised for deciding whether the pattern of identifier agreement justifies declaring that a particular pair of records is a match, or whether some records should remain unmatched. The most straightforward of these procedures employ deterministic or rule-based linkage, where records are classified as a link if a specified set of identifiers agree. Others use match weights, based on the accuracy and discriminatory power of each identifier, to classify records (known as probabilistic linkage). The procedure and classifier used directly influence which records are retained in the final linked dataset: if a different linkage strategy were used, a different version of the linked file would likely be produced.

In a simple case we will have two files, which we term a primary and a secondary file, where the resulting linked file will essentially be an enhanced primary file where each record is associated with a record from the secondary file, and where some records remain unmatched. It is then up to the data analyst to decide how to deal with the unmatched cases that have incomplete records, for example by ignoring them or possibly using some imputation procedure for dealing with the missing information. In either case, it is clear that unless certain assumptions are made, any resulting analyses may be subject to bias. Thus, where incomplete records are ignored, unless the ‘failure to match completely’ is a random process, the complete records will constitute a biased sample. Likewise, imputation assumes that the analyst is able to adjust for any factors associated with non-random matching. In addition, if a threshold for accepting a partial match is used, in some cases a record that is not the correct match will be accepted as if it were, and a record that is the correct match will be rejected. Such false matches and missed matches can also lead to bias. See Moore et al. (2014), Doidge and Harron (2019), and Harron et al. (2020) for further discussion.

In this chapter we shall review the linkage process and its justification, and outline a new computationally efficient approach for deriving match weights that can be used to classify record pairs. We then consider the problem of uncertainties associated with ‘less than perfect’ matching as a ‘missing data’ problem rather than a data linkage one, and describe how a full probabilistic approach to linkage allows these uncertainties to inform the statistical analyses of linked data and hence reduce biases that arise with more traditional approaches.

2. Algorithms for linkage

We consider the case of linking two files. Approaches to linkage, based on a common set of identifiers, can be classified as those for which ‘training data’ are available and those where such data are unavailable. If a large enough subset of records is available where the linkage status is known (in other words, for a majority of the agreement patterns defined by whether the set of identifiers agree or do not agree), then this can be used to assign the remaining pairs of records. We do not consider this case further since for many datasets such training data are unavailable or unreliable. In the alternative ‘unsupervised’ situation, we shall describe different approaches in the simple case where we have p identifiers and, for any given pair of records (one from the primary file and one from the linking file), each identifier either agrees or does not. This gives rise to 2^p possible patterns. Where all identifiers agree the record pair is typically regarded as constituting a correct match, and where all disagree it is regarded as a definite non-match. The aim of unsupervised linkage is to assign, for each identifier, a weight for disagreement and a weight for agreement. Once this has been done, the match weight for every possible pair of records is computed by summing the constituent weights for the particular pattern observed. Of course, in practice we may find that a record in the primary file matches perfectly with more than one record in the linking file, or that a primary file record matches two or more records with approximately the same constituent weights. Rules can be devised to deal with such cases and we can ignore their existence for present purposes.
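
As a concrete illustration of the weight summation just described, the following minimal Python sketch forms the agreement pattern for a pair of records and sums per-identifier weights. The identifier names and the weight values are purely illustrative and are not taken from any real linkage study.

```python
# A minimal sketch: for a pair of records compared on p identifiers, form the
# agreement pattern (one of 2**p possibilities) and sum the constituent
# agreement/disagreement weights to obtain an overall match weight.

IDENTIFIERS = ["surname", "forename", "dob", "sex", "postcode"]  # p = 5 (illustrative)

# Illustrative weights: one value if the identifier agrees, another if it disagrees.
AGREE_WEIGHT = {"surname": 4.2, "forename": 3.1, "dob": 6.5, "sex": 0.7, "postcode": 3.8}
DISAGREE_WEIGHT = {"surname": -3.0, "forename": -2.4, "dob": -5.1, "sex": -0.9, "postcode": -2.2}

def agreement_pattern(rec_a: dict, rec_b: dict) -> tuple:
    """Return a tuple of 0/1 flags, one per identifier (1 = agree, 0 = disagree)."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in IDENTIFIERS)

def match_weight(pattern: tuple) -> float:
    """Sum the constituent agreement/disagreement weights for the observed pattern."""
    return sum(
        AGREE_WEIGHT[f] if flag else DISAGREE_WEIGHT[f]
        for f, flag in zip(IDENTIFIERS, pattern)
    )

rec_a = {"surname": "SMITH", "forename": "ANNA", "dob": "1990-01-02", "sex": "F", "postcode": "BS8 1TH"}
rec_b = {"surname": "SMITH", "forename": "ANNE", "dob": "1990-01-02", "sex": "F", "postcode": "BS8 1TH"}

pattern = agreement_pattern(rec_a, rec_b)   # here (1, 0, 1, 1, 1)
print(pattern, match_weight(pattern))
```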

Traditional procedures for linkage have been divided into ‘deterministic’ and ‘probabilistic’ ones. In fact, this distinction is rather misleading since all of them essentially make a ‘deterministic’ assumption (Doidge and Harron, 2018). A basic deterministic procedure comprises a set of rules for accepting a pair as a true match, arranged in a decreasing sequence intended to rank record pairs in terms of the likelihood of a match on the basis of the observed agreement/disagreement pattern. Thus, for example, where every identifier agrees this would normally be regarded as a definite match. Next might be a rule where every identifier except a very unreliable one agrees, and so on. An upper cut-off is then determined so that every record pair with a pattern above the cut-off point is accepted, and a lower cut-off is determined so that every record pair with a pattern below it is rejected. In between the two thresholds, if these are not the same, potential matches may be retained for further review. A more sophisticated version assumes that for each possible matching pattern there is an underlying probability of being a match or non-match. A so-called latent class model is then fitted to the data, resulting in a set of weights as described above, and thresholds are chosen. In all of these cases, traditionally, each possible pair of records is designated as a match or non-match.
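
A minimal sketch of the rule-based approach, with an ordered list of hypothetical rules and illustrative upper and lower cut-offs on the rule ranking, might look as follows.

```python
# A minimal sketch of rule-based (deterministic) classification with upper and
# lower cut-offs. The rules, their ordering, and the cut-off ranks are illustrative.

RULES = [
    {"surname", "forename", "dob", "sex", "postcode"},  # rank 0: all identifiers agree
    {"surname", "forename", "dob", "postcode"},         # rank 1: all except sex agree
    {"surname", "dob", "postcode"},                     # rank 2
    {"surname", "dob"},                                 # rank 3
]
UPPER_CUTOFF = 1   # ranks 0..1 are accepted as links
LOWER_CUTOFF = 2   # rank 2 is held for review; anything worse is rejected

def classify(pattern: dict) -> str:
    """pattern maps identifier name -> True/False agreement for a record pair."""
    agreeing = {f for f, ok in pattern.items() if ok}
    for rank, rule in enumerate(RULES):
        if rule <= agreeing:                  # every identifier in the rule agrees
            if rank <= UPPER_CUTOFF:
                return "link"
            if rank <= LOWER_CUTOFF:
                return "review"               # between the two thresholds
            return "non-link"
    return "non-link"

print(classify({"surname": True, "forename": True, "dob": True, "sex": False, "postcode": True}))
```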

In the next section we consider the linking algorithm itself in more detail.

3. Efficient procedures for linking two data files

Harron et al. (2015) provide a review and exposition of methods for data linkage, typically based upon a model introduced by Fellegi and Sunter (F-S) (1969), which we now describe.

The aim of traditional probabilistic record linkage methods is to determine a set of weights, or scores, for the set C of all possible pairs of records taken from the primary file (A) and the linking file (B). These weights allow a classification of the elements of C into ‘matches’, ‘non-matches’, or undecided pairs, ranked according to the estimated weights. A cut-off or threshold weight is chosen to classify the record pairs, which thus determines false-positive and false-negative error rates, in other words those record pairs falsely matched or falsely non-matched.

For each identifier or ‘field’, two conditional probabilities are defined for each record pair (iA, iB) in C. For simplicity of exposition assume that A and B contain the same set of individuals, that there are p identifiers, indexed by j, and that we measure agreement or disagreement rather than degrees of agreement. This will suffice to motivate our comments on the algorithm, although the arguments can readily be extended to other linkage scenarios. We define the probability of observing agreement on identifier j given that the two records belong to the same individual (1), and the probability of observing agreement given that they belong to different individuals (2):

mj = P(agreement on identifier j | iA = iB)   (1)

uj = P(agreement on identifier j | iA ≠ iB)   (2)

The first default assumption is now made that, for any record pair, these probabilities are independent across identifiers, so that we can write the joint probability of the observed agreement/disagreement values (y) of a pair, indexed by i, as:

P(yi1, …, yip) = π(iA = iB) ∏j mj^yij (1 − mj)^(1 − yij) + π(iA ≠ iB) ∏j uj^yij (1 − uj)^(1 − yij)   (3)

where π(iA = iB) and π(iA ≠ iB) are respectively the probabilities that iA = iB (a match) and iA ≠ iB (a non-match). The standard analysis proceeds by writing down the ‘likelihood’ for the data as:

∏i P(yi1, …, yip)   (4)

and maximising it for the mj and uj, typically using an Expectation-Maximisation (EM) algorithm (‘unsupervised’ linkage). Alternatively, where ‘training’ data are available, parameter estimates can be derived from these using the known true match status (‘supervised’ linkage).

The ratio of the probability of full agreement under the match hypothesis to that under the non-match hypothesis is then formed:

R = (∏j mj) / (∏j uj)   (5)

Equations (3) and (4) define a latent class model where the final classifier or ‘weight’ is a function of equation (5), usually log2R. These weights are then summed over the identifiers to give an overall weight to each pair, which is used to classify a pair as a ‘match’ or ‘non-match’ according to whether a suitably chosen threshold value is exceeded.
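
The sketch below illustrates one standard way of carrying out the unsupervised estimation described above: an EM algorithm for the two-class latent class model under the independence assumption, followed by computation of summed log2 weights for each record pair. The starting values and the per-identifier agreement/disagreement weight components are conventional choices for illustration, not a reproduction of any particular implementation.

```python
import numpy as np

def em_fs(Y, n_iter=200, tol=1e-8):
    """Unsupervised EM for the two-class (match / non-match) latent class model.

    Y : (n_pairs, p) array of 0/1 agreement indicators for candidate record pairs.
    Returns the per-identifier agreement probabilities m (matches) and u
    (non-matches), and the estimated proportion of pairs that are matches.
    """
    n, p = Y.shape
    m = np.full(p, 0.9)        # starting values: agreement likely for true matches
    u = np.full(p, 0.1)        # agreement unlikely for non-matching pairs
    lam = 0.01                 # starting proportion of pairs that are matches
    for _ in range(n_iter):
        pm = lam * np.prod(np.where(Y == 1, m, 1 - m), axis=1)
        pu = (1 - lam) * np.prod(np.where(Y == 1, u, 1 - u), axis=1)
        g = pm / (pm + pu)                               # E-step: P(match | pattern)
        lam = g.mean()                                   # M-step updates
        m_new = np.clip((g[:, None] * Y).sum(0) / g.sum(), 1e-6, 1 - 1e-6)
        u_new = np.clip(((1 - g)[:, None] * Y).sum(0) / (1 - g).sum(), 1e-6, 1 - 1e-6)
        converged = np.max(np.abs(m_new - m)) < tol
        m, u = m_new, u_new
        if converged:
            break
    return m, u, lam

def log2_weights(Y, m, u):
    """Summed log2 agreement/disagreement weights, one value per record pair."""
    return Y @ np.log2(m / u) + (1 - Y) @ np.log2((1 - m) / (1 - u))
```

Pairs whose summed weight exceeds a chosen threshold would then be classified as links.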

In fact, (4) does not represent a true likelihood since the elements are not strictly independent; the observed identifier patterns for the set of record pairs associated with any given individual are related as follows. Given the observed pattern for the single true match for a given record in the primary file, the probabilities associated with all other possible records from the linking file must be zero. If, for a given primary file record, there are in fact two or more ‘true’ matches that are indistinguishable in terms of the data, then again the remaining linking file record probabilities must be zero. This dependency violates the assumption of independence in the latent class model specified above. This is the second vital assumption in the method, and it implies that the usual ‘optimum’ properties associated with maximum likelihood estimation do not hold. This procedure, and similar ones, are therefore best described as alternatives to the rule-based methods: ways of assigning a ‘plausible’ set of weights according to the ‘loss function’ implied by implementing (1) and (2). Like the rule-based methods, this approach implies that a certain proportion of records will be misclassified, and this will subsequently lead to estimation errors. It is this issue that Goldstein et al. (2012) deal with, and we return to it below.

4. A new procedure for estimating linkage weights

Goldstein et al. (2017) set out an alternative procedure for estimating match weights, which does not rely upon any statistical modelling assumptions and has intuitive appeal. The motivation for this approach is that match weights effectively represent a scale, ranging from disagreement on all identifiers (low confidence that records are a match) to agreement on all identifiers (high confidence that records are a match). Records that agree on some identifiers and disagree on others have match weights that fall somewhere on the scale between these two limits. Viewed in this way, the problem could be approached using methods derived from the field of correspondence analysis.

The scaling algorithm described here is based on correspondence analysis and is related to principal component analysis but for categorical data. It makes use of methods previously described by Healy & Goldstein (1976) in the context of scales for developmental maturity based on categorical attributes. Like the traditional F-S approach, this approach aims to derive a weight for each identifier; the combination of weights across multiple identifiers can be used to classify record pairs as links or not. The procedure can be summarised as follows and is described in detail in Goldstein et al. (2017).

For each of p identifiers, j, we assume that we have several (ordered) states denoted by k = 1, …, kj, where 1 is the least agreement and kj is the greatest level of agreement between the file of interest and the linking data file. In the simple binary case that we have previously considered, there are two states (agree/not agree) for each identifier j. This gives a total of K = ∑jkj categories over all identifiers. For example, binary agreement/disagreement on each of four identifiers would give a total of K = 8 categories.

We seek to estimate a score Xjk for state k and identifier j, where the classifier for pair i now takes the form Zi = ∑jZij, and Zij = Xjk if pair i has state k for identifier j. We fix the classifier value for a definite match as ∑jXjkj = 1 and for a definite non-match as ∑jXj1 = 0. That is, the sum of the scores for each greatest level of agreement is 1 and the sum of the scores for each lowest level of agreement is 0. We seek to minimise a ‘loss function’, the total within-pair discrepancy D between scores, where:

D = ∑i di, where di is the average within-pair squared discrepancy between the weights (z) for pair i:

di = (1/p) ∑j (zij − z̄i)²,  where z̄i = Zi/p.

The scores can be derived using a straightforward and computationally efficient algorithm based on completing a matrix of counts of pairs with observed agreement patterns and solving a set of non-homogeneous linear equations. The resulting scores are analogous to the ‘weights’ defined in traditional methods as we have described above.
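
As an indication of the kind of computation involved, the following sketch minimises the loss function stated above, subject to the two constraints, by accumulating a quadratic form from the pattern counts and solving the resulting non-homogeneous linear (KKT) system. It is a sketch of the loss as written here, with invented pattern counts, and not a reproduction of the exact algorithm in Goldstein et al. (2017).

```python
import numpy as np

def scaling_scores(patterns, counts, n_states):
    """Scores X_jk minimising D = sum_i d_i subject to the constraints that the
    greatest-agreement scores sum to 1 and the least-agreement scores sum to 0.

    patterns : list of tuples giving the observed (0-based) state of each identifier
               for each distinct agreement pattern.
    counts   : number of record pairs observed with each pattern.
    n_states : number of ordered states k_j for each identifier j.
    """
    p = len(n_states)
    K = int(sum(n_states))
    offsets = np.cumsum([0] + list(n_states[:-1]))   # first column for identifier j
    C = np.eye(p) - np.ones((p, p)) / p              # within-pair centring matrix

    # Quadratic form D = X' Q X, accumulated from the pattern counts.
    Q = np.zeros((K, K))
    for pattern, n in zip(patterns, counts):
        S = np.zeros((p, K))                         # selects z_ij = X_{j, k_ij}
        for j, k in enumerate(pattern):
            S[j, offsets[j] + k] = 1.0
        Q += n * (S.T @ C @ S) / p

    # Linear constraints A X = b.
    A = np.zeros((2, K))
    for j in range(p):
        A[0, offsets[j] + n_states[j] - 1] = 1.0     # greatest-agreement states sum to 1
        A[1, offsets[j]] = 1.0                       # least-agreement states sum to 0
    b = np.array([1.0, 0.0])

    # Solve the resulting non-homogeneous linear (KKT) system.
    kkt = np.block([[2.0 * Q, A.T], [A, np.zeros((2, 2))]])
    rhs = np.concatenate([np.zeros(K), b])
    return np.linalg.lstsq(kkt, rhs, rcond=None)[0][:K]

# Binary agree/disagree on three identifiers; the counts are illustrative only.
patterns = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0)]
counts = [50, 30, 400, 5000]
print(scaling_scores(patterns, counts, n_states=[2, 2, 2]))
```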

Where an identifier value is missing, if we assume that missing data occur completely at random, or at random conditionally on values of other identifiers or other record values (Missing At Random (MAR)), then we can draw a value at random from the appropriate posterior distribution, estimated separately for each file. This forms a part of the linkage algorithm where the distribution of the values of the missing variable is estimated by conditioning on the remaining identifier variables, or even possibly on the substantive ‘payload’ variables (any auxiliary variables that predict missingness). The uncertainty introduced, at least in theory, can be incorporated by carrying out the estimation several times (for instance, 20) and allowing uncertainties to be attached to the weights and carried through to the analysis. This will be time consuming and needs to be the subject of further research. In practice, so long as the number of missing identifier values is not too large, the effect on the weights may be small and ignorable.
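
A minimal sketch of the single-draw building block, sampling a missing identifier value from its empirical distribution conditional on other identifiers in the same file under the MAR assumption, is shown below. The field names and data are hypothetical, and the repeated estimation described above is not included.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def draw_missing(df: pd.DataFrame, target: str, condition_on: list) -> pd.Series:
    """Fill missing values of `target` by sampling from its empirical distribution
    conditional on the observed values of the identifiers in `condition_on`."""
    filled = df[target].copy()
    observed = df[df[target].notna()]
    for idx, row in df[df[target].isna()].iterrows():
        # Records in the same file sharing the conditioning identifier values.
        mask = np.ones(len(observed), dtype=bool)
        for c in condition_on:
            mask &= (observed[c] == row[c]).to_numpy()
        pool = observed.loc[mask, target]
        if pool.empty:                       # fall back to the marginal distribution
            pool = observed[target]
        filled.loc[idx] = rng.choice(pool.to_numpy())
    return filled

# Illustrative file: 'sex' is sometimes missing and is drawn conditional on year of birth.
df = pd.DataFrame({
    "yob": [1990, 1990, 1990, 1985, 1985, 1985],
    "sex": ["F", "F", None, "M", "M", None],
})
df["sex"] = draw_missing(df, target="sex", condition_on=["yob"])
print(df)
```

Repeating such draws (for instance, 20 times) is what would allow the associated uncertainty to be carried through to the weights and the analysis.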

The result of the scaling algorithm is a score assigned to every possible record pair. Preliminary analysis (Goldstein et al., 2017) shows that this procedure provides weights that result in a ranking of identifiers similar to that of traditional methods (meaning that agreement on date of birth would be given a higher value than agreement on sex) and linkage results that are at least as good, although a full-scale validation has not yet been done. Furthermore, the process is computationally more efficient than traditional methods. The computational advantage lies in the fact that, apart from the final matrix manipulations, the method uses a simple counting algorithm rather than an iterative procedure, and is economical with storage. This is important for very large administrative datasets. Another benefit is that this method does not rely on training data or the EM algorithm for estimating relevant parameters.

5. Linkage as a missing data problem

Depending on the approach used to classify record pairs (such as deterministic, F-S, or the scaling method described above), different versions of a linked dataset might be produced, reflecting the uncertainty associated with imperfect identifiers. In the following section we address this as a missing data problem and outline an approach for handling this uncertainty.

By missing data, we refer to the following situation that occurs during linkage. We have a ‘primary’ file where certain variables are not measured, but are present in a secondary ‘linking’ file. The linking file typically contains a superset of the individuals in the primary file, with a common set of identifiers that are used in the matching process. For those records that are fully matched, we can transfer the required variables from the linking file to fill in the values in the primary file, for those variables only present in the linking file. Often this will result in 80% or more complete primary file records, and if we are prepared to assume that the remaining records effectively have their data values ‘MAR’, then we have a standard ‘missing data’ scenario where we can ‘impute’ the remaining 20% of missing values in these records. This then allows us to fit our statistical models and obtain valid results, using standard techniques (Carpenter and Kenward, 2013). There may also be missing values within the complete (linked) records, which can be treated similarly. In practice, the assumption of MAR may not be realistic, so an attempt to bring across as much information as possible from the linking file will generally be helpful. The point, however, is that the linking process is seen as a means to the end of obtaining valid inferences from the final set of fitted models. It is this that motivates the approach of Goldstein et al. (2012), who adopt a Bayesian approach based on matching-related linkage probabilities; in this sense it is a true probabilistic linkage method, as opposed to earlier approaches, for example those based upon the F-S algorithm, which are best viewed as sophisticated deterministic approaches.

In the typical case, information passed to the data analyst does not contain detailed information about, for example, match weights, simply whether, for each primary file record, there is a matched linking file record or not. Goldstein et al. (2012) and various others point out that the failure to represent statistical uncertainty associated with forcing each pair to be either a match or non-match in general leads to biased analyses. Thus, an improvement would be to supply the analyst with more information about records not meeting cut-off thresholds, allowing them to explore the effect of changing these thresholds. This is certainly useful, but does not resolve the problem fully, in the sense that when results are interpreted there remains uncertainty associated with any dependence of results on choice of cut-off thresholds. As a general solution Goldstein et al. (2012) proposed a true probabilistic procedure that carries over such statistical uncertainty within a formal, Bayesian framework, and that generally leads to unbiased inferences. The methodology is based around models for imputing missing data values.

The procedure described above (where multiple imputation is used to produce values for unmatched records) is based on the idea that we are more interested in carrying over the correct values for analysis than in producing one particular version of a linked dataset. However, the standard multiple imputation procedure does not make use of the information we have about candidate records for which there is some uncertainty. If, however, match weights or scores for each record pair are supplied to the analyst, then the values for the variables in the unmatched records become ‘partially observed’ rather than missing (Goldstein et al., 2009). That is, the potential values that these variables could take are informed by the values in the candidate records and their associated match weights.

The Bayesian framework for carrying over statistical uncertainty to the analysis proceeds as follows, and is described in detail in Goldstein et al. (2012). For every primary file record, we retain all candidate linking file records (typically just two or three) whose assigned weight is non-negligible, that is to say it exceeds a small threshold value, and carry these forward into the analysis stage. The set of weights carried over is rescaled to add to 1.0 and treated as ‘prior’ probabilities. Where no records exceed the threshold, a set of missing values is carried over.

We can then proceed by forming a probability distribution that is a direct function of the set of match weights for each unmatched record. This prior distribution is then combined with the (conditional) likelihood for the variables of interest, derived from the observed relationships between variables in the matched records. This combination of prior and likelihood forms an updated posterior distribution. Values can then be sampled from the posterior distribution as in the standard multiple imputation framework. This process is repeated a number of times to produce multiple imputed datasets, and parameter estimates and standard errors are then combined across the imputed datasets using Rubin’s rules (Rubin, 1987). This method is known as ‘prior-informed imputation’, since the information in candidate linking records forms a prior that is used to inform the imputation of missing values. It has been shown to provide standard errors that properly account for the uncertainty arising from linkage error (Harron et al., 2014). A further benefit is that it can avoid bias due to linkage success being related to particular characteristics of a record (such as age or socioeconomic status), since these characteristics can be included within the imputation model.
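
The following sketch illustrates this step for a single unmatched primary record and a single binary variable of interest: the candidate match weights are rescaled to prior probabilities, combined with an illustrative likelihood term assumed to come from a model fitted to the matched records, and one value is drawn from the resulting posterior. All values are invented, and a full implementation would embed this step within the imputation models described in Goldstein et al. (2012).

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate linking records for one unmatched primary record: the value each would
# contribute (e.g. a binary exposure) and its match weight. The numbers are
# illustrative, and the weights are assumed to be non-negative (e.g. already on a
# probability-like scale).
candidate_values = np.array([1, 0, 0])
candidate_weights = np.array([6.2, 4.9, 1.3])

# Rescale the weights to 'prior' probabilities that sum to 1.
prior = candidate_weights / candidate_weights.sum()

# Likelihood of each candidate value given the primary record's own data, assumed to
# come from a model fitted to the fully matched records (a single probability here).
p_exposed = 0.30                                    # assumed P(value = 1 | matched-record model)
likelihood = np.where(candidate_values == 1, p_exposed, 1 - p_exposed)

# Posterior over the candidate records, and a single imputed draw from it.
posterior = prior * likelihood
posterior /= posterior.sum()
imputed_value = candidate_values[rng.choice(len(candidate_values), p=posterior)]
print(posterior, imputed_value)
```

Repeating the draw produces the multiple imputed datasets over which estimates are combined.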

In order to maximise the usefulness of the methods described above, the procedures need to operate on a set of known identifier values. This allows, for example, measures of similarity to be incorporated into linkage algorithms (rather than agreement patterns based only on agreement/disagreement of identifiers) and thus into the analysis. This means that irreversible encryption methods that ‘hash’ identifier values are at best approximations that are not immune to biases. Thus, so-called ‘pseudonymisation at source’ procedures should be avoided, and instead replaced by secure, reversible encryption methods in which a ‘trusted third party’, but nobody else, including any of the data providers (beyond their own data), can see the true identifier values. See Goldstein and Harron (2018) for an extended discussion.

The implementation of these methods in open source software is the focus of ongoing research. A related topic is the distribution of large record-linked datasets to researchers for analysis, and the possibilities of producing ‘synthetic’ data that are sufficiently faithful to the original data structure to be useful in exploratory analyses. Recent work by Goldstein and Shlomo (2020) provides a general method for anonymisation that both protects privacy and at the same time allows data analysts to extract consistent model estimates.

In conclusion, the recent work described here promises to provide efficient, practical methods for future data linkage that avoid or minimise the biases that arise in existing implementations. Work still needs to be done to study problems that may occur in practice with very large and complex datasets. Easy-to-use software needs to be developed, and attention paid to the training of data analysts and to the ways in which data providers and data analysts can cooperate fruitfully (Gilbert et al., 2017).

6. References

Carpenter, J. and Kenward, M. (2013). Multiple Imputation and its Application. Chichester: Wiley.

Doidge, J. and K. Harron (2018). “Demystifying probabilistic linkage.” Int J Popul Data Sci 3(1).

Doidge, J. and K. Harron (2019). Linkage error bias. Int J Epidemiol. dyz203.

Fellegi, I. and Sunter, A. (1969). A Theory for Record Linkage. Journal of the American Statistical Association 64: 1183–1210. DOI:10.2307/2286061

Harron, K., et al. (2014). “Evaluating bias due to data linkage error in electronic healthcare records.” BMC Med Res Methodol 14(1): 36.

Harron, K., Goldstein, H. and Dibben, C. (Eds.) (2015). Methodological Developments in Data Linkage. Chichester: Wiley.

Harron, K., et al. (2020). “Assessing data linkage quality in cohort studies.” Annals of Human Biology 47(2): 218-226.

Healy, M. and Goldstein, H. (1976). An approach to the scaling of categorized attributes. Biometrika 63(2): 219–229.

Gilbert, R., et al. (2017). “GUILD: Guidance for Information about Linking Datasets.” J Public Health 1-8.

Goldstein, H., et al. (2009). “Multilevel models with multivariate mixed response types.” Stat Model 9(3): 173-197.

Goldstein, H., Harron, K., and Wade, A. (2012). The analysis of record linked data using multiple imputation with data value priors. Statistics in medicine, 31, DOI:10.1002/sim.5508

Goldstein, H. and Harron, K. (2018). ‘Pseudonymisation at source’ undermines accuracy of record linkage. Journal of Public Health, fdy083.

Goldstein, H. and Shlomo, N. (2020). A Probabilistic Procedure for Anonymisation, for Assessing the Risk of Re-identification and for the Analysis of Perturbed Datasets. J. Official Statistics 36(1): 89.

Moore CL, Amin J, Gidding HF, Law MG (2014) A New Method for Assessing How Sensitivity and Specificity of Linkage Studies Affects Estimation. PLoS ONE 9(7): e103690. DOI:10.1371/journal.pone.0103690

Rubin, D. (1987) Multiple imputation for nonresponse in surveys. New York: Wiley.