Guidance

Linking with anonymised data – how not to make a hash of it

Updated 16 July 2021

Rachel Shipsey and Josie Plachta, Office for National Statistics

1. Abstract

The need to share and link data between government departments is increasing as we attempt to make full use of administrative data. The motivation for this is a reduction in respondent burden and cost as the number of surveys is minimised. Data sharing is a sensitive issue and data custodians, who are personally responsible for the data and its security, may not be willing to make identifiable data available to a second party or a trusted third party. However, in some cases, the data can be shared if it is first anonymised. Working with anonymised data is a challenge from a linkage perspective, and whilst we recommend that data linkage should be carried out using variables in-the-clear wherever possible, we accept that in some instances this is not possible. Here we describe some of the challenges and limitations of linking with anonymised data and outline a method for linking hashed data that has been developed at the Office for National Statistics.

2. Introduction

Data linkage (Harron et al, 2016) is the process of deciding whether two records, in the same or different data sets, relate to the same entity. In this chapter an entity is a person. Variables in each record, such as name, date of birth and address, can be compared to determine the level of agreement between the two records and hence whether the records are a match. Such variables contain Personal Identifying Information (PII) and in some cases it is decided by the data owner, for reasons of security, that PII should not be made available. Instead, the PII variables are hashed or encrypted (as described in the section below) prior to data sharing so as to reduce the risk that a person or any of their characteristics can be identified from the data.

Secure hashing or encryption of PII variables presents a challenge to data linkers since there is no longer any relationship between similar values. Thus, we are no longer able to make use of string similarity metrics such as Jaro-Winkler or distance measures such as Levenshtein Edit Distance (see Black, 2019 for a description of these and other string comparison tools). It is also impractical to apply probabilistic techniques (Winkler, 2016) that rely on clerically reviewing samples of the record pairs generated to set a threshold for automatic acceptance. More generally, we have found that the process of developing a matching algorithm for encrypted data is slowed down by our inability to clerically review the accuracy as we develop the method.

Despite the challenges, the Office for National Statistics (ONS) have developed a method, called Derive and Conquer, for linking hashed data. It is likely that the results of linking hashed data are of poorer quality compared with linking the data in the clear. However, if data suppliers are only willing to give us hashed PII, then it is important to understand the changes in linkage rates, accuracy and bias. This will guide whether the linked data can be used in the production of official statistics.

This chapter firstly describes different methods of anonymisation, and the hashing method used in our case. It then outlines the alternative methods for linking anonymised data and proposes a new Derive and Conquer linkage method. The methods used to quality assure the linkage method are explored. Lastly, we discuss some of the limitations of linking using hashed data and why, wherever possible, it is best to link using data in the clear.

3. Methods of secure anonymisation

To prevent the disclosure of PII about individuals, the data can be anonymised so that it is unintelligible. Methods of anonymisation range from very simple techniques such as replacing all names with their initial letter or phonetic Soundex code (see Gill, 2001 for a description of Soundex), to more sophisticated cryptographic encryption or hashing algorithms (Paar et al, 2009).

There is a trade-off between the security of the anonymisation, and the amount of similarity between similar variables once anonymised. For example, in the most secure hashing and encryption algorithms there is no relation between similar but different names such as ANNE and ANNA once the algorithm has been applied. Whilst this is good for security it means that string comparator methods become obsolete, making linkage harder. On the other hand, methods of anonymisation that retain some similarity may be better for linkage but are less secure since the anonymised data may leak information about the original variables.

Encryption algorithms apply a secret encryption key to a fixed length input (the plaintext) to produce an output (the ciphertext). Anyone who has access to the corresponding decryption key can apply the decryption algorithm to the ciphertext and recover the plaintext. For a given key, encryption is a one-to-one function. Hashing algorithms take an input of any length and produce an output of fixed length called the hash. Since there are infinitely many possible inputs and only finitely many possible outputs, hashing is a many-to-one function, but the probability of a collision (two different messages producing the same hash) is very small if the algorithm is well designed. There is no decryption key that can be used to recover the original plaintext from the hash. Hashing algorithms can be used without any secret key, but if they are used for security purposes then a secret value called a ‘salt’ can be used to prevent a dictionary attack. This is where a list of names, for example from a telephone directory, is hashed to create a look-up of plaintext names and their corresponding hashes. Concatenating a random string (the ‘salt’) to the plaintext variable before hashing makes dictionary attacks ineffective, since unless the attacker knows the value of the salt they are unable to create the dictionary look-up.

For the linking with anonymised data project at ONS, the PII variables have been hashed using the Secure Hashing Algorithm SHA-256 (NIST, 2015) with 256-bit salts. This algorithm outputs a string of 64 hexadecimal characters (256 bits). In our implementation, different variables are generally hashed with different salts. However, in cases where we may need to compare the hashed values of more than one variable, these are grouped together and hashed with the same salt. For example, this allows us to compare forename with middle name to check for transposition errors.
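
As an illustration of this hashing step, the following is a minimal sketch in Python of salted SHA-256 hashing of PII variables. The salt values, cleaning rules and variable groupings shown are illustrative assumptions rather than the actual ONS implementation, which also involves strict key management on the secure platform.

```python
import hashlib
import secrets

def make_salt() -> str:
    """Return a random 256-bit salt represented as 64 hexadecimal characters."""
    return secrets.token_hex(32)

def hash_value(value: str, salt: str) -> str:
    """Clean a PII value, concatenate the salt and return the SHA-256 hash
    as a string of 64 hexadecimal characters (256 bits)."""
    cleaned = value.strip().upper()          # all parties must clean identically
    return hashlib.sha256((salt + cleaned).encode("utf-8")).hexdigest()

# Forename and middle name share a salt so their hashed values can be compared
# (for example, to check for transpositions); date of birth uses its own salt.
salt_names = make_salt()
salt_dob = make_salt()

record = {"forename": "Anne", "middle_name": "Marie", "date_of_birth": "1980-05-02"}
hashed = {
    "forename": hash_value(record["forename"], salt_names),
    "middle_name": hash_value(record["middle_name"], salt_names),
    "date_of_birth": hash_value(record["date_of_birth"], salt_dob),
}
print(hashed["forename"])   # 64 hex characters, unrelated to the hash of 'ANNA'
```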

There is a lack of entropy in some of the PII and derived variables that are hashed for the Derive and Conquer method. This could lead to a frequency attack whereby an attacker who thinks that SMITH is the most common surname in the data set, finds the most common hash value in the surname field and assumes that this is the hash value of salt+SMITH. This is thought to be an acceptable risk since knowing the hash of salt+SMITH does not help the attacker to discover the salt value, the value of any other name in the data set or the value of any of the other variables.

To manage security, only a very small number of people have access to the salts. Hashed data and in-the-clear data are not stored within the same areas of the ONS secure distributed computing platform (DAP), which is based on a Cloudera data platform (see the Cloudera website for details).

4. Alternative methods for linking anonymised data

In the past, hashed data sets have been linked at ONS using a series of match-keys. A match-key is created by putting together pieces of information to create unique keys that are then hashed and used for automated matching, with the intention of eliminating some of the errors that might otherwise prevent an automated match (in cases where there is a discrepancy in one or more of the matching variables). For example, a match-key might be constructed from the first three characters of an individual’s forename and surname, combined with their date of birth, sex and postcode district; a sketch of such a key is given below. Only a limited number of match-keys could be used due to the limited number of variables, hence it is inevitable that some true matches are missed. Furthermore, the match-keys used have to allow for quite a lot of error in variables to avoid having an unacceptably low match-rate. Therefore, it is also likely that false matches are made. Abbott et al (2016) provide further details.
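
The following is a minimal sketch of one such traditional match-key, hashed with a salt. The field names, cleaning rules and salt value are illustrative assumptions rather than the actual keys used. Records from two data sets that agree on every component of the key produce identical hashes, so matches can be found with an exact join on the hashed key.

```python
import hashlib

def postcode_district(postcode: str) -> str:
    """Outward part of a UK postcode, for example 'PO15 5RR' -> 'PO15'."""
    return postcode.strip().upper().split()[0]

def traditional_match_key(rec: dict, salt: str) -> str:
    """One match-key of the kind described above: the first three characters of
    forename and surname, plus date of birth, sex and postcode district,
    concatenated and hashed with a salt."""
    key = "|".join([
        rec["forename"].strip().upper()[:3],
        rec["surname"].strip().upper()[:3],
        rec["date_of_birth"],
        rec["sex"],
        postcode_district(rec["postcode"]),
    ])
    return hashlib.sha256((salt + key).encode("utf-8")).hexdigest()

rec = {"forename": "Annabel", "surname": "Smith", "date_of_birth": "1985-07-12",
       "sex": "F", "postcode": "PO15 5RR"}
print(traditional_match_key(rec, salt="illustrative-salt-value"))
```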

To overcome the lack of string comparator scores for names and dates of birth, ONS constructed similarity tables that could be used to implement score-based matching on anonymised data. This is described in ONS (2013). However, Culnane et al (2017) have pointed out some vulnerabilities regarding the similarity tables which may expose the data to frequency attacks or the recovery of some plaintext names.

In Schnell et al (2009), the authors describe a method of privacy preserving linkage using Bloom filters. This method provides obfuscation that still enables string comparator methods to be used. Niedermeyer et al (2014) presents a cryptanalysis of the Bloom filter method and suggests modifications to the basic method that makes it less vulnerable to cryptographic attack. However, as the authors note, further research on the cryptographic properties of modified Bloom filters and the effect of different databases on their security is desirable. As such, without this further research, we have not pursued the use of Bloom filters.

Homomorphic encryption schemes are an exciting development that enable arithmetic computations to be carried out on encrypted data. In theory this could lead to the ability to use string comparator methods on encrypted data. Unfortunately, at the current state of development, the time and computational overheads required are too large to make use of homomorphic encryption viable for our purposes. For example, Cheon et al (2015) describe an implementation of the Wagner-Fischer edit distance algorithm (Wagner, 1974) on genome sequences encrypted using a somewhat homomorphic encryption scheme. The algorithm took 27.5 seconds for two encrypted genome sequences each of length 8. The authors estimate that if the two sequences were each of length 50 then the algorithm would take one day to run! Such a time overhead makes it infeasible to use homomorphic encryption to link large data sets since millions of comparisons are required.

Finally, although not a method of linking with anonymised data, the use of trusted third parties, with separation of PII variables from analysis variables, is an approach that can work if all parties agree. Here the PII variables are separated from the analysis variables, with an ID key being retained on both variable sets. The PII from the two data sets to be linked is sent to a third party who is trusted by the holders of both data sets. The trusted third party links the two data sets together using the PII variables in-the-clear and returns the IDs of the linked records. The PII can now be discarded and the analysis variables joined back onto the anonymised IDs. This is arguably the best method, as linkage is performed with data in-the-clear, but analysts only ever see the anonymised analysis variables. However, the method relies on finding a trusted third party that satisfies both data providers, and this has not proved to be possible in our case. See Kum et al (2014) for further details of a trusted third party scheme.
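
A minimal sketch of the separation step in such a scheme is given below. The field names and the returned ID pairs are illustrative assumptions only.

```python
# Separation of PII from analysis variables, keyed by a record ID (illustrative).
source = [
    {"record_id": 101, "forename": "Anne", "surname": "Smith",
     "dob": "1985-07-12", "income_band": "B", "region": "South East"},
]

# PII variables go to the trusted third party for in-the-clear linkage.
pii_file = [{"record_id": r["record_id"], "forename": r["forename"],
             "surname": r["surname"], "dob": r["dob"]} for r in source]

# Analysis variables, keyed by the same ID, are all that analysts ever see.
analysis_file = [{"record_id": r["record_id"], "income_band": r["income_band"],
                  "region": r["region"]} for r in source]

# The third party returns linked pairs of IDs (values invented here), which are
# joined back onto the analysis files; the PII can then be discarded.
linked_ids = [(101, "B-2087")]
print(pii_file, analysis_file, linked_ids)
```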

ONS have now devised a method called Derive and Conquer which enables us to link hashed variables by making use of numerous derived variables that allow for many more combinations of errors than in the traditional match-key method. This method is described in the next section.

5. Derive and Conquer Method

The Derive and Conquer linkage method is a non-exact (that is, error-tolerant) rule-based deterministic matching method designed to link encrypted data between two data sets. It has been specifically created for linkage of person entities between Department for Work and Pensions (DWP) administrative data and ONS administrative and survey data sets. The version referenced in this document is designed for matching the DWP’s Customer Information System (CIS) data set to the National Health Service’s Personal Demographics Service (PDS) data set. It is generally expected that each record will only (correctly) match to one other record. The method operates on a distributed computing system to allow processing of millions of records and billions of comparisons in a realistic time frame. This is an important aspect of the methodology, which could not be implemented on standard computing systems.

Derive and Conquer identifies pairs of matching records between two very large data sets (approximately 100 million records in each). Calculating the match status for every record from one data set against every record in the other would be far too computationally expensive, even for the ONS’s distributed computing system. To reduce the search space a loose blocking method is used prior to linkage. This is similar to blocking that is generally performed prior to using a probabilistic method, but which is not normally necessary for a deterministic method. It is needed here because of the dynamic method of calculating derived agreement status for each source variable between possible pairs of records.
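
A minimal sketch of this kind of blocking is shown below. It assumes each record already carries a blocking key (the field name block_key is hypothetical); in the hashed setting this key would itself be a hashed derived variable, and the real implementation runs on the distributed platform rather than in plain Python.

```python
from collections import defaultdict

def candidate_pairs(ds1, ds2, key="block_key"):
    """Yield only record pairs that share a blocking key, rather than comparing
    every record in ds1 with every record in ds2."""
    index = defaultdict(list)
    for rec in ds2:
        index[rec[key]].append(rec)
    for rec1 in ds1:
        for rec2 in index.get(rec1[key], []):
            yield rec1, rec2

ds1 = [{"id": 1, "block_key": "h1"}, {"id": 2, "block_key": "h2"}]
ds2 = [{"id": "a", "block_key": "h1"}, {"id": "b", "block_key": "h3"}]
print(list(candidate_pairs(ds1, ds2)))   # only the pair that shares 'h1'
```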

Due to the inability to use distance metrics or string comparators on encrypted data, Derive and Conquer uses derived variables. These derived variables store substrings or representations of strings. For example, the first four characters of a string or the Soundex phonetic code are derived variables of a name. There are multiple derived variables for each of the source variables (first name, middle name, last name, date of birth, sex, address). For example, in our application we propose 18 derived variables based on first name. We can then look for exact agreement between these derived variables. This does mean that the derived variables must be agreed and supplied by the data owner for them to be used in linkage.
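
The sketch below illustrates the idea of derived variables for a first name, using a simplified Soundex implementation. The particular derived variables and their names are illustrative assumptions; the actual set of 18 first name variables is agreed with the data owner, and every derived value is salted and hashed before sharing.

```python
def soundex(name: str) -> str:
    """A simplified Soundex code: initial letter plus up to three digits."""
    if not name:
        return ""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    out = name[0]
    last = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        last = code
    return (out + "000")[:4]

def derive_first_name(name: str) -> dict:
    """A few illustrative derived variables for a first name; in practice there
    are many more, and each value is salted and hashed before sharing."""
    name = name.strip().upper()
    return {
        "FIRST_NAME": name,
        "FIRST_NAME_FIRST4": name[:4],
        "FIRST_NAME_INITIAL": name[:1],
        "FIRST_NAME_SOUNDEX": soundex(name),
    }

print(derive_first_name("Katherine"))
# {'FIRST_NAME': 'KATHERINE', 'FIRST_NAME_FIRST4': 'KATH',
#  'FIRST_NAME_INITIAL': 'K', 'FIRST_NAME_SOUNDEX': 'K365'}
```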

Using a traditional match-key method is impossible because having large numbers of derived variables would mean coding, running, and reviewing thousands of different match-keys. For example, using the derived variables for names in all possible combinations would result in more than 2,700 match-keys:

18 first name x 9 middle name x 17 last name variables = 2754 match-keys

If we also allow for transpositions of names, missing values, errors in date of birth, sex or address we would require approximately 10,000 match-keys to allow for all possible combinations that could represent a true match.

To reduce the number of match-keys, Derive and Conquer makes decisions based on derived agreement variables, which combine information on multiple derived variables into one. For example, a single variable of first name agreement would be populated by a value indicating whether a pair has exact, strong or weak agreement, transposed agreement, a missing variable, or no agreement found between first names. Derived variables have been reviewed clerically on in-the-clear test data to determine whether they indicate exact, strong, weak or no agreement. Having test data to be able to construct these indicators is another requirement of this method.

For example, when deriving agreement for date of birth, a match on the full date of birth indicates exact agreement, while a match on any of the derived date of birth variables indicates strong agreement. The different conditions for each level of agreement are shown in Table 1, and a sketch of how these conditions are combined into a single status follows the table.

Variable name | Derived Agreement | Description
DATE_OF_BIRTH_EXACT | Exact | The full date of birth matches. Not a derived variable
DATE_OF_BIRTH_YM | Strong | The year and month match
DATE_OF_BIRTH_YD | Strong | The year and day match
DATE_OF_BIRTH_Y_TRI_MD | Strong | The first three digits of the year match, and the month and day match
DATE_OF_BIRTH_YD_YM | Strong | The year matches and the month and day are transposed
Missing | Missing | Date of birth is missing in either one or both data sets
No agreement | None | Date of birth is present in both data sets, but none of the derived variables match

Table 1. Conditions for derived agreement status for source variable Date of Birth.
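
The sketch below shows one way the conditions in Table 1 could be collapsed into a single derived agreement status for date of birth. The abbreviated field names are hypothetical, each field is assumed to hold the hashed value of the corresponding derived variable (so only exact equality can be tested), and the transposition check is one possible reading of the DATE_OF_BIRTH_YD_YM condition.

```python
def dob_agreement(rec1: dict, rec2: dict) -> str:
    """Return the derived agreement status for date of birth between two records,
    following the conditions in Table 1 (simplified, hypothetical field names)."""
    if not rec1.get("DOB_FULL") or not rec2.get("DOB_FULL"):
        return "missing"
    if rec1["DOB_FULL"] == rec2["DOB_FULL"]:
        return "exact"
    strong = ["DOB_YM", "DOB_YD", "DOB_Y_TRI_MD"]
    if any(rec1.get(v) and rec1[v] == rec2.get(v) for v in strong):
        return "strong"
    # Month and day transposed: year+month of one record equals year+day of the
    # other (possible only because both variables were hashed with the same salt).
    if (rec1.get("DOB_YM") and rec1["DOB_YM"] == rec2.get("DOB_YD")) or \
       (rec1.get("DOB_YD") and rec1["DOB_YD"] == rec2.get("DOB_YM")):
        return "strong"
    return "none"

pair = ({"DOB_FULL": "h1", "DOB_YM": "h2", "DOB_YD": "h3", "DOB_Y_TRI_MD": "h4"},
        {"DOB_FULL": "h9", "DOB_YM": "h2", "DOB_YD": "h8", "DOB_Y_TRI_MD": "h7"})
print(dob_agreement(*pair))   # 'strong' – the year-and-month hashes agree
```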

Match-keys are then run based on derived agreement. For example, a match-key might declare a pair of records a match if they have strong agreement on surname and forename, and exact agreement on postcode and date of birth.
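
As an illustration, such a match-key can be thought of as a simple rule over the derived agreement statuses, as in the sketch below; the particular combination of agreement levels shown is illustrative rather than one of the actual match-keys.

```python
def match_key_example(agreement: dict) -> bool:
    """Declare a pair a match if forename and surname agreement is at least
    'strong' and postcode and date of birth agree exactly (illustrative rule)."""
    at_least_strong = {"exact", "strong"}
    return (agreement["forename"] in at_least_strong
            and agreement["surname"] in at_least_strong
            and agreement["postcode"] == "exact"
            and agreement["date_of_birth"] == "exact")

pair = {"forename": "strong", "surname": "exact",
        "postcode": "exact", "date_of_birth": "exact"}
print(match_key_example(pair))   # True
```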

Match-keys have differing strictness. Strict match-keys are more likely to only produce true positives but produce many false negatives, whilst looser match-keys reduce false negatives but have a higher likelihood of causing false positives. Due to the parallelised implementation, all records are passed through all match-keys. If a loose match-key produces match pairs that contradict a match decision from a stricter match-key, Derive and Conquer defers to the stricter decision and discards the looser one. The match-keys are hierarchical, match-key 1 is the strictest and match-key 44 is the loosest. This allows us to deal with non-unique matches. For example, if record A is matched to record B in match-key 1 (the strictest match-key), but record A also meets the criteria for match-key 44 with record C, only the A to B link would be retained as shown in table 2.

Record from data set 1 | Record from data set 2 | Match-key | Decision
A | B | 1 | Retain
A | C | 44 | Discard

Table 2. Dealing with non-unique, non-exact matches

It should be noted that although match-key 44 is the loosest match-key in our application, our research into derived variables indicates that it is still strict enough to keep the number of false positive matches generated within tolerable limits.

As well as non-unique matches between match-keys such as the example above, there is also the potential for non-unique matches within the same match-key. These are dealt with in two ways. Firstly, non-unique matches from match-keys requiring exact agreement are assumed to be correct matches caused by duplicate records in one or both data sets. Secondly, for non-unique matches from the looser match-keys, we discard all of the matches, as we cannot be certain which one is correct. It should be noted that if a record has been matched uniquely by a stricter match-key, the stricter match-key’s decision is retained.
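
The sketch below illustrates the hierarchical resolution idea on the Table 2 example: pairs are processed from the strictest match-key to the loosest, and any pair that contradicts a decision already made by a stricter key is discarded. It is a simplification; in particular, the handling of genuine duplicates under exact match-keys and the discarding of ties within a loose match-key are not shown.

```python
def resolve(pairs):
    """pairs: list of (record_id_1, record_id_2, match_key_number).
    Keep, for each record on either side, only the pair produced by the
    strictest (lowest-numbered) match-key."""
    pairs = sorted(pairs, key=lambda p: p[2])     # strictest match-keys first
    used_left, used_right, kept = set(), set(), []
    for left, right, key in pairs:
        if left in used_left or right in used_right:
            continue                              # contradicts a stricter decision
        kept.append((left, right, key))
        used_left.add(left)
        used_right.add(right)
    return kept

print(resolve([("A", "B", 1), ("A", "C", 44)]))   # [('A', 'B', 1)], as in Table 2
```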

The steps of the Derive and Conquer algorithm, and how the algorithm makes use of parallelisation, are shown in Figure 1. In this process, agreement is derived and match-keys are run in parallel on each record pair.

Figure 1. The steps of the Derive and Conquer algorithm

As well as our agreement level match-keys, Derive and Conquer also uses two innovative matching methods: unique biography and association of geography. These novel concepts can be applied to in-the-clear matching as well and were originally developed as part of our census matching methodology (ONS, 2019).

Unique biography uses the idea that agreement on a rare name is more indicative of a match than agreement on a common name. Within each data set, each forename, surname, and date of birth combination is compared to the rest of the data set, and any combinations that only appear once are marked as a unique biography. This is applied in the form of a match-key that looks for a matching and unique biography for both records but does not require any agreement on geography. Note that this approach is only applicable for population coverage data sets as uniqueness is not meaningful in smaller data sets.
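
A minimal sketch of flagging unique biographies within one data set is given below. The field names are illustrative, and in the hashed setting the values compared would be hashed strings.

```python
from collections import Counter

def flag_unique_biographies(records):
    """Mark each record whose forename, surname and date of birth combination
    occurs exactly once in the data set as a unique biography."""
    counts = Counter((r["forename"], r["surname"], r["dob"]) for r in records)
    for r in records:
        r["unique_biography"] = counts[(r["forename"], r["surname"], r["dob"])] == 1
    return records

data = [
    {"forename": "h_f1", "surname": "h_s1", "dob": "h_d1"},
    {"forename": "h_f1", "surname": "h_s1", "dob": "h_d1"},
    {"forename": "h_f2", "surname": "h_s2", "dob": "h_d2"},
]
print([r["unique_biography"] for r in flag_unique_biographies(data)])
# [False, False, True]
```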

The association of geography match-key looks for evidence of households who have moved to a new address. For example, four people in one data set may have the address Flat B, 67 Straight Road, Reading, RE4 TLE. All four people match to people in the second data set on name, date of birth and sex, but the address is now 7a Zigzag Street, Windsor, W8 6TH on each person’s record. The address 67 Straight Road is associated with 7a Zigzag Street more than once, which we can use as evidence of people moving to a new address. This can be incorporated into a match-key that looks for agreement on name, date of birth and sex, and an associated geography match. This match-key will not be able to tell if a single-person household has moved.
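
The sketch below illustrates the association of geography idea using the example above: among pairs already linked on name, date of birth and sex, address pairings that occur more than once are treated as evidence of a household move. The addresses are shown in the clear for readability; in practice they would be hashed.

```python
from collections import Counter

# Pairs of (address in data set 1, address in data set 2) for people already
# linked on name, date of birth and sex (illustrative values).
linked_pairs = [
    ("67 STRAIGHT ROAD", "7A ZIGZAG STREET"),
    ("67 STRAIGHT ROAD", "7A ZIGZAG STREET"),
    ("67 STRAIGHT ROAD", "7A ZIGZAG STREET"),
    ("12 OTHER LANE", "3 NEW CLOSE"),
]

address_pairs = Counter(linked_pairs)
associated = {pair for pair, n in address_pairs.items() if n > 1}
print(associated)   # only the Straight Road / Zigzag Street association is kept
```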

The output of Derive and Conquer is a list of matched record IDs from two data sets and lists of the unmatched residuals from each data set.

Clerical review is not possible on encrypted data, so this method has been refined using in-the-clear proxy data from the ONS 2011 Census and National Health Service 2011 Patient Register. The results of Derive and Conquer linkage of Census and Patient Register were comparable with the results of in-the-clear linkage. A clerical review of the linked CIS to PDS data set will be done on a sample of in-the-clear data in early 2020.

6. Method of quality analysis

To allow quality assurance and enable improvement of the Derive and Conquer method, DWP have agreed to supply ONS with samples of matched and unmatched CIS records in the clear. We will carry out a clerical review of matched records from each match-key so that we can estimate the false positive rate for the method. Using the sample of unmatched records, we will apply a probabilistic linkage method to try to find matches in the entire ONS data set. This will enable us to estimate the false negative rate. Furthermore, by clerical review of false positives and false negatives we may be able to suggest changes to the algorithm, either in terms of new derived variables or alterations to match-keys, that will improve its performance. Any changes made will then need to be tested on the hashed data sets and further quality analysis carried out to measure their effect.
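
As an illustration only, the false positive rate for each match-key could be estimated from the clerically reviewed samples along the lines of the sketch below; the counts shown are invented for the example.

```python
# Hypothetical clerical review results per match-key:
# match_key: (number of matched pairs reviewed, number judged to be false matches)
review = {
    1:  (200, 0),
    20: (200, 3),
    44: (200, 11),
}

for key, (n, false_matches) in sorted(review.items()):
    rate = false_matches / n
    print(f"match-key {key}: estimated false positive rate {rate:.1%}")
```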

In addition, we will carry out analysis on the ONS matched and unmatched records to determine the level of bias in the linkage in terms of age, sex, location and ethnicity. We will require DWP to carry out a similar bias analysis using the full CIS data set. Note that ethnicity is not a variable on either CIS or PDS, but we can link PDS to another data set containing ethnicity to estimate the ethnicity biases. We will determine whether our linked data set has the same statistical characteristics as other ONS surveys in terms of age, sex, geography and ethnicity.

We will also report on the number of duplicates, the number of links and unique links for each match-key. Where there are clusters of linked records, we will report their size and which match-keys are contributing to the clusters.

7. Key concerns and limitations

Whilst the Derive and Conquer method has been made to work (although the quality of the linkage is still to be determined), this has not been without some difficulties. Putting aside the question of linkage quality for now, there remain other limitations that may deter organisations from adopting this approach. Note, however, that most of these will also apply to other methods for linking with encrypted or hashed data.

Firstly, it is imperative that all parties, that is, the holders of the data sets to be linked, follow exactly the same protocol to clean, standardise and hash their respective data sets. This means that all parties must agree these protocols, and no party can make any changes without first gaining agreement from the others. Different technology may be used by different parties, leading to discrepancies in the outputs. For example, we found that different implementations of the Double-Metaphone function (Black, 2019) produced slightly different outputs in some cases when run on different platforms. Such discrepancies must be resolved before the variables can be hashed and compared.

There are likely to be large costs involved in the cleaning, standardisation and hashing processes that may be out of proportion to the benefit of linking the data. This is especially true in cases where the linkage variables are of very poor quality.

All data sets that need to be linked to the hashed data set must themselves be standardised, cleaned and hashed. This work cannot be distributed to the relevant business areas due to the exacting standards required. Furthermore, the hashing can only be carried out by the small number of people who have access to the salt values. In-the-clear and hashed data sets have to be kept separately from each other, which requires new spaces to be set up in the secure computing environment.

The development of a bespoke matching method using in-the-clear data involves many iterative steps whereby match-keys and parameters are tweaked, and the results are clerically reviewed to determine if the effect is to make things better or worse. Since clerical review is not possible with hashed data, we have developed the method using data sets which we had available in-the-clear. Whilst the method works well on these data sets, there is no guarantee that it performs to the same standard on the hashed CIS data. We will not know how good the method is until we have been able to perform our quality analysis. To carry out quality analysis we must have access to some data in the clear. This need will be ongoing as different data sets are linked and as the method develops. Therefore, the parties must agree beforehand that samples in-the-clear will be made available somehow. If this is not possible then there can be no assurance of the quality of the linkage and as such it should not be used to produce statistics.

All parties will be required to contribute to the quality analysis of the linkage. Samples in-the-clear will need to be produced from each data set that is hashed and linked. Furthermore, only the holder of the data set in-the-clear can carry out the necessary bias analysis.

We have used Derive and Conquer to match two large data sets. Our implementation makes billions of comparisons, made possible by our use of the ONS distributed computing platform. It is unlikely that the method would work, with data sets of this size, on a centralised platform.

There is a lack of entropy in some of the PII variables and derived variables that are hashed. This means that an attacker could associate a hash code with a particular value. However, a brute force attack (trying all possible salt values until the correct one is found) will still be required to find the secret salt value and without the salt value no other values can be discovered in the hashed data. It is therefore felt that concern regarding lack of entropy should not prevent the use of the Derive and Conquer method.

8. Benefits of matching in-the-clear

Using in-the-clear data for linkage is likely to yield a higher match-rate whilst keeping the number of false positives and false negatives to a minimum. Bespoke matching algorithms can be developed in a timely manner and the effects of changes reviewed instantly. Clerical review can be used to resolve matches which cannot be made automatically; this will improve both the match-rate and the accuracy.

Less work is required up front in terms of standardisation and cleaning. Although this work should still be carried out before linkage, it is possible for the linkage team to do this on both data sets and on the same platform, leading to fewer issues with discrepancies. Cleaning methods can be tailored to the particular quirks of the data set in question.

It is arguably less expensive to link data in-the-clear because it is quicker, does not require extra project spaces and is logistically and administratively simpler to handle. Note that storing hashed data has not meant that there is any reduction in the amount of security used, so there is no saving in terms of cost or effort with regards to security.

9. Recommendations

Whenever possible, data acquisition teams should negotiate to receive data in the clear. Where this is not possible, a substantial amount of resource needs to be made available to apply methods for linking hashed data. This applies to the department supplying the data as well as the department receiving and linking it. It must be agreed between the parties that samples of in-the-clear data are somehow made available for the purpose of quality review of the linkage. This quality review is an ongoing process as the method is honed and different data sets are linked and will involve both parties.

Ideally a well-resourced comparison of matching methods for hashed data compared with data in-the-clear should be undertaken. To date, such comparisons have not been fully undertaken and although the difference in quality between linking hashed and in-the-clear data is apparent, we do not believe that the full quality implications of linking with hashed data are understood as yet. In particular, the effect of using hashed linkage on the quality of subsequent analysis has not been studied.

10. Concluding comments

Whilst we recommend that data linkage should be carried out using variables in-the-clear, we accept that in some instances this is not possible. We hope to show that it is possible to carry out data linkage to a good enough standard in some instances when linkage variables are hashed. However, this comes at a cost in terms of finance and speed, and requires a great deal of effort from all parties, who will have to work together at every step. In some instances, when the quality of the linkage variables is particularly poor, linking with hashed data is unlikely to yield results which meet targets for accuracy, or else will have unacceptably low match-rates. In such cases the results of the linkage cannot be used to produce reliable statistics and hence there is little point proceeding with such a costly exercise.

11. References

Abbott O, Ralphs M, Jones P. (2016). Large-scale linkage for total populations in official statistics (accessed 06.03.20), chapter 8 in Harron et al (2016)

Black P (2019), Dictionary of Algorithms and Data Structures (accessed 06.03.20)

Cheon J, Kim M, Lauter K (2015). Homomorphic Computation of Edit Distance (accessed 08.06.20)

Cloudera. (accessed 06.03.20)

Culnane C, Rubinstein B, Teague V. (2017) Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK Office for National Statistics’ Privacy Preserving Record Linkage.

Harron K, Goldstein H, Dibben C. (2016). Methodological Developments in Data Linkage. Wiley, ISBN: 9781118745878

Kum HC, Krishnamurthy A, Machanavajjhala A, Reiter MK, Ahalt S. (2014). Privacy preserving interactive record linkage (PPIRL). J Am Med Inform Assoc. 2014;21(2):212–220. doi:10.1136/amiajnl-2013-002165

NIST (2015). Secure Hash Standard (SHS), FIPS PUB 180-4. (accessed 06.03.20)

ONS (2013). Beyond 2011: Matching Anonymous Data.

ONS (2019). Methodology report on household and associative coverage matching for the 2021 Census. Internal report available on request.

Paar C, Pelzl J, Preneel B. (2009) Understanding Cryptography: A Textbook for Students and Practitioners. Springer ISBN-10: 3642041000

Schnell R, Bachteler T, Reiher J. (2009) Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making. 2009, 9:41 doi:10.1186/1472-6947-9-41

Wagner R, Fischer M. (1974) The string to string correction problem. Journal of the ACM, 21(1): 168-173, 1974

Winkler, W. (2016) Probabilistic linkage, chapter 2 in Harron et al (2016)