Guidance

Glossary

Updated 16 July 2021

This is a glossary for terms found in Joined up data in government: the future of data linking methods.

Term Acronym Definition
Active learning   A branch of semi-supervised learning where at each iteration the system identifies the unlabelled items that are most likely to be informative and asks a user to provide labels/responses from which it can learn.
Aggregate data   Record level data summed to create a table. Aggregate data includes frequency tables and magnitude tables.
Anonymisation   A potentially complex process that transforms personal and/or identifiable data into not identifiable data. This process requires that identifiers be removed, obscured, aggregated and/or altered in some way, and that combinations of variables that allow indirect identification are protected by one or more methods of statistical disclosure control.
Attrition   In research studies, attrition is a type of missingness that occurs when participants leave during a study.
Bayesian latent class model   A Bayesian model where the parameter space includes latent variables; used in classical record linkage.
Binary decision   A decision in machine learning that can take two possible values, such as true/false or linked/not linked.
Bit vector   A data structure within which each element is a bit, having values of either 0 or 1.
Blocking   Blocking is used in data linkage before applying a probabilistic linkage algorithm. The blocking algorithm brings together possible pairs of records, for example those that agree on postcode. This reduces the size of the search space as it is no longer necessary to give a score to every possible pair of records. However, it is important to ensure that the blocking does not cause genuine matches to be missed because they are never blocked together. To mitigate this risk, multiple blocking passes are often used. Variables that are used in blocking should not be used in the probabilistic matching algorithm since there will always be agreement between them.
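As an illustrative sketch of the idea (the record fields here are invented), blocking on postcode can be expressed in a few lines of Python: candidate pairs are only formed within each postcode block, rather than across the whole data set.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical records: (id, name, postcode)
records = [
    (1, "John Smith",  "AB1 2CD"),
    (2, "Jon Smith",   "AB1 2CD"),
    (3, "Mary Jones",  "EF3 4GH"),
    (4, "Marie Jones", "EF3 4GH"),
    (5, "Ann Brown",   "IJ5 6KL"),
]

def block_pairs(records, key_index):
    """Group records by a blocking key, then form candidate pairs within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key_index]].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

candidates = block_pairs(records, key_index=2)  # block on postcode
print(len(candidates))  # -> 2 (rather than all 10 possible pairs of 5 records)
```

Note how the search space shrinks from 10 possible pairs to 2 candidate pairs; the trade-off, as the definition above notes, is that a genuine match with a mistyped postcode would be missed unless a second blocking pass on a different variable is added.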
Bloom filter   Data structures used to allow comparisons of encrypted identifiers.
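A minimal sketch of how a Bloom filter supports comparison of encoded names, assuming names are split into bigrams and hashed with unkeyed SHA-256; production privacy-preserving linkage would use keyed hashes (such as HMACs with a secret key) so that an outsider cannot rebuild the filter.

```python
import hashlib

def bigrams(s):
    """Set of overlapping 2-character substrings of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_filter(name, size=64, num_hashes=3):
    """Toy Bloom filter: set num_hashes bit positions for each bigram.
    (Unkeyed hashing is for illustration only; real PPRL uses keyed hashes.)"""
    bits = [0] * size
    for gram in bigrams(name):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits

def dice_bits(a, b):
    """Dice similarity of two bit vectors: 2 x shared 1-bits / total 1-bits."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

sim = dice_bits(bloom_filter("smith"), bloom_filter("smyth"))
assert sim > 0  # the shared bigrams "sm" and "th" set overlapping bits
```

Because similar names share bigrams, their filters share set bits, so two parties can estimate name similarity from the encoded vectors alone.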
Cleartext   Readable data that is stored in an unencrypted format.
Clerical linkage   Manual matching used to evaluate ambiguous or automated matching procedures. Also known as clerical review.
Confidentiality   The right or expectation of an individual or organisation to not have information about them disclosed.
Cryptography   Computer science techniques for preventing the disclosure of sensitive data to unauthorised parties by enciphering messages to make them unreadable.
Data subject   Person whose personal data is being collected, processed and stored. Under relevant legislation, data subjects do not include the deceased or those who cannot be identified or distinguished from others.
De-duplication   In data linkage this refers to the process of finding and eliminating two or more records that relate to the same entity.
Derive and conquer method   A non-exact matching method designed to use rule-based deterministic matching to link hashed data between two data sets. This was specifically created for linkage of person entities between Department for Work and Pensions administrative data, HM Revenue and Customs administrative data and ONS administrative and survey data sets.
Deterministic matching   Deterministic matching uses a set of rules to decide whether or not a pair of records should be matched. If there is a unique identifier such as national insurance number, the rule might simply be match records if the unique identifier matches. In other cases, match-keys are used to define a set of rules that get sequentially less strict.
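A toy sketch of such a rule cascade, with invented record fields and rules (a real implementation would standardise the identifiers first): rules are tried in order of strictness and the first one that fires decides the match.

```python
# Hypothetical records; field names and rules are illustrative assumptions.
a = {"nino": "QQ123456C", "name": "john smith", "dob": "1980-01-02", "postcode": "AB1 2CD"}
b = {"nino": None,        "name": "john smith", "dob": "1980-01-02", "postcode": "AB1 2CD"}

def deterministic_match(r1, r2):
    """Apply matching rules from strictest to least strict; return the first that fires."""
    if r1["nino"] and r1["nino"] == r2["nino"]:
        return "rule 1: exact NI number"
    if (r1["name"], r1["dob"], r1["postcode"]) == (r2["name"], r2["dob"], r2["postcode"]):
        return "rule 2: name + date of birth + postcode"
    if (r1["name"], r1["dob"]) == (r2["name"], r2["dob"]):
        return "rule 3: name + date of birth"
    return None  # no rule fired: not matched

print(deterministic_match(a, b))  # -> rule 2: name + date of birth + postcode
```

Here record b has no NI number, so rule 1 cannot fire, but the pair still links on the less strict second rule.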
Dice similarity/ Sørensen–Dice coefficient   A statistic used to gauge the similarity of two samples.
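For two sets A and B the coefficient is 2|A ∩ B| / (|A| + |B|). A small sketch, here applied to the bigram sets of two name spellings:

```python
def bigrams(s):
    """Set of overlapping 2-character substrings (q-grams with q = 2)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Sorensen-Dice coefficient of two sets: 2|A n B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

# "smith" -> {sm, mi, it, th}; "smyth" -> {sm, my, yt, th}; 2 bigrams shared
print(dice(bigrams("smith"), bigrams("smyth")))  # -> 0.5
```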
Distributed computing   Distributed computing is a model in which components of a software system are shared among multiple computers to improve efficiency and performance.
Edge effects   Factors in the collection and manipulation of longitudinal administrative data that may affect their use for statistical outputs. These may be categorised into:
a) Left-focused initial quality improvements. In the early stages of administrative data collection and processing there can be initial quality issues which will become apparent and be corrected as the data system beds down to a ‘business-as-usual’ production mode. The effect may be a slight degradation of data quality in the earliest data which diminishes over time.
b) Right-focused incompleteness. Whenever a data extract is taken the data become right-censored; there will be events or objects that sit outside, after the point at which the data cut is taken. This refers to incomplete information within the data extract, either in objects or in attributes, which arises because of the data cut-off. This could render the object data unusable or mean that the latest events in the data cannot be used because they are incomplete.
c) Right-concentrated retrospective edits. In a live processing environment, it is possible that data errors may be corrected retrospectively. The scale of retrospective editing is important because it alerts users to the quality deficit created by right-focused incompleteness. Logically we could expect retrospective edits to be focused around the later dates that the extract covers. In practice, right-focused incompleteness may make the latest data in an extract unreliable and therefore unusable. If potentially incomplete records are used to produce statistics, the statistics could be considered ‘provisional’ and subject to confirmation when a later extract becomes available.
Edit distances   The minimum number of changes that need to be made to one (text) string to convert it into another string. For example, the Levenshtein distance between John and Jonathan may be of interest when linking administrative data in which names are recorded inconsistently.
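A minimal Levenshtein implementation (the standard Wagner–Fischer dynamic programming approach), confirming that four single-character edits turn John into Jonathan:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string s into string t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("John", "Jonathan"))  # -> 4 (four insertions)
```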
Encoding   The process of converting data from one form to another.
Encryption   Scrambling information to prevent unauthorised disclosure or modification using mathematical techniques.
Expectation Maximisation EM A statistical method for estimating the most likely value for parameters in a machine learning model that represents a data set.
False positive   If two records are incorrectly matched, this is called a false positive.
F-measure   A single overall linkage quality indicator calculated as the harmonic mean of precision and recall.
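The harmonic mean of precision and recall is often reported as the F1 score. A sketch with invented counts of true positives, false positives and false negatives:

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (the F1 score)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Invented example: 80 correct links, 20 wrong links, 20 missed matches
score = f_measure(true_pos=80, false_pos=20, false_neg=20)
print(round(score, 3))  # -> 0.8 (precision and recall are both 0.8)
```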
Frequency table   Table of counts often used to display data collected in surveys and censuses. Each cell in the table is a frequency or count. Examples would be the number of males aged over 65 living in a particular geography such as Output Area or the number of businesses in a specified sector in a Local Authority.
Hashing   The process of generating a value or values from a string of text using a mathematical function. This is used in data processing to secure entities within data sets by pseudonymising all personally identifiable information.
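A minimal sketch using SHA-256 from Python's standard library (the identifier value is invented). Note that hashing a low-entropy identifier without a salt or secret key leaves it open to dictionary attacks, which is why salting (see Salt) or keyed hashing is used in practice.

```python
import hashlib

# Hash a hypothetical identifier so the raw value need not appear in the
# linked data set; the same input always produces the same 64-character digest.
nino = "QQ123456C"  # invented, illustrative identifier
digest = hashlib.sha256(nino.encode()).hexdigest()
print(len(digest))  # -> 64
```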
Homomorphic encryption   A form of encryption that allows any data to remain encrypted while it’s being processed and manipulated.
Imputation   Imputation is the process of using an algorithm to estimate an unknown value.
In the clear   See cleartext.
Latent class model   A model whose parameter space includes latent (unobserved) variables; used in classical record linkage. See also Bayesian latent class model.
Linkage quality   Linkage quality is measured using metrics such as precision and recall. Ideally it would also describe any biases in the linkage, for example reporting on the distribution of matches and residuals by age group, ethnicity, gender and so on.
Machine learning   An application of artificial intelligence that allows systems to learn automatically and improve from experience without being explicitly programmed. Machine learning algorithms receive input data, using statistical analysis to look for patterns in the data whilst constantly updating outputs as new data become available.
Match-keys   Match-keys are used in the deterministic matching of administrative data sets. They are created by combining identifying variables (or parts of them) such as name, sex, date of birth and postcode in all of the data sets to be matched.
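As an illustrative sketch, one possible match-key (this particular combination of fragments is an assumption for illustration, not an official specification) concatenates cleaned pieces of the identifying variables, so that records agreeing on the key can be linked deterministically:

```python
import re

def match_key(forename, surname, sex, dob, postcode):
    """One hypothetical match-key: first initial + first 4 letters of surname
    + sex + year of birth + postcode district."""
    clean = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return "|".join([
        clean(forename)[:1],
        clean(surname)[:4],
        sex.upper(),
        dob[:4],                      # year from an ISO date string
        clean(postcode.split()[0]),   # postcode district
    ])

k1 = match_key("Jonathan", "Smith", "m", "1980-01-02", "AB1 2CD")
k2 = match_key("Jon",      "Smyth", "M", "1980-01-02", "AB1 2CD")
print(k1)        # -> j|smit|M|1980|ab1
print(k1 == k2)  # -> False (the surname spellings diverge at the third letter)
```

A looser key in the same cascade might drop the surname fragment, which is how successive match-keys become sequentially less strict.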
Match-rate   A measure of the proportion of records that have been matched using a particular implementation of a matching algorithm. The match-rate is not a measure of quality as it does not reflect the accuracy of the linkage.
Metaphone/double metaphone   See Soundex for context. Metaphone expands on Soundex with a wider set of English pronunciation rules and allows keys of varying length, whereas Soundex uses a fixed-length key. Double metaphone refines the matching by returning both a primary and a secondary code for each name, allowing for greater ambiguity.
Microdata   Microdata (also known as record-level data or row-level data) are data on the characteristics of units of a population, such as individuals, households, or establishments, collected by a census, sample survey, or from administrative data. The data are in the form of tables where each row corresponds to an individual person, business, other statistical unit or event.
Missed match   Records belonging to the same subject that have not been linked together. Also known as a false negative.
Missing at random MAR When the probability of missing data can be explained by variables within the available data. For example, when a sample is taken from a population and the probability of being included within that population depends on a known variable.
Missingness   The manner in which data are missing from a sample of a population.
M-probability   The probability that attributes match given they are related to the same unit.
Multiple imputation   An imputation method where each missing value is imputed M times to produce M datasets which reflect the inherent uncertainty in the imputation. The M datasets can be combined to obtain inferences with valid standard errors and confidence intervals.
Obfuscation   A data security technique that copies and scrambles sensitive data, often via encryption, as a means of concealing the personally identifying information within it.
Personal data   Information relating to an identified or identifiable living person.
Personally Identifiable Information PII Any data that can be used to identify an individual.
Perturbative methods   Changes have been made to some values of the data. This includes methods such as rounding and the addition of noise.
Precision   The precision of a linkage algorithm is a measure of how accurate the matches are. Precision is the proportion of links made that are true positives.
Privacy-Preserving Record Linkage PPRL A technique for identifying and linking records that correspond to the same entity across several data sources held by different parties, without revealing any sensitive information about these entities.
Probabilistic matching   Probabilistic matching algorithms calculate scores for pairs of records. The scores are calculated by assigning agreement and disagreement weights to each of the matching variables. These weights can be generated using a training set where the match status of the pairs is known, or by using an Expectation Maximisation algorithm. A technique called blocking is generally used to generate candidate linking pairs first, since linking every record in the first data set to every record in the second data set generates too many candidate pairs, most of which are not matches. When every pair of records has been given a score, the record pairs are arranged in order of score (highest to lowest). A threshold is found, above which all record pairs are accepted as matches. In some cases, a lower threshold is also found below which all record pairs are rejected as matches. Records in between the two thresholds are clerically reviewed to determine their match status.
Processing   Any operation which is performed on personal data e.g. storage, dissemination, destruction.
Pseudonymisation   The initial step when protecting microdata. This is the act of removing direct identifiers from a record and replacing them with a meaningless identifier (such as a row number). For example, names, addresses and National Insurance numbers will be removed as the first stage of the anonymisation process. Although direct identifiers have been removed, the data are not protected: an intruder could still identify a record with known characteristics which would be uncommon in the data. This process is sometimes incorrectly referred to as anonymisation; it should be noted that this is only one part of the anonymisation process.
Pseudo-Random Number Generator PRNG An algorithm that produces a sequence of numbers approximating the properties of random numbers; used, for example, when encoding identifiers.
PySpark   The Python Application Programming Interface for the distributed computing system Apache Spark.
Q-gram   A substring of length q taken from a longer text string; q-grams (for example bigrams, where q = 2) are used when comparing strings approximately.
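A one-line sketch of q-gram extraction:

```python
def qgrams(s, q):
    """All overlapping substrings of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("john", 2))  # -> ['jo', 'oh', 'hn']
print(qgrams("john", 3))  # -> ['joh', 'ohn']
```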
Random hashing   Using new hash functions at random from a family of hash functions.
Random noise   Random numbers that are generated to be added to raw data values. Noise may be positive or negative. It is often chosen from a fixed statistical distribution centred at 0, e.g. the Laplace distribution.
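A sketch of Laplace noise using only the standard library, relying on the fact that the difference of two exponential draws with the same scale is Laplace-distributed and centred at 0 (the count being perturbed is invented):

```python
import random

def laplace_noise(scale, rng=random):
    """One draw from a Laplace distribution centred at 0, built as the
    difference of two exponential draws with mean equal to the scale."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

rng = random.Random(42)  # seeded so the sketch is reproducible
true_count = 120         # invented raw value
noisy_count = true_count + laplace_noise(scale=2.0, rng=rng)
```

Over many draws the noise averages out to 0, so aggregate statistics remain approximately correct while individual values are protected.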
Record-level data   (See microdata) data on the characteristics of units of a population, such as individuals, households, or establishments, collected by a census, sample survey, or from administrative data. The data are in the form of tables where each row corresponds to an individual person, business, other statistical unit or event.
Residual record   When two (or more) data sets are linked together, there are inevitably some records that cannot be matched. These are called residual records. Residual records might be true negatives or false negatives.
ROC curve   Receiver operating characteristic curve. Measure of linkage accuracy shown as a graphical representation of sensitivity and specificity.
Salt   In security, a salt is random data added to the input of a hash function so that identical inputs hashed with different salts produce different outputs. This makes it much harder to reverse the hashing using precomputed tables of common input values, increasing the difficulty of recovering the original data.
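A minimal sketch with an invented identifier: the salt is prepended to the input before hashing, so the same value hashed with different salts yields different digests.

```python
import hashlib
import secrets

def salted_hash(value, salt):
    """SHA-256 of the salt concatenated with the value."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

salt = secrets.token_hex(16)  # 16 random bytes as hex

d1 = salted_hash("QQ123456C", salt)                    # invented identifier
d2 = salted_hash("QQ123456C", secrets.token_hex(16))   # a different salt
print(d1 != d2)  # -> True: same input, different salts, different digests
```

For linkage, all parties must use the same (secret) salt so that equal identifiers still produce equal digests across data sets.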
Secure Multiparty Computation MPC/SMPC A cryptographic protocol that distributes computation across multiple parties, where no individual party can see the others’ data.
Secure Research Service SRS The Secure Research Service (SRS), formerly the Virtual Microdata Laboratory (VML), is an ONS facility for providing secure access to sensitive detailed data, typically microdata.
Sensitivity (recall)   The proportion of true-matches that are correctly classified.
SOUNDEX   A phonetic coding algorithm, available as a function in many SQL dialects including Microsoft SQL Server. Used to group similar-sounding words (usually person or place names) together by assigning them the same four-character code, based upon the way in which the words are pronounced by an English speaker.
Specificity   The proportion of true non-matching records that are correctly classified.
Splink   A PySpark record linkage package based on the Fellegi-Sunter model, developed by the Ministry of Justice.
Supervised learning   A form of model learning that makes use of a set of training examples, alongside the correct response for each. The differences between the system’s outputs and the true outputs are used to adapt the learned model.
Synthetic data   Data that do not relate to real statistical units but have the look and structure of real data. They will have been generated from one or more population models, designed to be non-disclosive, and used either for teaching purposes, for testing code, or for use in developing methodology.
Training data   An initial set of data used to help a program understand how to apply technologies like neural networks to learn and produce sophisticated results.
True negative   If records that do not belong to the same entity are correctly not matched, this is called a true negative.
True positive   If two records are matched correctly, this is called a true positive.
Trusted Third Party TTP An independent entity which has no stake in using either the source data or the resulting data, and which guarantees the confidentiality of any personally identifiable data.
Unsupervised learning   A form of model learning that makes use of a set of training examples where there are no labels or responses. Typically used for clustering. Model adaptation is driven by statistical measures describing the ‘quality’ of the clustering.
U-probability   Probability that attributes match given they are from different units.
XOR   A binary operation (“exclusive or”) that outputs true when its two inputs differ.
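A small sketch on bit vectors, where the number of 1s in the XOR is the Hamming distance (the count of positions at which the inputs differ):

```python
# XOR of two bit vectors: a 1 marks each position where the inputs differ.
a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]

diff = [x ^ y for x, y in zip(a, b)]
print(diff)       # -> [0, 1, 1, 0, 0]
print(sum(diff))  # -> 2 (the Hamming distance between a and b)
```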