MoJ: Splink Master Record

Splink is an open-source tool for probabilistic data linkage that enables fast, accurate and scalable linkage and deduplication of data that lacks unique IDs.

1. Summary

1 - Name

Splink

2 - Description

Splink is an open-source Python library for probabilistic record linkage. It is designed for linking and deduplicating datasets that lack a unique identifier.

It is used to link persons between the Ministry of Justice’s courts, prisons and probation datasets in both batch and real-time deployments. For example:

  • It is used weekly to refresh linked datasets for statistical analysis.
  • It is used in courts to find probation records associated with individuals coming to court.
  • It is being piloted as part of Core Person Record, a product that aims to create a unique identifier for persons across prisons, probation and the criminal courts.
  • It is used to find Police National Computer (PNC) numbers associated with individuals, in order to request relevant arrest information from the police.

3 - Website URL

https://github.com/moj-analytical-services/splink

4 - Contact email

data_linking_team@justice.gov.uk

Tier 2 - Owner and Responsibility

1.1 - Organisation or department

Ministry of Justice

1.2 - Team

Data Linking team

1.3 - Senior responsible owner

Chief Data Scientist

1.4 - Third party involvement

No

Tier 2 - Description and Rationale

2.1 - Detailed description

Splink compares records of individuals within and across case management systems throughout the justice system (prison, probation and courts). When comparing the personal information (names, dates of birth, addresses etc.) of these individuals, the Splink model produces a probability score that the two records refer to the same person. This is known as probabilistic data linkage. Any record pairs with a match probability above a specified threshold are then considered as the same person, with this person being assigned a new linked identifier.

Details of the statistical model used can be found in the documentation website here: https://moj-analytical-services.github.io/splink/topic_guides/theory/fellegi_sunter.html
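
The scoring and thresholding described above can be sketched in a few lines. This is a minimal illustration of the Fellegi-Sunter calculation, not Splink's own code: the field names, the m and u values, the prior and the threshold are all invented for the example, and a real model typically uses several levels of agreement per field rather than a simple agree/disagree.

```python
from math import log2

# Hypothetical parameters for illustration only (not the MoJ model's values).
# m = P(field agrees | the two records are the same person)
# u = P(field agrees | the two records are different people)
PARAMS = {
    "first_name": {"m": 0.90, "u": 0.01},
    "surname": {"m": 0.92, "u": 0.005},
    "dob": {"m": 0.95, "u": 0.001},
}
PRIOR = 0.0001  # assumed probability that two randomly chosen records match
THRESHOLD = 0.95  # assumed; in practice chosen per use case


def match_weight(agreements: dict) -> float:
    """Sum of log2 Bayes factors: the prior odds plus one term per compared field."""
    weight = log2(PRIOR / (1 - PRIOR))
    for field, agrees in agreements.items():
        m, u = PARAMS[field]["m"], PARAMS[field]["u"]
        weight += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return weight


def match_probability(agreements: dict) -> float:
    """Convert the total match weight back into a probability."""
    odds = 2 ** match_weight(agreements)
    return odds / (1 + odds)


all_agree = match_probability({"first_name": True, "surname": True, "dob": True})
dob_differs = match_probability({"first_name": True, "surname": True, "dob": False})
print(all_agree > THRESHOLD, dob_differs > THRESHOLD)  # prints: True False
```

With these assumed values, a pair agreeing on all three fields scores well above the threshold, while a disagreement on date of birth pulls the same pair far below it.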

Splink is integrated into a number of systems. At the moment, it is used for:

  • JustLink datasets: Each week, data is linked between prisons, probation and courts. The result is a fully anonymised lookup table that contains the links between records (but NOT the personal identifiers). The data is available to analysts to perform statistical analyses, such as understanding how long it takes cases to progress between courts.

  • Probation in Court: To find probation records associated with an individual when they come to court, as part of the process of preparing a case for sentence.

  • Core Person Record: This is a real-time linkage system, currently being piloted, that aims to create a unique identifier linking together each person across MoJ’s criminal data systems (specifically courts, prisons and probation). In this system, as records are created and updated, Splink is used to predict whether the record links to other existing records in the system.

  • Police data sharing pilot: A regular data sharing system being piloted in Essex. Splink is used to identify Police National Computer (PNC) numbers associated with individuals supervised by North Essex Probation Delivery Unit. Those numbers are sent to the police each day to identify whether any arrests have occurred.

Whilst the Splink algorithm uses personal identifiers to match records, it does not make personal identifiers more widely visible. Access to the PII itself remains limited to authorised staff under existing controls.

2.2 - Benefits

The introduction of Splink increases the speed and accuracy of linkage. The availability of linked data has a variety of benefits, such as:

  • More reliable research and policy analysis: analysts can now generate insights based on linked justice system data rather than fragmented datasets, supporting evidence-based decision-making.
  • Reduction in data duplication: by deduplicating records, Splink ensures that justice system statistics and analytics reflect unique individuals rather than multiple records of the same person.
  • Improved operational efficiency: the ability to track individuals across the justice system more accurately supports better resource allocation and case management.

These improvements benefit the public by furthering the MoJ’s priority outcomes: delivering swift access to justice, protecting the public, and reducing reoffending.

2.3 - Previous process

Prior to the introduction of Splink, linking records across the justice system was done by different teams in different ways.

Linked data for analysts relied on a time-consuming process of deterministic linkage. As a result, it took up to a year before linked data was available to analysts, slowing down important work such as evaluating the effectiveness of interventions.

For linking and deduplicating data between operational systems, a mixture of deterministic linkage and human intervention has been used.

2.4 - Alternatives considered

Before we built Splink, we tested several existing open-source packages (e.g. the R fastLink package and the Python recordlinkage package) for suitability. None of the existing packages for probabilistic record linkage worked at the scale required for data linkage in government (tens to hundreds of millions of records). Therefore the Internal Data Linking team in the Ministry of Justice built Splink.

Deterministic (i.e. rules-based) linkage options were also considered. However, deterministic linkage was deemed unsuitable due to its:

  • Inability to capture nuance
  • Tendency to produce a high rate of false negatives (missed links)
  • Difficulty in managing large numbers of rules for complex datasets
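
The false negative and rule management problems can be illustrated with a small sketch. The records and both rules below are invented for the example and do not reflect any rule set actually used in the department.

```python
# Hypothetical records; names and values are invented for illustration.
prison_record = {"first_name": "JOHN", "surname": "SMITH", "dob": "1985-03-07"}
probation_record = {"first_name": "JON", "surname": "SMITH", "dob": "1985-03-07"}


def rule_exact(a, b):
    """Rule 1: link only if first name, surname and date of birth all match exactly."""
    return all(a[field] == b[field] for field in ("first_name", "surname", "dob"))


def rule_initial(a, b):
    """Rule 2: added to catch misspelt first names, using only the first initial."""
    return (
        a["first_name"][:1] == b["first_name"][:1]
        and a["surname"] == b["surname"]
        and a["dob"] == b["dob"]
    )


missed_by_rule_1 = not rule_exact(prison_record, probation_record)
caught_by_rule_2 = rule_initial(prison_record, probation_record)
print(missed_by_rule_1, caught_by_rule_2)  # prints: True True
```

A single spelling variation defeats the exact rule, and each further data quality problem (a transposed date, a changed surname) needs another hand-written rule of this kind.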

Tier 2 - Deployment Context

3.1 - Integration into broader operational process

Splink is used to determine which records pertain to the same individual across a range of datasets.

The process of matching is usually automatic, but in some cases, Splink is used to show a list of possible matches to a human via a graphical user interface (GUI).

The tool is capable of providing detailed information to the user on exactly why a particular prediction was made (see https://moj-analytical-services.github.io/splink/demos/tutorials/06_Visualising_predictions.html). This capability is predominantly used at the model design and training stage, as opposed to inference.

The group of records pertaining to an individual is called a ‘cluster’. Clusters are useful both for aggregate statistical use cases, such as identifying how long it takes on average for cases to progress through the court system, and for operational use cases, such as preparing a case that is being seen in the courts for sentence.

3.2 - Human review

Human review is typically used during the model training and quality assurance process to quantify accuracy. In use cases where Splink provides a list of possible matches to a human, the human then makes the final decision.

3.3 - Frequency and scale of usage

Splink is used continuously as new and updated data on persons in the justice system is received by the department.

3.4 - Required training

Not applicable - in most cases, the process of matching is automatic and happens in the background. Relevant training is in place for staff who perform manual merges and unmerges.

3.5 - Appeals and review

There are no complaint procedures specific to Splink itself since it does not directly make decisions about individuals. However, there is a formal complaints procedure for the various parts of the justice system which could be used if something was perceived to have gone wrong. Where linkage errors are identified, it is possible to record manual overrides in the system to prevent them recurring.

Tier 2 - Tool Specification

4.1.1 - System architecture

https://github.com/moj-analytical-services/splink

4.1.2 - System-level input

Tabular data containing personally identifiable information from the courts, prisons and probation system

4.1.3 - System-level output

Match scores for pairs of records. If the score exceeds a certain threshold, the model matches the records together. These predictions are then turned into ‘clusters’ - groups of records from the various input data systems which pertain to the same individual. These clusters provide a succinct summary of the links found by the model.
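
One common way to turn pairwise matches into clusters is to treat every pair above the threshold as an edge in a graph and take the connected components. The sketch below does this with a small union-find structure. It is an illustration under assumptions rather than the production implementation: the record identifiers, scores and threshold are invented.

```python
def cluster_records(scored_pairs, threshold):
    """Group record IDs into clusters using only pairs that score above the threshold."""
    parent = {}

    def find(record_id):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path halving
            record_id = parent[record_id]
        return record_id

    for left, right, probability in scored_pairs:
        root_left, root_right = find(left), find(right)
        if probability > threshold and root_left != root_right:
            parent[root_right] = root_left  # merge the two clusters

    clusters = {}
    for record_id in list(parent):
        clusters.setdefault(find(record_id), set()).add(record_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scored pairs: (record ID, record ID, match probability).
scored_pairs = [
    ("nomis-1", "delius-7", 0.99),
    ("delius-7", "libra-3", 0.97),
    ("nomis-2", "libra-3", 0.12),
]
clusters = cluster_records(scored_pairs, threshold=0.95)
print(clusters)  # prints: [['delius-7', 'libra-3', 'nomis-1'], ['nomis-2']]
```

Note that nomis-1 and libra-3 end up in the same cluster even though they were never scored against each other directly: the link is implied through delius-7.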

4.1.4 - Maintenance

Continuous monitoring of matching results to identify errors. Re-training does not follow a regular schedule - instead it would be done as part of continuous improvement or in response to a rise in errors.

4.1.5 - Models

Splink uses a single model for linkage - the Fellegi-Sunter model. This model is based on Bayesian statistics, is well researched, and is understood to be the industry standard for record linkage.

For more on how the Fellegi-Sunter algorithm works, see the Record Linkage Theory section of the Splink docs site (https://moj-analytical-services.github.io/splink/topic_guides/theory/fellegi_sunter.html) and the academic paper (https://imai.fas.harvard.edu/research/files/linkage.pdf) used as the basis for the implementation of the algorithm.

Tier 2 - Model Specification

4.2.1 - Model name

Splink

4.2.2 - Model version

Latest version as of June 2025 is Splink v4.0.8

4.2.3 - Model task

Data linking and deduplication, i.e. determining whether a pair of records pertains to the same individual or different individuals.

4.2.4 - Model input

Pairs of records containing personally identifiable information from the courts, prisons and probation system.

4.2.5 - Model output

Match scores for the pairs of records. If the score exceeds a certain threshold, the model matches the records together.

4.2.6 - Model architecture

Probabilistic Fellegi-Sunter record linkage model trained using unsupervised learning (the Expectation Maximisation (EM) algorithm). The EM approach is documented here: https://www.robinlinacre.com/em_intuition/ and a detailed tutorial of how model training works is provided here: https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html
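
To give a feel for how unsupervised training works, the sketch below runs Expectation Maximisation on a heavily simplified Fellegi-Sunter model with three fields that either agree (1) or disagree (0). It is not Splink's implementation. The pattern counts are synthetic: they are the counts one would expect from 10,000 record pairs if 20% were true matches, m were 0.9 and u were 0.1 for every field, so the estimates can be checked against known values. The function name and starting values are also assumptions.

```python
from itertools import product


def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(p, eps), 1 - eps)


def estimate_parameters(pattern_counts, n_iter=200, lam=0.1, m_init=0.9, u_init=0.1):
    """EM for a binary-agreement Fellegi-Sunter model, assuming fields are independent.

    pattern_counts maps an agreement pattern such as (1, 0, 1) to the number
    of record pairs showing that pattern. No match labels are needed.
    """
    n_fields = len(next(iter(pattern_counts)))
    n_pairs = sum(pattern_counts.values())
    m, u = [m_init] * n_fields, [u_init] * n_fields
    for _ in range(n_iter):
        # E-step: probability that a pair with each pattern is a match.
        posterior = {}
        for gamma in pattern_counts:
            p_match, p_non = lam, 1 - lam
            for k, agrees in enumerate(gamma):
                p_match *= m[k] if agrees else 1 - m[k]
                p_non *= u[k] if agrees else 1 - u[k]
            posterior[gamma] = p_match / (p_match + p_non)
        # M-step: re-estimate lam, m and u from the expected match counts.
        n_match = sum(c * posterior[g] for g, c in pattern_counts.items())
        lam = clamp(n_match / n_pairs)
        for k in range(n_fields):
            agree_match = sum(c * posterior[g] for g, c in pattern_counts.items() if g[k])
            agree_non = sum(c * (1 - posterior[g]) for g, c in pattern_counts.items() if g[k])
            m[k] = clamp(agree_match / n_match)
            u[k] = clamp(agree_non / (n_pairs - n_match))
    return lam, m, u


# Synthetic counts keyed by the number of agreeing fields (see lead-in).
COUNT_BY_AGREEMENTS = {3: 1466, 2: 234, 1: 666, 0: 5834}
pattern_counts = {g: COUNT_BY_AGREEMENTS[sum(g)] for g in product((0, 1), repeat=3)}

lam, m, u = estimate_parameters(pattern_counts)
print(round(lam, 2), [round(x, 2) for x in m], [round(x, 2) for x in u])
```

The algorithm only ever sees agreement patterns, never a label saying which pairs are true matches, which is what makes the training unsupervised.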

4.2.7 - Model performance

Data linkage is an unsupervised problem, so traditional machine learning accuracy metrics (e.g. precision, recall, F1 score) cannot be relied upon to reflect the true performance of a model.

Clerical labelling (i.e. manual labelling by a human) has been performed on a sample of record pairs to provide a reference point for results generated by the model. These labels cannot be considered as a “ground truth” (such as in a supervised problem), as a human cannot be sure if two records match or not. The results of clerical labelling vary depending on the person labelling the data. Instead, metrics derived from these labels provide a rough guide of whether the linkage matches what a person would expect.

Model performance is also assessed by spot-checking record pairs (https://moj-analytical-services.github.io/splink/topic_guides/evaluation/edge_overview.html#spot-checking-pairs-of-records), where the outcomes for different types of matches can be assessed against what a human would expect. This is generally targeted for records close to a linkage threshold (over which a link is deemed to be valid). Tools, such as the Comparison Viewer Dashboard (https://moj-analytical-services.github.io/splink/charts/comparison_viewer_dashboard.html), are provided within Splink to facilitate this exploration.
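
As a rough sketch of both checks, the code below compares model scores with clerical labels at a threshold and then selects the pairs closest to that threshold for spot-checking. The scores, labels, threshold and margin are invented for the example, and the resulting figures measure agreement with a human labeller rather than true accuracy.

```python
def compare_with_clerical_labels(scored_pairs, threshold):
    """Count agreement between model predictions and clerical labels."""
    counts = {"true_positives": 0, "false_positives": 0,
              "false_negatives": 0, "true_negatives": 0}
    for probability, clerical_match in scored_pairs:
        predicted_match = probability > threshold
        if predicted_match and clerical_match:
            counts["true_positives"] += 1
        elif predicted_match:
            counts["false_positives"] += 1
        elif clerical_match:
            counts["false_negatives"] += 1
        else:
            counts["true_negatives"] += 1
    predicted = counts["true_positives"] + counts["false_positives"]
    labelled = counts["true_positives"] + counts["false_negatives"]
    counts["precision"] = counts["true_positives"] / predicted if predicted else None
    counts["recall"] = counts["true_positives"] / labelled if labelled else None
    return counts


def pairs_near_threshold(scored_pairs, threshold, margin):
    """Pairs whose score is within `margin` of the threshold: candidates for spot checks."""
    return [pair for pair in scored_pairs if abs(pair[0] - threshold) <= margin]


# Hypothetical (match probability, clerical label) pairs.
labelled_pairs = [
    (0.99, True), (0.97, True), (0.96, False),
    (0.93, True), (0.40, False), (0.02, False),
]
summary = compare_with_clerical_labels(labelled_pairs, threshold=0.95)
to_review = pairs_near_threshold(labelled_pairs, threshold=0.95, margin=0.03)
print(summary["true_positives"], summary["false_positives"], len(to_review))  # prints: 2 1 3
```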

4.2.8 - Datasets and their purposes

  • DELIUS (probation)
  • NOMIS (prison)
  • Common Platform (criminal courts)
  • LIBRA (magistrates’ court)

Tier 2 - Development Data

4.3.1 - Development data description

  • DELIUS (Probation)
  • NOMIS (Prison)
  • Common Platform (Criminal Courts)
  • LIBRA (Magistrates’ Court)

For JustLink only:

  • FamilyMan (Family Courts)
  • CaseMan (Civil Courts)

None of these datasets are open data.

4.3.2 - Data modality

All datasets are tabular text data

4.3.3 - Data quantities

Size of data pre-deduplication:

  • LIBRA: 19.6m records
  • Common Platform: 2.9m records
  • NOMIS: 2.2m records
  • DELIUS: 2.4m records
  • FamilyMan: 20.9m records
  • CaseMan: 17.7m records

Data was not split for training.

4.3.4 - Sensitive attributes

For most applications, the core fields used are as follows:

  • Date of birth
  • Name(s)
  • Current and past addresses
  • Sentence date(s)

In some applications, such as deduplicating prison data, we also use additional fields:

  • Ethnicity
  • Birth place
  • Nationality
  • Gender
  • Person’s height and weight

These attributes are used to identify whether pairs of records pertain to the same person.

4.3.5 - Data completeness and representativeness

The data is the full population of individuals in the justice system, not a sample.

The data suffers from a variety of data quality problems, including missingness, typos, and the use of aliases or incorrect data.

The purpose of Splink is to address these data quality problems.

4.3.6 - Data cleaning

Data cleaning and standardisation is performed to ensure records are comparable. The actions undertaken include, for instance: upper casing, standardisation of punctuation, removal of invalid postcodes, and standardising the format of dates of birth to yyyy-mm-dd.
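
A minimal sketch of cleaning steps of this kind is shown below. The function names, the list of accepted input date formats and the postcode pattern (a simplified shape check, not the full official postcode rules) are assumptions made for illustration and are not the department's actual cleaning code.

```python
import re
from datetime import datetime

# Simplified UK postcode shape check; an assumption, not the full official rules.
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")
# Assumed input formats for dates of birth.
DOB_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def clean_name(name):
    """Upper-case, drop apostrophes, turn other punctuation into spaces."""
    if not name:
        return None
    cleaned = re.sub(r"['’]", "", name.upper())
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return cleaned or None


def clean_postcode(postcode):
    """Return the upper-cased postcode, or None if it does not look valid."""
    if not postcode:
        return None
    candidate = " ".join(postcode.upper().split())
    return candidate if POSTCODE_PATTERN.match(candidate) else None


def clean_dob(dob):
    """Standardise a date of birth to yyyy-mm-dd, or None if it cannot be parsed."""
    if not dob:
        return None
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(dob.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


print(clean_name("O’Brien-Smith "), clean_postcode("sw1a 2aa"), clean_dob("07/03/1985"))
# prints: OBRIEN SMITH SW1A 2AA 1985-03-07
```

Returning None for values that cannot be standardised lets the linkage model treat them as missing rather than as disagreements.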

4.3.7 - Data collection

Individuals’ personally identifiable data is routinely collected in the department’s administrative data systems for the purposes of managing court cases and offenders. A key reason the department and its agencies collect personal identifiers is to ensure we can be confident about the identity of individuals.

4.3.8 - Data access and storage

Development data is stored securely in the department’s Analytical Platform (https://user-guidance.analytical-platform.service.justice.gov.uk/).

The Analytical Platform is hosted in a cloud-based ecosystem that is easy to access remotely from all MoJ IT systems. It is designed for data at security classifications OFFICIAL and OFFICIAL-SENSITIVE. We follow NCSC Cloud Security Principles, implementing features such as:

  • two-factor authentication
  • data encryption at rest and in transit
  • granular access control
  • extensive tracking of user behaviour, user privilege requests/changes and data flows
  • multiple isolation levels between users and system components

The data is accessible only to staff on the data linking team who need access for model development and quality assurance. There is no de-identification because the personal identifiers are essential to the task of predicting whether two records pertain to the same person.

4.3.9 - Data sharing agreements

N/A - all data is internal to MoJ

Tier 2 - Risks, Mitigations and Impact Assessments

5.1 - Impact assessments

5.2 - Risks and mitigations

Key risk 1: Errors can be made in data linkage.

Detailed clerical review has been used to mitigate this risk and attempt to quantify error rates. Given existing approaches to the problem of data linkage, we believe the use of Splink reduces (but does not totally eliminate) errors and so reduces the risks associated with linkage errors.

We have automated processes in place to monitor linkage quality and detect anomalies, and are working to improve these processes to more quickly identify and rectify errors.

Key risk 2: Linkage may be more or less accurate for different groups. This could occur, for instance, if more typos are made in records for people of a certain demographic group, or if there is greater missingness in the data for certain groups. Work has been undertaken to understand bias in data linkage. Again, given existing approaches to the problem of data linkage, we do not believe this problem is made worse by the use of Splink.

Mitigation: Differential linkage rates occur when we have poor data quality on people’s identities. The primary purpose of the Core Person Record project is to improve data quality across all records, particularly those that are currently difficult to link.

Updates to this page

Published 6 October 2025