UK Health Security Agency: TB Country of Birth support tool

A tool to support manual processing of country of birth records, as a part of the invitation for TB testing for those from high-risk countries.

Tier 1 Information

1 - Name

TB Country of Birth support tool

2 - Description

One factor that contributes to the NHS’s decision to offer Tuberculosis (TB) testing to recent migrants is their country of birth. This information is relevant because the prevalence of TB differs greatly between countries. When people register with a General Practitioner (GP) they provide up to 22 characters of free text on where they were born. To support effective data processing, we use large language models as part of a software pipeline to disambiguate countries from these free-text records. Before this tool was used, thousands of these records had to be reviewed manually because they contained different names for countries, misspellings of countries, or sub-country geographies. The tool can transform these inputs into a standard list of countries much more quickly and with a high degree of agreement with manual processing, leaving a reduced number of records to check manually.

3 - Website URL

N/A

4 - Contact email

ai@ukhsa.gov.uk

Tier 2 - Owner and Responsibility

1.1 - Organisation or department

UK Health Security Agency

1.2 - Team

Advanced Analytics

1.3 - Senior responsible owner

Deputy Director Advanced Analytics and Chief Economist

1.4 - External supplier involvement

No

Tier 2 - Description and Rationale

2.1 - Detailed description

The NHS in England offers post-entry Latent Tuberculosis (TB) testing to some migrants who are eligible under the programme’s criteria. Testing is voluntary: migrants who are offered it decide whether to take it up. A migrant’s country of birth is one factor in the NHS’s decision about whether to offer post-entry Latent Tuberculosis (TB) testing. This information is relevant because the prevalence of TB differs greatly between countries. Other factors, which are outside the scope of the tool, include time since entry, age, history of TB or Latent TB, and previous testing.

When people register with a General Practitioner (GP) they provide up to 22 characters of free text on where they were born. Not all of these free-text records can be matched with a basic automated approach like a keyword lookup or a regular expression (REGEX). This is because records may contain a different name for a country, a misspelled country, or a sub-country geography.
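To illustrate why a basic lookup falls short, here is a minimal sketch of a regular-expression matcher. The pattern table and the function name `simple_match` are hypothetical and are not taken from the tool itself. A pattern can be written for a known alternative name, but a misspelling or a sub-country geography falls through.

```python
import re
from typing import Optional

# Hypothetical pattern table (illustrative only): standard name -> pattern.
COUNTRY_PATTERNS = {
    "India": re.compile(r"\bindia\b", re.IGNORECASE),
    "Pakistan": re.compile(r"\bpakistan\b", re.IGNORECASE),
    # A known alternative name can be listed explicitly.
    "Myanmar": re.compile(r"\b(myanmar|burma)\b", re.IGNORECASE),
}


def simple_match(record: str) -> Optional[str]:
    """Return a country only if exactly one pattern matches the record."""
    hits = [name for name, pattern in COUNTRY_PATTERNS.items()
            if pattern.search(record)]
    return hits[0] if len(hits) == 1 else None


print(simple_match("New Delhi, India"))  # direct country name
print(simple_match("Burma"))             # listed alternative name
print(simple_match("Pakstan"))           # misspelling: no match
print(simple_match("Punjab"))            # sub-country geography: no match
```

Records that return `None` here are the kind that previously needed manual review.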

To support effective data processing, we use large language models to disambiguate countries from these free-text records. Before using our tool, tens of thousands of these records had to be reviewed manually. The tool can transform these inputs into a standard list of countries much more quickly and with a high degree of agreement with manual processing (in which staff use search engines to check the location of a place), leaving a reduced number of records to review manually.

This list of free text is held on UKHSA’s secure network at all times. Each record is provided to a large language model through the Advanced Analytics team’s Janus system. The Janus system is a large language model Application Programming Interface (API) which sends encrypted text over UKHSA’s network to UKHSA’s High Performance Computer. This text is then decrypted and processed on one of UKHSA’s High Performance Computers before being re-encrypted and returned to the user with the processing complete.

The large language model is prompted (using few-shot prompting) to generate either an appropriate country from the free text or “unknown”. The country generated by the LLM is then disambiguated using the country converter Python package (which uses a REGEX-based approach). Currently, two LLMs are run on the data: Llama-3.3 and Stable-Beluga-2 (a Llama-2 variant fine-tuned by Stability AI). The reason for running two LLMs is that Llama-3.3 is more accurate, but Stable-Beluga-2 outputs “unknown” more often. We prefer the system to output “unknown” when uncertain, so the record can be reviewed manually. The record is manually reviewed if: either of the LLMs says the country is “unknown”, the LLMs propose different countries, or a regular expression (REGEX) match found in the raw record differs from the LLM output country.
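As an illustration of few-shot prompting, the sketch below assembles a prompt from a handful of worked examples followed by the new record. The instruction wording, the example pairs and the function name `build_prompt` are all assumptions made for illustration. They are not the prompt used in the tool.

```python
# Hypothetical worked examples (illustrative only, not the tool's prompt).
FEW_SHOT_EXAMPLES = [
    ("Lagos", "Nigeria"),
    ("Pakstan", "Pakistan"),
    ("xxxx", "unknown"),
]


def build_prompt(record: str) -> str:
    """Build a few-shot prompt ending where the model should answer."""
    lines = [
        "Give the country for each place of birth. "
        "Answer 'unknown' if you are not sure.",
        "",
    ]
    for text, country in FEW_SHOT_EXAMPLES:
        lines.append(f"Place of birth: {text}")
        lines.append(f"Country: {country}")
        lines.append("")
    # The new record goes last, with the answer left blank for the model.
    lines.append(f"Place of birth: {record}")
    lines.append("Country:")
    return "\n".join(lines)


print(build_prompt("Kabul"))
```

Including an example whose answer is “unknown” is one way to show the model that declining to answer is acceptable.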

The original list of records and their resulting disambiguated countries are then returned to the TB team.

2.2 - Scope

The tool helps to disambiguate geographies based on the geography freetext only. This disambiguation aims to reduce the amount of manual processing required to accurately assign country of birth to GP registration records.

The tool processes text only on place of birth inputted by the person when they registered at the GP practice. In this freetext box people normally provide a country, region, city, town or village name. No other data is provided to the model to help ascertain their country of birth (e.g. their name).

The tool is not designed to support any other part of the data processing, or the overall information structuring that supports the decision about who is invited for TB screening.

2.3 - Benefit

The principal benefit of the tool is a large reduction in the number of records which require manual review. This saves significant staff time which can be better spent on planning public health action.

2.4 - Previous process

Prior to the deployment of the tool, UKHSA staff created their own lookup of freetext to country pairs. Any records that could not be matched with this lookup were reviewed manually, using search engines to try to determine a country of birth. New freetext to country pairs that people manually identified would be added to the lookup. This process was very time consuming, and risked the lookup going out of date (albeit very slowly) as countries and borders change over time.

2.5 - Alternatives considered

As above. Prior to the LLM tool, a historic country lookup was used to identify countries of birth.

Other options considered included a purely REGEX-driven approach, which matches key parts of the text to classify it to a given country. This might work for variations of country names, but would not handle all misspellings and would definitely not handle sub-country geographies. A key-term search based on a gazetteer was also considered, but limited coverage, spelling variations and typos make this approach highly imperfect.

As such, an LLM is used as other language-processing approaches do not have the necessary flexibility to accurately assign countries to place of birth data across the entire world.

Tier 2 - Decision making Process

3.1 - Process integration

The vast majority of people give a single country in a recognisable form when they register with a GP. An internal analysis showed that, before deduplication, more than 95% of records can be matched by simple methods like REGEX. Even once deduplicated, around 50% of records can be matched by this type of straightforward approach (57% in the last batch processed).

However, the minority of harder to classify freetext entries still represent c.10,000 records per quarter. This is where the LLM-powered tool is used to disambiguate geographies into countries.

The tool does not directly make decisions; it supports the processing of one data input. The LLM system agrees with a human manual annotation or outputs unknown (meaning human review is necessary) in over 90% of cases. Because the tool is only applied to the c.5% of records that cannot be matched by simple methods, <0.5% of total records are expected to have a different country of birth allocated because of use of the tool.
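The <0.5% figure can be reproduced from two numbers given in this record: around 5% of records reach the tool, and the tool agrees with manual annotation or outputs unknown in over 90% of those. As a rough worked calculation, treating both figures as bounds:

```latex
\[
\text{share of all records changed}
\;<\;
\underbrace{0.05}_{\text{records needing the tool}}
\times
\underbrace{(1 - 0.90)}_{\text{neither agreeing nor flagged unknown}}
\;=\; 0.005 \;=\; 0.5\%
\]
```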

The most likely records to be changed are sub-national geographies which exist in more than one country, which might be allocated to one or other of those countries, or unknown, by a manual process versus the tool. Often either of those countries would be in the same group with respect to TB screening (for instance, Punjab is a region in both India and Pakistan, both of which are high TB prevalence countries).

The final decision about offering TB screening is based on more criteria than just country of birth. These additional factors, which are outside the scope of the tool, include time since entry, age, history of TB or Latent TB, and previous testing. Once offered by NHS England, this post-entry TB screening is optional.

3.2 - Provided information

Once the place of birth information has been processed by the LLM, a list of matched countries of birth is returned to the TB team as a CSV file of ID number and country of birth. They then check any disagreements or unknowns and integrate their manually curated data back into their other data.

3.3 - Frequency and scale of usage

The GP registration records are processed quarterly. Each quarter, approximately 10,000 unique records cannot be matched by simple automated methods and need the tool (previously these would have been reviewed manually). This is the number of records processed by the tool. The tool is used by the Advanced Analytics team on behalf of the TB Unit; no member of the public interacts with the tool directly.

3.4 - Human decisions and review

For the records where the LLMs and REGEX are internally consistent and neither LLM says unknown, a human reviews a sample of 100 records to check that accuracy in this batch is at least 90%, a lower bound on the accuracy based on the micro-F1 estimated in evaluation (Harris et al, 2024: https://arxiv.org/abs/2405.14766). The countries for these records are then assumed to be correct, and fed into the remainder of the decision process.
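The sample check described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function name `check_batch`, the `human_label` callback (standing in for the human reviewer) and the fixed random seed are hypothetical, and the sample size of 100 and threshold of 90% are the figures quoted in this record.

```python
import random
from typing import Callable, Sequence, Tuple


def check_batch(
    tool_outputs: Sequence[Tuple[str, str]],  # (free text, tool's country)
    human_label: Callable[[str], str],        # stand-in for the reviewer
    sample_size: int = 100,
    threshold: float = 0.90,
    seed: int = 0,
) -> Tuple[float, bool]:
    """Sample records, compare with human labels, return (accuracy, passed)."""
    if not tool_outputs:
        raise ValueError("no records to sample")
    rng = random.Random(seed)
    k = min(sample_size, len(tool_outputs))
    sample = rng.sample(list(tool_outputs), k)
    correct = sum(1 for text, country in sample
                  if human_label(text) == country)
    accuracy = correct / k
    return accuracy, accuracy >= threshold
```

If the check fails, the batch would need further investigation before its countries are used.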

Any uncertain records (where LLMs disagree with each other or produce a different country from a REGEX match on the raw text) are reviewed manually by the TB Team.

3.5 - Required training

All UKHSA staff are required to undertake mandatory information governance and privacy training. The Data Scientist who runs the tool has c.10 years of experience at the intersection of health and analytics and was involved in project ideation, model evaluation, tool development and continued operation. A second Data Scientist with detailed knowledge of the tool is also available to run it if needed. No specific training, past this expertise, is required to run the tool. The code is available to the whole Advanced Analytics team and is well documented.

3.6 - Appeals and review

N/A. Information does not go to the general public.

Tier 2 - Tool Specification

4.1.1 - System architecture

The Janus system is a large language model (LLM) application programming interface (API) which sends encrypted text over UKHSA’s network to UKHSA’s High Performance Computer. This text is then decrypted and processed on the UKHSA’s High Performance Computer by the LLM before being re-encrypted and returned to the user.

Specifically for this project, the LLM is prompted with few-shot prompting to generate either an appropriate country from the freetext or “unknown”. The list of countries generated by the LLM is then disambiguated using the country converter Python package (which uses a REGEX-based approach). Currently, two LLMs are run on the data: Llama-3.3 and Stable-Beluga-2 (a Llama-2 variant fine-tuned by Stability AI). The reason for running two LLMs is that Llama-3.3 is more accurate, but Stable-Beluga-2 outputs “unknown” more often. We prefer the system to output “unknown” when uncertain, so the record can be reviewed manually.

Alongside this LLM processing, the Python country converter package is used on the raw text to get a REGular EXpression (REGEX) match (where possible).

The record is manually reviewed if: either of the LLMs says the country is “unknown”, the LLMs propose different countries, or a REGEX match is found that differs from the LLM output country.
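The review rule above can be expressed as a short function. This is a sketch, not the tool's code: the function and argument names are hypothetical, and it assumes all three inputs have already been normalised to the same standard country names, with `None` meaning no REGEX match was found.

```python
from typing import Optional

UNKNOWN = "unknown"


def needs_manual_review(
    llm_a: str,                  # country from the first LLM
    llm_b: str,                  # country from the second LLM
    regex_match: Optional[str],  # REGEX match on raw text, or None
) -> bool:
    """Return True if any of the three review conditions holds."""
    # 1. Either LLM says the country is unknown.
    if UNKNOWN in (llm_a, llm_b):
        return True
    # 2. The LLMs propose different countries.
    if llm_a != llm_b:
        return True
    # 3. A REGEX match exists and differs from the LLM output.
    if regex_match is not None and regex_match != llm_a:
        return True
    return False


print(needs_manual_review("India", "India", None))
print(needs_manual_review("India", "unknown", None))
```

A record passes without review only when both LLMs agree on a country and any REGEX match agrees with them.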

See Harris et al (2024) https://arxiv.org/abs/2405.14766 for full details of the model deployment.

4.1.2 - Phase

Production

4.1.3 - Maintenance

Any time major updates are made to the Janus system (e.g. a new model is launched) the model performance is evaluated on the validation sets (benchmarks) set out in Harris et al (2024) https://arxiv.org/abs/2405.14766, which include the validation set for this project. That means we have in-deployment statistical performance measures to ensure the Janus system (and models within it) are working as expected. We use greedy sampling to increase the probability of the system performing deterministically for a given record, which minimises the probability of unexpected changes. The model deployment is not changed without testing, minimising the chance of model drift. The Data Scientist running the tool monitors for data drift. We also manually review at least 100 records every time the model is run to validate that the statistical performance is approximately that found in previous evaluations.
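To show why greedy sampling favours repeatable output, the toy sketch below contrasts it with random sampling over a single made-up probability distribution. The probabilities and function names are illustrative assumptions. A real LLM makes this choice for every token it generates.

```python
import random
from typing import Dict


def greedy_pick(probs: Dict[str, float]) -> str:
    """Greedy choice: always take the highest-probability option."""
    return max(probs, key=probs.get)


def sampled_pick(probs: Dict[str, float], rng: random.Random) -> str:
    """Random sampling: options are drawn in proportion to probability."""
    options = list(probs)
    weights = [probs[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]


# Made-up distribution for illustration only.
probs = {"India": 0.6, "Pakistan": 0.3, "unknown": 0.1}

print([greedy_pick(probs) for _ in range(5)])   # same answer each time
rng = random.Random()
print([sampled_pick(probs, rng) for _ in range(5)])  # may vary between runs
```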

4.1.4 - Models

The model used in the tool is the Meta Llama-3.3 Large Language Model (LLM). Stability AI’s fine-tuned Llama-2 variant, Stable-Beluga-2, is run as a backup.

https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/

https://huggingface.co/stabilityai/StableBeluga2

Tier 2 - Model Specification

4.2.1 - Model name

Meta Llama-3.3 https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/

4.2.2 - Model version

3.3

4.2.3 - Model task

The model is a Large Language Model (LLM) used to read, interpret, and generate text in a human-like manner.

4.2.4 - Model input

A record of up to 22 characters of freetext entered by a user about their place of birth.

4.2.5 - Model output

The assessed country of birth described in the freetext according to the LLM.

4.2.6 - Model architecture

Transformer https://arxiv.org/abs/2407.21783

4.2.7 - Model performance

Harris et al (2024) https://arxiv.org/abs/2405.14766 sets out the performance of this task (called Country Disambiguation in the pre-print).

It shows that Llama-3.3 achieves a micro-F1 of 0.92. Stable-Beluga-2 achieves a micro-F1 of 0.86. While not quantified in this pre-print, we find that Stable-Beluga-2 selects “Unknown” more often, which supports more human intervention in uncertain records. We also find that Stable-Beluga-2 gets some records correct that Llama-3.3 gets incorrect.

So the combined performance of the tool should be higher than 0.92, even before the human intervenes to decide on uncertain records.

Please note that all of these scores only refer to the c.5% of records that cannot be assigned to a country trivially via REGEX or key-word searches.
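For readers unfamiliar with the metric, a micro-F1 score of the kind quoted above can be computed by pooling true positives, false positives and false negatives across all countries. The sketch below is a generic implementation with made-up labels. How the pre-print treats “unknown” outputs is not stated in this record, so that detail is not modelled here.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1 over single-label predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp += 1   # correct: a true positive for that country
        else:
            fp += 1   # false positive for the predicted country
            fn += 1   # false negative for the true country
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels for illustration: three of four predictions are correct.
truth = ["India", "India", "Pakistan", "Nigeria"]
pred = ["India", "Pakistan", "Pakistan", "Nigeria"]
print(micro_f1(truth, pred))
```

When every record receives exactly one predicted country, this pooled score equals the proportion of records predicted correctly.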

4.2.8 - Datasets

A manually annotated subset of the Flag 4 GP Registration Data - Country of Birth variable

4.2.9 - Dataset purposes

The large language models we use are never trained on the dataset. The dataset is only used to test how well the model replicates human annotations (leaving the model state unaffected).

Tier 2 - Data Specification

4.3.1 - Source data name

Flag 4 GP registration data

4.3.2 - Data modality

Text

4.3.3 - Data description

Geography - names of places, country of birth provided at the point of registering at the GP

4.3.4 - Data quantities

Just the country of birth entry for 8,000 records, along with manually allocated countries. 1,600 records were used for prompt optimisation (train) and 6,400 were used to quantify accuracy (test).

4.3.5 - Sensitive attributes

None

4.3.6 - Data completeness and representativeness

N/A

4.3.7 - Source data URL

N/A

4.3.8 - Data collection

NHS England dataset provided for testing and treating new entrant migrants for latent TB infection

4.3.9 - Data cleaning

Deduplicated

4.3.10 - Data sharing agreements

N/A - no personally identifiable information (PII) is shared between teams

4.3.11 - Data access and storage

The Flag 4 GP registration data arrives in a dedicated NHS MESH inbox, accessible to LTBI (Latent TB Infection) team members only. The individual DAT data files are then downloaded and stored on a secure UKHSA drive, accessible only by the LTBI team. The individual DAT files are combined into a single dataset for analysis and cleaning. The Latent TB Infection Programme started in 2015 and UKHSA have permissions to store data for 20 years. MOU, CAP REG and DPIAs are in place and assigned to an asset risk register. The LTBI Team, database developer and Caldicott Guardian are responsible for its storage. Country of birth is deduplicated in this dataset and this variable alone is shared with the Advanced Analytics team.

Tier 2 - Risks, Mitigations and Impact Assessments

5.1 - Impact assessment

We do not process any personally identifiable information (PII) using this tool. We only process a country name or other geography name, which is not a person’s address, so it is not PII and is outside the scope of UK GDPR. This processing has, however, been added to the existing DPIA for completeness.

We have performed various internal analyses, quantifying the number of records where this tool contributes to their data processing (c.5%) and the potential number of records where the LLM-driven tool may vary from a manual process (<0.5%).

5.2 - Risks and mitigations

The main risk is that the tool will output incorrect countries for a given input. This could occur due to:

  1. Poor model performance
  2. Model drift (where a previously working model stops performing)
  3. Data drift (where data changes from the data the model was validated on)

In order to mitigate poor performance we have a comprehensive validation set used to assess the LLMs used in the tool, to quantify the level of performance on known data.

To mitigate model drift, we rerun this validation whenever the model deployment code is changed. We also run a REGEX based approach as an additional check, and run a second LLM in order to identify additional uncertain records for manual review. Finally, we validate a random sample of records for which the tool has generated a country to ensure performance is in line with expectations.

We monitor data drift by manually reviewing a portion of the new data each quarter to ensure the data is consistent with previous rounds.

Updates to this page

Published 25 September 2025