Organisational Listening Tool

The Organisational Listening Tool (OLT) collates, classifies and displays customer feedback from HMRC digital, webchat and telephony services.

From: Cabinet Office, Department for Science, Innovation and Technology and Government Digital Service
Published: 3 June 2026

Organisation: HM Revenue & Customs
Organisation type: Non-ministerial department
Function: General public services
Phase: Production
Region: UK
Date published: 3 June 2026
ATRS version: v4.0

1. Summary

1 - Name

Organisational Listening Tool

2 - Description

The Organisational Listening Tool (OLT) collates, classifies and displays customer feedback from HMRC digital, webchat and telephony services. The tool allows HMRC to listen to the voice of the customer and gather insight into their experience throughout the different channels of communication. Feedback and classified results will help service owners make enhancements based on comments, user interactions and summaries that are presented in the OLT.

3 - Website URL

N/A

4 - Contact email

atrs-team@hmrc.gov.uk

Tier 2 - Owner and Responsibility

1.1 - Organisation or department

HMRC

1.2 - Team

Data Science / Data Exploitation

1.3 - Senior responsible owner

Principal Data Scientist

1.4 - Third party involvement

1.4.1 - Third party

N/A

1.4.2 - Companies House Number

N/A

1.4.3 - Third party role

N/A

1.4.4 - Procurement procedure type

N/A

1.4.5 - Third party data access terms

N/A

Tier 2 - Description and Rationale

2.1 - Detailed description

Feedback data, including scores and text, are processed. Comments are then classified using a set of classification models, which varies with the service the feedback relates to. Each model answers a question such as “is this piece of customer feedback about authentication?”. Comments are also grouped using unsupervised modelling techniques, with the intention of capturing topics that fall outside the supervised labels, for example “Issues with debit card payments”. Data are then displayed in an R Shiny dashboard, alongside summarised scores, and can be viewed by users who have been granted access through a role request.

2.2 - Benefits

OLT monitors HMRC services by collecting and summarising feedback, which is ultimately used to improve the quality of services. Outputs from labelled data can be used for thematic analysis of comments across the individual channels. Microservice owners can also filter outputs to inspect satisfaction, neteasy and able to do scores, or classified survey comments, for their particular microservice. OLT has helped HMRC to improve customer service by allowing it to respond quickly to customer issues.

2.3 - Previous process

N/A

Tier 2 - Deployment Context

3.1 - Integration into broader operational process

This processing enhances our insight into customer experience of HMRC services, helping us to improve our service offering and understand the impact of changes. This benefits HMRC customers with a better customer service experience.

For HMRC the benefits are:
- Insight into customer feedback that would otherwise not be possible, as we cannot read and digest 30,000 free text comments per week.
- The results are used by HMRC subject matter experts to inform their decision making about our services.

3.2 - Human review

Suggestions and bug reports are submitted by users via the tool, and development of the tool is discussed with a group of key users fortnightly. Unsupervised models are reviewed and refreshed every 3 months, and new topics are added if new distinct clusters of documents are formed. Labelled data has been used to review performance of supervised models. Labelled data has been created by subject matter experts, inspecting comments and assigning appropriate theme labels.

3.3 - Frequency and scale of usage

250 views over the last 30 days, from 185 distinct users. Over the past week, 100,000 pieces of customer feedback were received, 36% of which included a text comment.

3.4 - Required training

N/A

3.5 - Appeals and review

N/A

2.4 - Alternatives considered

N/A

Tier 2 - Tool Specification

4.1.1 - System architecture

Data are extracted from HMRC data sources. Data are then pre-processed in a data pipeline, which includes removal of personally identifiable information, classification via a supervised machine learning process and topic modelling via unsupervised clustering. Topic labels and classifier outputs are added to the data, the data are summarised, and the summarised results are then hosted on an R Shiny dashboard.

4.1.2 - System-level input

Text data, customer satisfaction, easy to do, neteasy, channel, service and microservice.

4.1.3 - System-level output

Labelled themes, topics and summarised scores.

4.1.4 - Maintenance

Supervised: Currently in the process of reworking the modelling, given new themes and techniques. Unsupervised: BERTopic models are retrained every 3 months, with scope to change parameters every retrain.

4.1.5 - Models

Supervised text classification of feedback comments uses svmLinear3, rpart2, gbm and random forest. Unsupervised topic modelling uses BERTopic with multi-qa-MiniLM-L6-cos-v1 (self-hosted), UMAP and HDBSCAN. Generation of topic labels from BERTopic uses Meta-Llama-3-70B (self-hosted).

Tier 2 - Model Specification: Supervised text classification (1)

4.2.1 - Model name

Supervised text classification

4.2.2 - Model version

9.3.0

4.2.3 - Model task

Classification of whether or not a comment falls under a certain theme (for example wait time, authentication or usability).

4.2.4 - Model input

Text data, customer satisfaction, easy to do, neteasy, channel, service and microservice.

4.2.5 - Model output

Theme labels which are added to feedback data.

4.2.6 - Model architecture

We use a number of models dependent on service for the purpose of multi-classification. We have 3 different groups of models, relevant to Business Tax Account, Personal Tax Account and Telephony, as each different service will have different feature labels and input data. Other services will also be assigned a model from these groups.

For each theme label, a prediction is made: “does this comment relate to this theme?”. If a positive prediction is made, the theme label is added to the comment.

Features are derived from model input parameters, in addition to tokenised text. Predictions are then made on new data each week.
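The per-theme prediction step can be sketched as follows. This is a minimal illustration only: the keyword-based classifiers are toy stand-ins for the trained models, and the names `keyword_classifier`, `THEME_MODELS` and `label_comment`, as well as the themes and keywords, are assumptions made for the example rather than part of the tool.

```python
from typing import Callable, Dict, List

# A theme classifier answers one question:
# "does this comment relate to this theme?"
ThemeClassifier = Callable[[str], bool]


def keyword_classifier(keywords: List[str]) -> ThemeClassifier:
    """Toy stand-in for a trained model: positive if any keyword appears."""
    lowered = [keyword.lower() for keyword in keywords]

    def predict(comment: str) -> bool:
        text = comment.lower()
        return any(keyword in text for keyword in lowered)

    return predict


# Hypothetical themes and keywords, for illustration only.
THEME_MODELS: Dict[str, ThemeClassifier] = {
    "authentication": keyword_classifier(["sign in", "password", "log in"]),
    "wait time": keyword_classifier(["waiting", "on hold", "queue"]),
}


def label_comment(comment: str, models: Dict[str, ThemeClassifier]) -> List[str]:
    """Run every theme classifier independently and keep the positive themes."""
    return [theme for theme, predict in models.items() if predict(comment)]
```

Because each theme has its own independent classifier, a single comment can receive several theme labels, or none.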

4.2.7 - Model performance

A labelled dataset has been created for each model family, in which comments have been tagged with theme labels by subject matter experts. These data can then be used to assess model performance.

Distribution of theme labels has been compared, and analysis of features has been carried out. Highly correlated features have been removed. For different candidate models, we have compared metrics of accuracy, Cohen’s kappa, precision, recall and F1 score.
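For reference, the metrics named above can be computed for a single theme classifier from its true and predicted labels. The sketch below uses the standard definitions; the function name `binary_metrics` and the 0/1 label encoding are assumptions for illustration, and this is not the evaluation code used for the tool.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and Cohen's kappa for 0/1 labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa corrects observed agreement (accuracy) for the agreement
    # expected by chance, given how often each label occurs.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    p_expected = p_yes + p_no
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}
```

Kappa is useful alongside accuracy here because most themes apply to only a minority of comments, so a classifier that always predicts "no" can still score a high accuracy.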

4.2.8 - Datasets and their purposes

Datasets used for model training are detailed in development data.

Tier 2 - Model Specification: Unsupervised topic modelling (2)

4.2.1 - Model name

Unsupervised topic modelling

4.2.2 - Model version

9.3.0

4.2.3 - Model task

BERTopic: Represent feedback comments as high-dimensional vectors, reduce their dimensionality and subsequently cluster them. Large language model: Generate topic labels based on representative words and comments from each cluster.

4.2.4 - Model input

Text feedback data

4.2.5 - Model output

A list of topics, with genAI labels. Comments are then assigned a single topic.

4.2.6 - Model architecture

Process is further described in https://maartengr.github.io/BERTopic/algorithm/algorithm.html

Text data is used to create document embeddings, which represent the text data numerically. We then reduce the dimensionality of these document embeddings, and create clusters (topics). We then take representative documents and words from these topics, and insert them into a prompt to form topic labels using generative AI.
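The final step, inserting representative words and documents into a prompt, might look something like the sketch below. The wording of the real prompt is not given in this record, so `PROMPT_TEMPLATE`, `build_label_prompt` and the `max_documents` limit are all hypothetical.

```python
from typing import Sequence

# Hypothetical template: the wording of the real prompt is not published here.
PROMPT_TEMPLATE = (
    "The following customer feedback comments belong to one topic:\n"
    "{documents}\n"
    "The topic is described by these keywords: {keywords}\n"
    "Give a short label for this topic."
)


def build_label_prompt(keywords: Sequence[str],
                       documents: Sequence[str],
                       max_documents: int = 3) -> str:
    """Fill the template with a topic's representative words and documents."""
    # Only a few representative documents are included to keep the prompt short.
    document_lines = "\n".join(f"- {doc}" for doc in documents[:max_documents])
    return PROMPT_TEMPLATE.format(documents=document_lines,
                                  keywords=", ".join(keywords))
```

The resulting string would then be sent to the self-hosted language model, and the response stored as the topic label.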

Every 6 months, the model is updated by forming topics based on new feedback comments, and then using the merge models process (https://maartengr.github.io/BERTopic/getting_started/merge/merge.html) to see if there are any new distinct clusters.

4.2.7 - Model performance

Evaluating topic models can be difficult because evaluation is somewhat subjective. Feedback was taken from users before full model release, and topic labels can be edited based on user feedback.

In model development, visualisations and reviews of data were used to assess model suitability, including a document map to inspect formed clusters, the percentage of comments that were unclassified and topic size.
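Two of these checks, the percentage of unclassified comments and topic size, can be derived directly from the per-comment topic assignments. The sketch below assumes the BERTopic convention that outlier documents are assigned topic -1; the function name `summarise_assignments` is illustrative.

```python
from collections import Counter
from typing import Dict, Sequence

OUTLIER_TOPIC = -1  # BERTopic convention for documents left unclustered


def summarise_assignments(topics: Sequence[int]) -> Dict[str, object]:
    """Return the share of unclassified comments and the size of each topic."""
    if len(topics) == 0:
        raise ValueError("topics must not be empty")

    sizes = Counter(topics)
    # Remove the outlier bucket so it is not reported as a real topic.
    unclassified = sizes.pop(OUTLIER_TOPIC, 0)

    return {
        "percent_unclassified": 100.0 * unclassified / len(topics),
        "topic_sizes": dict(sizes),
    }
```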

4.2.8 - Datasets and their purposes

The corpus used to create the BERTopic model is detailed in development data.

2.4.3. Development Data

4.3.1 - Development data description

For unsupervised model training, the entire 2-year dataset is used to generate topics, using a corpus that spans all services.

For supervised model training, we have 3 different training sets, for Personal Tax Account, Business Tax Account and Telephony. The data described here relate to the latest set of models to be trained (Telephony); however, the process and data are similar across all three.

4.3.2 - Data modality

Text Data

4.3.3 - Data quantities

Labelled supervised data: 5,000 records, with 1,000 records in the training split. Data for topic modelling: 4,000,000 records.

4.3.4 - Sensitive attributes

The data are processed in R and Python, which includes our PII process: any survey text that contains anything that could be PII is redacted in full. We do this by looking for patterns that match:
- NINOs
- Telephone numbers
- Addresses
- Postcodes
- Money amounts
- Any sort of identifier (viz. a sequence of six or more numbers in a row, which would include any UTR, sort code, or bank account number)
- HTML tags

Any survey comment that matches any one of these patterns is removed in full. This is a cautious approach and, because the instructions about not including PII are clear and the vast majority of customers comply, it proves to be effective.
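As an illustration of this pattern-matching approach, the sketch below checks a comment against a handful of regular expressions and drops the whole comment on any match. The patterns are simplified assumptions written for this example, not the expressions used in the tool, and address detection is omitted because it does not reduce to a short pattern.

```python
import re
from typing import List, Optional

# Simplified, illustrative patterns. They are deliberately broad: a false
# positive only costs one comment, whereas a miss could expose PII.
PII_PATTERNS = {
    "nino": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]?\b", re.IGNORECASE),
    "telephone": re.compile(r"(?:\+44\s?|\b0)(?:\d\s?){9,10}"),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
    "money": re.compile(r"£\s?\d"),
    # Six or more digits in a row covers UTRs, sort codes and account numbers.
    "long_number": re.compile(r"\d{6,}"),
    "html_tag": re.compile(r"<[^>]+>"),
}


def find_pii(comment: str) -> List[str]:
    """Return the names of every pattern that matches the comment."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(comment)]


def redact_comment(comment: str) -> Optional[str]:
    """Return None if any pattern matches, otherwise the comment unchanged."""
    return None if find_pii(comment) else comment
```

Returning `None` rather than masking the matched span mirrors the cautious approach described above: the whole comment is dropped, so nothing adjacent to the match can leak.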

4.3.5 - Data completeness and representativeness

In producing the models, each theme is treated separately. That is, we end up with a separate classifier for each theme. This means we are always working with a different ratio of data in every case. Also, since a single comment can have multiple classifications the total fraction of the data in each theme sums to greater than 1. To deal with this we can use a combination of undersampling the majority class and oversampling the minority class to achieve a good sample balance while maximising the amount of data fed into the model. To achieve this we will use the ovun.sample function from the ROSE package.
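The combined resampling idea can be sketched in Python as below. This is a loose standard-library analogue written for illustration, not a port of `ovun.sample`; the names `balance_both`, `n_total` and `p_minority` are assumptions for the example.

```python
import random
from typing import List, Sequence, Tuple

Record = Tuple[str, int]  # (comment text, label: 1 if the theme applies, else 0)


def balance_both(records: Sequence[Record], n_total: int,
                 p_minority: float = 0.5, seed: int = 0) -> List[Record]:
    """Oversample the minority class and undersample the majority class."""
    rng = random.Random(seed)
    positives = [r for r in records if r[1] == 1]
    negatives = [r for r in records if r[1] == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    # Whichever class has fewer records is treated as the minority.
    minority, majority = sorted((positives, negatives), key=len)
    n_minority = round(n_total * p_minority)
    n_majority = n_total - n_minority
    if n_majority > len(majority):
        raise ValueError("n_total is too large to undersample the majority class")

    # Oversampling draws with replacement, so minority records may repeat.
    sampled_minority = rng.choices(minority, k=n_minority)
    # Undersampling draws without replacement, so majority records stay unique.
    sampled_majority = rng.sample(majority, k=n_majority)

    balanced = sampled_minority + sampled_majority
    rng.shuffle(balanced)
    return balanced
```

Because each theme has its own classifier and its own class ratio, a step like this would be run once per theme.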

The unsupervised modelling process uses the entire dataset for the initial model training; new records are included as part of the model retraining process.

4.3.6 - Data cleaning

Personal data can be contained in free text responses. We attempt to remove personally identifiable information (PII) by detection using regex, and then redacting the entire comment if PII is detected.

Text used in model training is also tokenised, stemmed and lemmatised as part of pre-processing.

4.3.7 - Data collection

Exit survey data originating from Telephony, Digital, HMRC app, webchat and Digital Assistant services.

4.3.8 - Data access and storage

The customer exit surveys are collected through HMRC’s digital systems. Once we extract those data from the source system, all subsequent processing takes place on our secure data analytics platform (DAP). The data are stored in the DAP in a folder controlled through a distribution list.

The distribution list that controls access to the data currently has 10 members. No other DAP users can read or modify those data.

4.3.9 - Data sharing agreements

N/A

Tier 2 - Operational Data Specification

4.4.1 - Data sources

Weekly uploads of exit survey data originating from Telephony, Digital, HMRC app, webchat and Digital Assistant services.

4.4.2 - Sensitive attributes

4.4.3 - Data processing methods

4.4.4 - Data access and storage

Only two years’ data are stored: each week we delete any records older than that as part of our data processing steps.
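The weekly retention step amounts to filtering on a cutoff date. A minimal sketch follows, assuming each record carries a submission date under a hypothetical `submitted` key and approximating two years as 730 days.

```python
from datetime import date, timedelta
from typing import Dict, List

RETENTION_DAYS = 2 * 365  # assumption: "two years" approximated as 730 days


def drop_expired(records: List[Dict], today: date) -> List[Dict]:
    """Keep only records submitted within the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    # "submitted" is a hypothetical field name used for this example.
    return [record for record in records if record["submitted"] >= cutoff]
```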

The distribution list that controls access to the data currently has 10 members. No other DAP users can read or modify those data. The data are not shared directly, although the results are summarised in the OLT webapp. This is only accessible by users with the required SRS role, and then only from the STRIDE network.

4.4.5 - Data sharing agreements

N/A

Tier 2 - Risks, Mitigations and Impact Assessments

5.1 - Impact assessments

The latest DPIA was completed in May 2025. Otherwise, personal data, including protected characteristics, are not used as part of this algorithm.

5.2 - Risks and mitigations

One of the key risks in the tool is personally identifiable information (PII), for example National Insurance numbers, being submitted in comment data. We mitigate this risk by attempting to detect PII in pre-processing and redacting any comments in which it is detected.

Data access is controlled via a distribution list; no other users can read or modify those data. The data are not shared directly, although the results are summarised in the OLT webapp. This is only accessible by users with the required service role, and then only from the STRIDE network.
