Organisational Listening Tool
The Organisational Listening Tool (OLT) collates, classifies and displays customer feedback from HMRC digital, webchat and telephony services.
1. Summary
1 - Name
Organisational Listening Tool
2 - Description
The Organisational Listening Tool (OLT) collates, classifies and displays customer feedback from HMRC digital, webchat and telephony services. The tool allows HMRC to listen to the voice of the customer and gather insight into their experience throughout the different channels of communication. Feedback and classified results will help service owners make enhancements based on comments, user interactions and summaries that are presented in the OLT.
3 - Website URL
N/A
4 - Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
HMRC
1.2 - Team
Data Science / Data Exploitation
1.3 - Senior responsible owner
Principal Data Scientist
1.4 - Third party involvement
No
1.4.1 - Third party
N/A
1.4.2 - Companies House Number
N/A
1.4.3 - Third party role
N/A
1.4.4 - Procurement procedure type
N/A
1.4.5 - Third party data access terms
N/A
Tier 2 - Description and Rationale
2.1 - Detailed description
Feedback data, including scores and free-text comments, are processed and then classified using a set of classification models. The set of models depends on the service for which the feedback is given; each model answers a question such as “is this piece of customer feedback related to authentication?”. Comments are also grouped using unsupervised modelling techniques, with the intention of capturing topics that may exist outside of the supervised labels, for example “Issues with debit card payments”. Data are then displayed in an R Shiny dashboard, alongside summarised scores, and can be accessed by users who have been granted access through a role request.
2.2 - Benefits
OLT monitors HMRC services by listening to and summarising feedback, which is ultimately used to improve the quality of those services. Outputs from labelled data can be used for thematic analysis of comments across the individual channels. Microservice owners can also filter outputs to inspect satisfaction, neteasy and able to do scores, or classified survey comments, for their particular microservice. OLT has helped to improve customer service by allowing HMRC to respond quickly to customer issues.
2.3 - Previous process
N/A
Tier 2 - Deployment Context
3.1 - Integration into broader operational process
This processing enhances our insight into customer experience of HMRC services, helping us to improve our service offering and understand the impact of changes. This benefits HMRC customers with a better customer service experience.
For HMRC the benefits are:
- Insight into customer feedback that would otherwise not be possible, as we cannot read and digest 30,000 free text comments per week.
- The results are used by HMRC subject matter experts to inform their decision making about our services.
3.2 - Human review
Suggestions and bug reports are submitted by users via the tool, and development of the tool is discussed with a group of key users fortnightly. Unsupervised models are reviewed and refreshed every 3 months, and new topics are added if new distinct clusters of documents are formed. Labelled data has been used to review performance of supervised models. Labelled data has been created by subject matter experts, inspecting comments and assigning appropriate theme labels.
3.3 - Frequency and scale of usage
250 views and 185 distinct users over the last 30 days. 100,000 pieces of customer feedback were received over the past week, 36% of which included a text comment.
3.4 - Required training
N/A
3.5 - Appeals and review
N/A
2.4 - Alternatives considered
N/A
Tier 2 - Tool Specification
4.1.1 - System architecture
Data are extracted from HMRC data sources and pre-processed in a data pipeline, which includes removal of personally identifiable information, classification via supervised machine learning and topic modelling via unsupervised clustering. Topic labels and theme classifications are added to the data, and the summarised results are hosted on an R Shiny dashboard.
4.1.2 - System-level input
Text data, customer satisfaction, easy to do, neteasy, channel, service and microservice.
4.1.3 - System-level output
Labelled themes, topics and summarised scores.
4.1.4 - Maintenance
Supervised: We are currently reworking the modelling to reflect new themes and techniques. Unsupervised: BERTopic models are retrained every 3 months, with scope to change parameters at each retrain.
4.1.5 - Models
Supervised text classification of feedback comments uses svmLinear3, rpart2, gbm and random forest. Unsupervised topic modelling uses BERTopic with multi-qa-MiniLM-L6-cos-v1 (self-hosted), UMAP and HDBSCAN. Generation of topic labels from BERTopic uses Meta-Llama-3-70B (self-hosted).
Tier 2 - Model Specification: Supervised text classification (1)
4.2.1 - Model name
Supervised text classification
4.2.2 - Model version
9.3.0
4.2.3 - Model task
Classification of whether or not a comment falls under a certain theme (e.g. wait time, authentication, usability etc.)
4.2.4 - Model input
Text data, customer satisfaction, easy to do, neteasy, channel, service and microservice.
4.2.5 - Model output
Theme labels which are added to feedback data.
4.2.6 - Model architecture
We use a number of models, dependent on service, for the purpose of multi-label classification. We have 3 different groups of models, relevant to Business Tax Account, Personal Tax Account and Telephony, as each service has different feature labels and input data. Other services are also assigned a model from these groups.
For each theme label, a prediction is made: “does this comment relate to this theme?” If a positive prediction is made, the theme label is added.
Features are derived from model input parameters, in addition to tokenised text. Predictions are then made on new data each week.
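As an illustration only, the per-theme yes/no decision described above can be sketched in Python. This is not the production code: the theme names are taken from examples in this record, while `assign_themes`, `keyword_classifier` and the keyword lists are hypothetical stand-ins for the trained classifiers.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained binary classifier: it takes a comment
# and returns True if the comment relates to the theme.
ThemeClassifier = Callable[[str], bool]


def keyword_classifier(keywords: List[str]) -> ThemeClassifier:
    """Build a toy classifier that fires when any keyword appears in the text."""
    def predict(comment: str) -> bool:
        text = comment.lower()
        return any(keyword in text for keyword in keywords)
    return predict


def assign_themes(comment: str, classifiers: Dict[str, ThemeClassifier]) -> List[str]:
    """Ask each theme classifier its yes/no question and keep the positive themes."""
    return [theme for theme, predict in classifiers.items() if predict(comment)]


# Illustrative themes and keywords only (assumptions, not the real feature set).
classifiers = {
    "authentication": keyword_classifier(["sign in", "password", "log in"]),
    "wait time": keyword_classifier(["waiting", "on hold", "queue"]),
    "usability": keyword_classifier(["confusing", "hard to find"]),
}

labels = assign_themes("I was on hold for an hour and then could not log in", classifiers)
print(labels)
```

Because every theme is decided independently, a single comment can receive several theme labels or none at all.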
4.2.7 - Model performance
A labelled dataset has been created for each model family, in which comments have been tagged with theme labels by subject matter experts. These data can then be used to assess model performance.
Distribution of theme labels has been compared, and analysis of features has been carried out. Highly correlated features have been removed. For different candidate models, we have compared metrics of accuracy, Cohen’s kappa, precision, recall and F1 score.
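For reference, the metrics named above can be computed from a binary confusion matrix for a single theme. The sketch below is a minimal Python illustration under that assumption; the function name `binary_metrics` and the example labels are hypothetical and do not come from the OLT evaluation.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and Cohen's kappa for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    # Cohen's kappa compares observed agreement with the agreement expected
    # by chance, given how often each label occurs in truth and prediction.
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Made-up expert labels and model predictions for one theme.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Kappa is useful alongside accuracy here because rare themes can reach high accuracy by predicting "no" for almost every comment.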
4.2.8 - Datasets and their purposes
Datasets used for model training are detailed in the Development Data section.
Tier 2 - Model Specification: Unsupervised topic modelling (2)
4.2.1 - Model name
Unsupervised topic modelling
4.2.2 - Model version
9.3.0
4.2.3 - Model task
BERTopic: Represent feedback comments as high-dimensional vectors, reduce dimensionality and subsequently cluster. Large language model: Generate topic labels based on representative words and comments of clusters.
4.2.4 - Model input
Text feedback data
4.2.5 - Model output
A list of topics, with labels produced by generative AI. Each comment is then assigned a single topic.
4.2.6 - Model architecture
The process is further described in https://maartengr.github.io/BERTopic/algorithm/algorithm.html
Text data are used to create document embeddings, which represent the text numerically. We then reduce the dimensionality of these document embeddings and create clusters (topics). We then take representative documents and words from these topics and insert them into a prompt to form topic labels using generative AI.
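The final step, inserting representative words and documents into a prompt, can be illustrated with a short Python sketch. The prompt wording, the function name `build_label_prompt` and the example inputs are all assumptions made for illustration; this record does not state the actual prompt sent to the language model.

```python
from typing import List


def build_label_prompt(keywords: List[str], documents: List[str]) -> str:
    """Fill a hypothetical prompt template with a topic's keywords and comments."""
    doc_lines = "\n".join(f"- {doc}" for doc in documents)
    return (
        "I have a topic that contains the following customer feedback comments:\n"
        f"{doc_lines}\n"
        f"The topic is described by these keywords: {', '.join(keywords)}\n"
        "Based on this information, give a short label for the topic."
    )


# Made-up representative words and comments for one cluster.
prompt = build_label_prompt(
    keywords=["payment", "debit", "card", "declined"],
    documents=["My debit card payment was declined twice",
               "Could not pay by card on the website"],
)
print(prompt)
```

The returned string would then be passed to the self-hosted language model, and the model's reply would be stored as the topic label.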
Every 6 months, the model is updated by forming topics based on new feedback comments, and then using the merge models process (https://maartengr.github.io/BERTopic/getting_started/merge/merge.html) to see if there are any new distinct clusters.
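The idea behind checking for new distinct clusters can be sketched without the BERTopic library. In the simplified Python illustration below, a topic from the newer model counts as new when its embedding is not sufficiently similar to any existing topic embedding. The vectors, topic names, `find_new_topics` helper and the 0.7 threshold are assumptions for illustration and are not the values used in OLT.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_new_topics(existing: Dict[str, Sequence[float]],
                    candidates: Dict[str, Sequence[float]],
                    min_similarity: float = 0.7) -> List[str]:
    """Return candidate topics whose best match among existing topics is too weak."""
    new_topics = []
    for name, vector in candidates.items():
        best = max(cosine_similarity(vector, known) for known in existing.values())
        if best < min_similarity:
            new_topics.append(name)
    return new_topics


# Toy 3-dimensional topic embeddings (real embeddings have far more dimensions).
existing = {"debit card payments": [0.9, 0.1, 0.0],
            "sign-in problems": [0.0, 0.95, 0.1]}
candidates = {"card payment declined": [0.85, 0.2, 0.05],
              "app crashes": [0.05, 0.1, 0.95]}

new_topics = find_new_topics(existing, candidates)
print(new_topics)
```

A higher threshold makes it easier for a candidate to count as new, so it tends to add more topics; a lower threshold folds more candidates into existing topics.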
4.2.7 - Model performance
Evaluating topic models can be difficult because evaluation is somewhat subjective. Feedback was taken from users before full model release, and topic labels can be edited based on user feedback.
In model development, visualisations and reviews of data were used to assess model suitability, including a document map to inspect formed clusters, the percentage of comments that were unclassified and topic size.
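Two of these checks, the percentage of unclassified comments and the topic sizes, are simple to compute from the per-comment topic assignments. The Python sketch below assumes the BERTopic convention that outlier (unclassified) comments are assigned topic -1; the function name `summarise_topics` and the example assignments are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def summarise_topics(assignments: Sequence[int],
                     outlier_label: int = -1) -> Tuple[float, Dict[int, int]]:
    """Return the percentage of unclassified comments and the size of each topic."""
    counts = Counter(assignments)
    total = len(assignments)
    unclassified = counts.pop(outlier_label, 0)
    pct_unclassified = 100 * unclassified / total if total else 0.0
    return pct_unclassified, dict(counts)


# Made-up topic assignments for ten comments.
assignments = [0, 0, 1, -1, 2, 0, -1, 1, 0, 2]
pct_unclassified, topic_sizes = summarise_topics(assignments)
print(pct_unclassified, topic_sizes)
```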
4.2.8 - Datasets and their purposes
The corpus used to create the BERTopic model is described in the Development Data section.
2.4.3. Development Data
4.3.1 - Development data description
For unsupervised model training, the entire 2-year dataset is used to generate topics, drawing on a corpus that spans all services.
For supervised model training, we have 3 different training sets for Personal Tax Account, Business Tax Account and Telephony. The data described here relate to the latest set of models that have been trained (Telephony); however, the process and data are similar across the three.
4.3.2 - Data modality
Text Data
4.3.3 - Data quantities
Labelled supervised data: 5,000 records, with 1,000 records in the train split. Data for topic modelling: 4,000,000 records.
4.3.4 - Sensitive attributes
The data are processed in R and Python, which includes our PII process: any survey text that contains anything that could be PII is redacted in full. We do this by looking for patterns that match:
- NINOs
- Telephone numbers
- Addresses
- Postcodes
- Money amounts
- Any sort of identifier (viz. a sequence of six or more numbers in a row, which would include any UTR, sort code, or bank account number)
- HTML tags
Any survey comment that matches any one of these patterns is removed in full. This is a cautious approach. Because the instructions about not including PII are clear and the vast majority of customers comply, it proves to be effective.
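The pattern-matching approach above can be illustrated with a short Python sketch. The regular expressions here are simplified assumptions written for this example, not the patterns used in the OLT pipeline, and address detection is left out because addresses are too varied for a short sketch. The function name `redact_if_pii` is hypothetical.

```python
import re
from typing import Optional

# Simplified, illustrative patterns (assumptions, not the production rules).
PII_PATTERNS = [
    # NINO-like: two letters, six digits, one suffix letter A-D
    re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b", re.IGNORECASE),
    # UK-style telephone number starting with +44 or 0
    re.compile(r"(?:\+44|\b0)(?:[\s-]?\d){9,10}\b"),
    # UK postcode shape, for example SW1A 2BQ
    re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
    # Money amount written with a pound sign
    re.compile(r"£\s?\d[\d,]*(?:\.\d{2})?"),
    # Any identifier: six or more digits in a row
    re.compile(r"\d{6,}"),
    # HTML tag
    re.compile(r"<[^>]+>"),
]


def redact_if_pii(comment: str) -> Optional[str]:
    """Return None (comment removed in full) if any pattern matches, else the comment."""
    if any(pattern.search(comment) for pattern in PII_PATTERNS):
        return None
    return comment


comments = [
    "The page was confusing",
    "My NINO is QQ123456C",
    "I paid £250 and nothing happened",
    "Call me on 07700 900123",
]
cleaned = [redact_if_pii(comment) for comment in comments]
print(cleaned)
```

Dropping the whole comment rather than masking the matched span loses some harmless feedback, but it avoids leaving behind context that could still identify a customer.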
4.3.5 - Data completeness and representativeness
In producing the models, each theme is treated separately. That is, we end up with a separate classifier for each theme. This means the ratio of positive to negative examples is different in every case. Also, since a single comment can have multiple classifications, the fractions of the data across themes sum to more than 1. To deal with class imbalance, we can use a combination of undersampling the majority class and oversampling the minority class to achieve a good sample balance while maximising the amount of data fed into the model. To achieve this we will use the ovun.sample function from the ROSE package.
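The combined under- and oversampling idea can be shown with a simplified Python analogue. This is not the ROSE implementation and makes no claim to match the behaviour of ovun.sample exactly; the function name `balance_both`, its parameters and the toy data are assumptions for illustration.

```python
import random
from typing import List, Tuple

Record = Tuple[str, int]  # (comment text, 0/1 label for one theme)


def balance_both(majority: List[Record], minority: List[Record],
                 n_total: int, p_minority: float = 0.5, seed: int = 0) -> List[Record]:
    """Undersample the majority class and oversample the minority class."""
    if not minority:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)
    n_minority = round(n_total * p_minority)
    n_majority = n_total - n_minority
    if n_majority > len(majority):
        raise ValueError("not enough majority records to undersample")

    # Undersample the majority class without replacement.
    sampled_majority = rng.sample(majority, n_majority)
    # Oversample the minority class with replacement.
    sampled_minority = rng.choices(minority, k=n_minority)

    combined = sampled_majority + sampled_minority
    rng.shuffle(combined)
    return combined


# Toy data: 90 comments without the theme, 10 comments with it.
majority = [(f"comment {i}", 0) for i in range(90)]
minority = [(f"theme comment {i}", 1) for i in range(10)]

balanced = balance_both(majority, minority, n_total=100, p_minority=0.5)
print(len(balanced), sum(label for _, label in balanced))
```

Keeping the total at the original dataset size while shifting the class proportions is one way to balance the sample without discarding as much data as pure undersampling would.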
The unsupervised modelling process uses the entire dataset for the initial model training; new records are included as part of the model retraining process.
4.3.6 - Data cleaning
Personal data can be contained in free text responses. We attempt to remove personally identifiable information (PII) by detection using regex, and then redacting the entire comment if PII is detected.
Text used in model training is also tokenised, stemmed and lemmatised as part of pre-processing.
4.3.7 - Data collection
Exit survey data originating from Telephony, Digital, HMRC app, webchat and Digital Assistant services.
4.3.8 - Data access and storage
The customer exit surveys are collected through HMRC’s digital systems. Once we extract those data from the source system all subsequent processing takes place on our secure data analytics platform (DAP). The data are stored in the DAP in a folder controlled through a distribution list. No other users of the DAP can read or modify those data.
The distribution list that controls access to the data currently has 10 members.
4.3.9 - Data sharing agreements
N/A
Tier 2 - Operational Data Specification
4.4.1 - Data sources
Weekly uploads of exit survey data originating from Telephony, Digital, HMRC app, webchat and Digital Assistant services.
4.4.2 - Sensitive attributes
Personal data can be contained in free text responses. We attempt to remove personally identifiable information (PII) by detection using regex, and then redacting the entire comment if PII is detected. Otherwise, no personal data is collected for the purpose of modelling.
4.4.3 - Data processing methods
The data are processed in R and Python, which includes our PII process: any survey text that contains anything that could be PII is redacted in full. We do this by looking for patterns that match:
- NINOs
- Telephone numbers
- Addresses
- Postcodes
- Money amounts
- Any sort of identifier (viz. a sequence of six or more numbers in a row, which would include any UTR, sort code, or bank account number)
- HTML tags
Any survey comment that matches any one of these patterns is removed in full. This is a cautious approach. Because the instructions about not including PII are clear and the vast majority of customers comply, it proves to be effective.
4.4.4 - Data access and storage
The customer exit surveys are collected through HMRC’s digital systems. Once we extract those data from the source system all subsequent processing takes place on our secure data analytics platform (DAP). The data are stored in the DAP in a folder controlled through a distribution list. No other users of the DAP can read or modify those data.
Only two years’ data are stored: each week we delete any records older than that as part of our data processing steps.
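A weekly retention step of this kind can be sketched as follows. This is a minimal Python illustration, assuming each record carries a submission date; the field name `submitted`, the function name `drop_expired` and the approximation of two years as 730 days are assumptions, not details of the OLT pipeline.

```python
from datetime import date, timedelta
from typing import Dict, List


def drop_expired(records: List[Dict], today: date, retention_days: int = 730) -> List[Dict]:
    """Keep only records submitted within the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [record for record in records if record["submitted"] >= cutoff]


# Made-up records with hypothetical fields.
records = [
    {"id": 1, "submitted": date(2023, 4, 1)},
    {"id": 2, "submitted": date(2024, 6, 15)},
    {"id": 3, "submitted": date(2025, 4, 28)},
]

kept = drop_expired(records, today=date(2025, 5, 1))
print([record["id"] for record in kept])
```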
The distribution list that controls access to the data currently has 10 members. The data are not shared directly, although the results are summarised in the OLT webapp. The webapp is only accessible to users with the required SRS role, and then only from the STRIDE network.
4.4.5 - Data sharing agreements
N/A
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessments
The latest Data Protection Impact Assessment (DPIA) was completed in May 2025. Otherwise, personal data, including protected characteristics, are not used as part of this algorithm.
5.2 - Risks and mitigations
One of the key risks in the tool is personally identifiable information (PII), for example National Insurance numbers, being submitted in comment data. We mitigate this risk by attempting to detect PII in pre-processing and redacting comments in which it is detected.
Data access is controlled via a distribution list; no other users can read or modify those data. The data are not shared directly, although the results are summarised in the OLT webapp. The webapp is only accessible to users with the required service role, and then only from the STRIDE network.