DSIT: Consult
Consult is a new internal tool offering civil service departments high-quality, assured AI topic modelling capabilities for consultation processing.
Tier 1 Information
1 - Name
Consult
2 - Description
The i.AI team is developing a tool to make the process of analysing public responses to government consultations faster and less expensive. Consult uses AI to automatically extract patterns and topics from the responses, and turns them into dashboards for policy makers.
3 - Website URL
https://ai.gov.uk/projects/consult/
4 - Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Department for Science, Innovation and Technology
1.2 - Team
Incubator for AI
1.3 - Senior responsible owner
Director of the Incubator for Artificial Intelligence (i.AI)
1.4 - External supplier involvement
No
1.4.1 - External supplier
N/A
1.4.2 - Companies House Number
N/A
1.4.3 - External supplier role
N/A
1.4.4 - Procurement procedure type
N/A
1.4.5 - Data access terms
N/A
Tier 2 - Description and Rationale
2.1 - Detailed description
The Consult processing pipeline is a series of LLM prompts designed to work together in sequence. These prompts are broadly defined as capabilities and can be separated into: sentiment analysis, topic generation and topic mapping.
The pipeline steps can be further broken down as:
- Sentiment Analysis - performed on free-text responses to understand agreement or disagreement with respect to a question or proposal.
- Topic Generation - sentiment analysis results are paired with the responses to inform a topic generation step. Topics may or may not include evaluative language at this stage.
- Topic Consolidation - generated topics are de-duplicated; semantically similar topics are grouped.
- Topic Refinement - topics containing evaluative language synonymous with a stance (agreement or disagreement) are refined to produce a neutral topic framework.
- Topic Mapping - each response is mapped to topics from the neutral topic framework.
As an example, if you process responses to a consultation on advertising unhealthy foods, Consult will generate a list of topics per question. These could include: ‘freedom of choice’, ‘descriptive food labelling’ or ‘making healthy food cheaper’.
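To make this concrete, the sketch below shows one assumed shape such an output could take for the unhealthy-foods example; the field names and example responses are illustrative only, not Consult's actual schema.

```python
# Illustrative only: field names and content are hypothetical, not Consult's actual output schema.
example_output = {
    "question": "Should advertising of unhealthy foods be restricted?",
    "topics": [
        {"topic_id": "A", "label": "Freedom of choice"},
        {"topic_id": "B", "label": "Descriptive food labelling"},
        {"topic_id": "C", "label": "Making healthy food cheaper"},
    ],
    "responses": [
        {
            "text": "People should decide for themselves what they eat.",
            "sentiment": "disagree",   # stance on the proposal
            "topics": ["A"],           # zero-to-many topics per response
        },
        {
            "text": "Clearer labels and cheaper fruit and veg would help more.",
            "sentiment": "agree",
            "topics": ["B", "C"],
        },
    ],
}
```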
2.2 - Scope
During a consultation, a civil servant can review incoming responses to obtain interim findings. Once a consultation has closed to responses, a civil servant can analyse a range of free-text and closed-question responses to understand the thematic landscape. Responses are uploaded into the tool, which then creates visual outputs that provide an initial indication of the themes within the responses. This provides a basis for further analysis of the responses and can support the creation of a theme (sometimes called a code) framework that consultation analysts can use to begin writing a consultation report.
2.3 - Benefit
The key benefits are improving the consultation process (making it more likely for the public’s voice to be heard and to influence policy), productivity gains for civil servants, and reduced costs (including spend on consultants).
Productivity gains: Standardising and automating the initial phase of consultation analysis allows teams to access initial analytical findings more quickly and in an easy-to-review visual format. Consult reduces the burden on civil servants to compile and categorise consultation responses, instead empowering analysts to dive deeper into the data and focus more on the in-depth consultation analysis required to draft a consultation response.
Improving the consultation process: At the moment the consultation process takes a long time and many aspects are prone to human error. By reliably and quickly running the initial stage of the analysis, the i.AI tool makes it easier for policy analysts to get insights from responses, and to start to act on them as soon as the consultation has closed. Furthermore, by making the process more valuable and less costly, this tool might increase the number of consultations that can be run.
Reducing costs: Due to the high cost and burden of analysing consultations, it is very common for the job to be outsourced to the private sector. By offering this service to teams for free, we will be saving taxpayer money.
2.4 - Previous process
Currently, civil servants analyse multiple consultations a year. There is no standard process for completing this analysis, and it often takes teams many months to complete a comprehensive review of free-text responses. These are read, themed and counted manually by teams of civil servants, or the full process, if completed by a third-party supplier, can cost millions each year. The output of this analysis is often a very text-heavy, publicly available consultation report.
2.5 - Alternatives considered
We explored using BERTopic for this analysis and found several limitations. The topics generated by BERTopic were often difficult for users to understand. Users couldn’t easily influence the topics being mapped (e.g. if they had their own framework), and it was difficult to assign one response to multiple topics. There was also no easy way to feed the context of the question or consultation into the analysis.
Using an LLM for this is novel and offers a more efficient approach.
Tier 2 - Decision making Process
3.1 - Process integration
Consult is an analytical aid. The outputs of Consult can be referenced within the consultation report, taking into account the wider context and analytical work done by civil servants. Consult works as an initial stage of analysis to form a quick, easy-to-read summary of the themes present in the consultation. Further in-depth analysis will then be required by analysts, especially where the themes produced by Consult are either ambiguous or could benefit from subject matter expertise to interpret them.
An example Consult use case is when a department receives a large consultation with over 1,000 individual responses. This generates a large volume of free-text answers that analysts are required to read and analyse over a short period of time. To form an initial idea of the content of these responses, analysts could upload all the answers to Consult and see how many agreed or disagreed with the proposed policy change. This allows teams to start drafting the consultation report sooner, and provides a clear view of responses that need to be explored further or common themes that have emerged across multiple responses.
3.2 - Provided information
Final outputs are still being considered. The current version presents a breakdown of the themes found in the responses to each question, alongside quantitative analysis shown via pie charts that summarise the answers to closed questions.
3.3 - Frequency and scale of usage
Currently Consult is in early development and has a limited number of test users (for user research purposes). No citizens interact with the tool. It does not make any policy decisions.
3.4 - Human decisions and review
Consult provides high-level analysis that is then reviewed and critiqued by civil servants working on the consultation. The tool makes no decisions and provides only a guide to the consultation responses, rather than definitive, in-depth consultation reports. Civil servants can use the information provided by the tool in the full consultation report after quality assuring it.
3.5 - Required training
All users are required to read a statement prior to accessing and using the tool which explains how the model and tool work.
3.6 - Appeals and review
No decisions are made by the tool, so no appeals process exists. For members of the public who wish to raise reviews or appeals about their consultation responses, each individual department will approach this differently, in line with current guidance for all consultation reviews and appeals. Users of the tool have the ability to edit any of the AI’s outputs if they disagree with them.
Tier 2 - Tool Specification
4.1.1 - System architecture
Our approach chains together a sequence of tasks, each focused on producing a distinct output required to analyse a consultation. These tasks all involve calling large language models (LLMs). We have developed a generic interface that allows us to do the following:
- Load a prompt template with task specific instructions (detailing how the task should be carried out).
- Insert task-specific inputs into the prompt template; if these inputs are too large to fit into a single prompt, they are batched and a series of prompts is produced.
- Asynchronously submit these prompts to an LLM.
- Retry failed requests with exponential back-off where required, allowing us to robustly handle rate-limit errors without disrupting the pipeline.
- Ensure the outputs conform to a predetermined JSON structure (we run integrity checks to test this). A simplified sketch of this interface follows.
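The sketch below illustrates this generic interface under stated assumptions: the function names and the call_llm() stub are hypothetical, and the real ThemeFinder implementation may differ.

```python
# Illustrative sketch only: function names and the call_llm() stub are
# hypothetical, not the actual ThemeFinder implementation.
import asyncio
import json
import random

MAX_RETRIES = 5

async def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (e.g. an Azure-hosted model)."""
    raise NotImplementedError

def build_prompts(template: str, responses: list[str], batch_size: int = 50) -> list[str]:
    """Insert task-specific inputs into the prompt template, batching them
    when there are too many to fit into a single prompt."""
    batches = [responses[i:i + batch_size] for i in range(0, len(responses), batch_size)]
    return [template.format(responses="\n".join(batch)) for batch in batches]

async def submit_with_retry(prompt: str) -> dict:
    """Submit one prompt, retrying with exponential back-off (plus jitter),
    and check that the output conforms to the expected JSON structure."""
    for attempt in range(MAX_RETRIES):
        try:
            raw = await call_llm(prompt)
            output = json.loads(raw)        # integrity check: output must be valid JSON
            if "topics" not in output:      # integrity check: expected keys are present
                raise ValueError("missing 'topics' key")
            return output
        except Exception:                   # in practice, catch rate-limit errors specifically
            await asyncio.sleep((2 ** attempt) + random.random())
    raise RuntimeError("LLM call failed after retries")

async def run_task(template: str, responses: list[str]) -> list[dict]:
    """Asynchronously submit all prompts produced for one pipeline task."""
    prompts = build_prompts(template, responses)
    return list(await asyncio.gather(*(submit_with_retry(p) for p in prompts)))
```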
Below we provide a more detailed breakdown of the tasks in our pipeline (a simplified sketch of how they chain together follows the list):
- Question Expansion: the first task enriches the short consultation question with extra context (typically this is extracted from a detailed summary of the aims of the consultation). The output of this step is a standalone version of the consultation question that is extended with relevant context.
- Sentiment Analysis: the second task runs on individual free-text responses and classifies whether they agree or disagree with the proposed changes. The output of this step is a classification label for each response.
- Topic Generation: the third task extracts common themes/topics from batches of responses. In order to improve performance, responses are batched using the classification labels produced by sentiment analysis. The output of this task is a list of topics per response batch.
- Topic Condensation: this task aims to combine the topics produced using different batches to a single list of topics. The process involves the de-duplication and combination of similar topics. The output is a much smaller master list of topics.
- Topic Refinement: this task is an additional step to further improve the quality of the topics. It enforces certain linguistic and content constraints on how topics are worded (to better meet user needs). The output is a final list of topics extracted from the responses to a question.
- Topic Mapping: the final stage of our pipeline analyses each of the responses to a question and classifies which of the topics it relates to. In addition to classification, this process also provides a reason for its choice. The output of this step is a response-level classification of the topics.
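The sketch below shows, in simplified form, how these six tasks could chain together. Every stage is a placeholder with a hard-coded return (in the real pipeline each is an LLM call), and the names are illustrative, not the actual ThemeFinder API.

```python
# Simplified sketch of how the pipeline stages chain together. All stage
# functions are placeholders; names are illustrative, not the ThemeFinder API.
def expand_question(question: str, context: str) -> str:
    return f"{question}\n\nContext: {context}"              # Question Expansion

def classify_sentiment(question: str, responses: list[str]) -> list[str]:
    return ["agree" for _ in responses]                     # Sentiment Analysis

def generate_topics(question: str, responses: list[str], sentiments: list[str]) -> list[str]:
    return ["freedom of choice", "freedom to choose"]       # Topic Generation (per batch)

def condense_topics(topics: list[str]) -> list[str]:
    return ["freedom of choice"]                            # Topic Condensation (de-duplication)

def refine_topics(topics: list[str]) -> list[str]:
    return topics                                           # Topic Refinement (neutral wording)

def map_topics(question: str, responses: list[str], topics: list[str]) -> dict[int, list[str]]:
    return {i: topics for i in range(len(responses))}       # Topic Mapping (topics per response)

def analyse_question(question: str, context: str, responses: list[str]) -> dict:
    expanded = expand_question(question, context)
    sentiments = classify_sentiment(expanded, responses)
    raw_topics = generate_topics(expanded, responses, sentiments)
    condensed = condense_topics(raw_topics)
    final_topics = refine_topics(condensed)
    mapping = map_topics(expanded, responses, final_topics)
    return {"topics": final_topics, "mapping": mapping}
```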
We have turned this into an open-source package called ThemeFinder.
We are building an app, Consult, to support the consultation process and allow users to access these capabilities. It is still under development, but is also open-sourced.
4.1.2 - Phase
Pre-deployment
4.1.3 - Maintenance
As we connect with more departments who are running consultations we are constantly iterating the tool and its capabilities to suit user needs. The tool will be maintained and managed by i.AI until it moves to post-deployment. At this point, the management of the service will be discussed with wider government digital teams to agree how this should be run going forward.
4.1.4 - Models
GPT-4o (Internal Azure Instance)
Tier 2 - Model Specification
4.2.1 - Model name
GPT-4o (hosted on the Azure OpenAI Service)
4.2.2 - Model version
API version 2024-02-01
4.2.3 - Model task
Find sentiment and topics among sets of free-text answers and assign each answer to a theme.
4.2.4 - Model input
A list of free-text answers
4.2.5 - Model output
A list of theme assignments, where a theme assignment comprises a topic ID and the assigned answer. There are zero-to-many assignments per answer.
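As an assumed illustration of this output shape (not the exact schema), a theme assignment could be represented as follows.

```python
# Assumed illustration of the output shape described above, not the exact schema.
from typing import TypedDict

class ThemeAssignment(TypedDict):
    topic_id: str   # identifier of the assigned theme
    answer: str     # the free-text answer the theme was assigned to

# Zero-to-many assignments per answer: an answer can appear in several
# assignments, or in none if no theme applies to it.
assignments: list[ThemeAssignment] = [
    {"topic_id": "A", "answer": "People should decide for themselves what they eat."},
    {"topic_id": "B", "answer": "Clearer labels would help shoppers."},
    {"topic_id": "C", "answer": "Clearer labels would help shoppers."},
]
```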
4.2.6 - Model architecture
https://openai.com/index/gpt-4o-system-card/
We have developed the capability to generate and map themes using LLMs, which we have open-sourced as the ThemeFinder package.
4.2.7 - Model performance
Evaluation methodology
To test the performance of Consult we ran two evaluation studies:
- Evaluated the theme generation process using synthetic data.
- Evaluated the classification by comparing the results to human labellers using a subset of consultation responses.
Two policy advisors from a government department - ‘Human 1’ and ‘Human 2’ - who worked on the original consultation analysis were asked to complete the exercise by each reviewing 200 responses per question for 3 questions. A senior policy advisor - the “supervisor” - was asked to provide what they felt was the correct answer for each response and this was used as the ground truth to compare each labeller against.
The participants were asked to do the following:
Relevance - Mark whether or not the response answered the question
Agreement - Determine if the response was for or against the proposal
Theme - Identify themes represented in the responses
The results were assessed on several criteria (a simple sketch of how these are computed follows):
Similarity - Did the AI agree with humans as much as humans agree with each other?
Accuracy - Using the supervisor’s answers as a ground truth, how accurate was the AI compared to humans?
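As a minimal sketch, assuming labels are aligned by response, both measures reduce to simple agreement rates; the code and example labels below are illustrative, not the actual evaluation code.

```python
# Minimal sketch (not the actual evaluation code) of the two measures,
# assuming labels are aligned by response; the example labels are made up.
def agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of responses on which two labellers gave the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

supervisor = ["agree", "disagree", "agree", "agree"]     # ground truth
human_1    = ["agree", "disagree", "agree", "disagree"]
ai         = ["agree", "agree",    "agree", "agree"]

similarity = agreement(ai, human_1)              # does the AI agree with humans?
ai_accuracy = agreement(ai, supervisor)          # AI vs supervisor ground truth
human_accuracy = agreement(human_1, supervisor)  # human vs supervisor ground truth
```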
Key findings
Framework - The generated theme framework is representative of themes that exist in the data and is robust to factors such as grammar and spelling errors.
Relevance - The AI performs comparably with human labellers on relevance (~95% agreement with the supervisor). It is more generous than humans when marking answers as relevant. This is preferable, as falsely marking a relevant answer as irrelevant is more problematic than the opposite.
Agreement - The AI performs comparably to the human labellers when judging the agreement of each response (~85% agreement with the supervisor).
Themes - The AI was outperformed by one human labeller but performed comparably to the other human labeller on theme mapping. The AI agreed with the supervisor’s themes 62% of the time, and humans agreed with the supervisor on 70-73% of the themes (but this understates performance, because a single response can have multiple themes and accuracy doesn’t credit partial matches). The human labellers only agreed with each other 62% of the time, showing how much variation there is.
Efficiency - Consult is 400 times cheaper and over 120 times quicker (3 minutes vs 6 hours on this sample) and does not make procedural errors. The procedural error rate for the human labellers was almost 2.5% (29 out of 1200 responses) in the labelling test. These errors took the form of responses being partially or entirely missed out, labels being put in the wrong position, or incorrectly following the methodology.
4.2.8 - Datasets
Historic consultation data sets, shared under MOU by other government departments. We also generate synthetic data to assist with evaluation and refinement.
4.2.9 - Dataset purposes
The core model has not received any further training or fine-tuning. The i.AI team has undertaken prompt engineering activities, which have been evaluated and iterated on.
Tier 2 - Data Specification
4.3.1 - Source data name
Historic consultation data sets, plus synthetic consultation data sets.
4.3.2 - Data modality
Text
4.3.3 - Data description
In order to evaluate our pipeline we use real and synthetic consultation data. The real data has been sourced from departments using historic consultations. The synthetic data was generated by us using our own methodology, in which personas are generated and then used to create responses for which the ground-truth themes are known.
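A minimal sketch of this persona-based approach is shown below, assuming an LLM call sits behind generate_response(); the function names, personas and themes are illustrative only, not our actual methodology code.

```python
# Minimal sketch of the persona-based approach, assuming an LLM call sits
# behind generate_response(); names and themes are illustrative only.
import random

THEMES = ["freedom of choice", "descriptive food labelling", "making healthy food cheaper"]
PERSONAS = ["concerned parent", "small business owner", "public health researcher"]

def generate_response(persona: str, themes: list[str]) -> str:
    """Placeholder for an LLM call that writes a consultation response in the
    voice of `persona`, deliberately touching on the given themes."""
    return f"As a {persona}, I care about {', '.join(themes)}."

def build_synthetic_dataset(n: int) -> list[dict]:
    """Create n synthetic responses whose ground-truth themes are known, so
    theme generation and mapping can be scored against them."""
    dataset = []
    for _ in range(n):
        persona = random.choice(PERSONAS)
        ground_truth = random.sample(THEMES, k=random.randint(1, 2))
        dataset.append({
            "persona": persona,
            "response": generate_response(persona, ground_truth),
            "ground_truth_themes": ground_truth,
        })
    return dataset
```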
4.3.4 - Data quantities
N/A
4.3.5 - Sensitive attributes
There is no personal data included in the historic consultation data. Because we are using OpenAI endpoints within Azure, none of the consultation data can be seen by Microsoft or OpenAI or be used to train future models.
4.3.6 - Data completeness and representativeness
N/A - Historic consultation data is expected to contain gaps as not all respondents will answer all questions.
4.3.7 - Source data URL
N/A
4.3.8 - Data collection
Collected via historic consultation responses.
4.3.9 - Data cleaning
We have not yet built a data cleaning process, but we want to develop a simple process that allows users to upload data to the tool themselves. At the moment, we do this manually. The only pre-processing is getting the data into the right format; we don’t change any of the free-text responses.
4.3.10 - Data sharing agreements
Memorandum of Understanding with government departments for use of their historical consultation data.
4.3.11 - Data access and storage
Staff: All staff in i.AI hold a minimum of SC clearance, with several holding DV. This includes all of our cloud platform team.
Cloud Hosting: All of our cloud processing is done inside the Cabinet Office-provided AWS and Azure environments, which are used for all of our OFF-SEN data hosting. All of our applications, databases and networking run in the London AWS data centre for all our workloads. We use role-based permissions to control who can access what.
Network Security: We operate a universal firewall for all our application endpoints, where only government IP addresses (individual addresses and ranges) are allow-listed. This allow list can be restricted further depending on the sensitivity of the workload.
DPIA: We have been informed by DSIT Data Protection team that as we are Data Processors it is the responsibility of the Data Controllers to complete a DPIA. We provide the Data Controllers with the appropriate information to be able to complete the DPIA.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
Data Protection Impact Assessment - Completed
i.AI human comparison evaluation - Completed November 2024
Each department that will use this tool and provide it with their consultation outputs will need to complete their own DPIA.
For each department we work with, we have a signed memorandum of understanding or service level agreement.
5.2 - Risks and mitigations
Proprietary models may have inherent biases based on the data they are trained on. Where consultation data includes sensitive topics (e.g. race, gender, socioeconomics), the analysis could reflect or amplify existing biases, leading to unfair or misleading interpretations. To mitigate this as best we can, while developing the capabilities we have been carefully reviewing and pre-processing data at input, and working with domain experts in collaborating departments to critically assess outputs from the tool.
The interpretability and transparency of proprietary models is a risk. The model’s decision-making processes are opaque, making it difficult to explain how it arrived at its conclusions. To mitigate this, Consult is provided as an assistive and supplementary tool, rather than one to be relied on in isolation. Its output should be combined with domain expertise before it informs decision making. Future user interfaces of Consult will also provide clear guidance and prompts highlighting this fact. A user’s role and ability to relabel and modify generated themes will also emphasise the human-in-the-loop aspects of this tool.
In most cases, consultations use very specialist language, including many acronyms and at times legal terms. There is a risk that Consult will not have the appropriate context to understand the meaning of this specialist terminology. To mitigate this, we have worked with domain experts and added consultation context to our processing pipeline to expand on the questions and their meaning.
There are some legal and process-based concerns about how AI can be adopted in the consultations process, while preserving the need for all responses to be appropriately read and considered. We are working closely with legal colleagues and individual departments to help work through these important questions.