Office for National Statistics: Survey Assist
A tool that calls on a large language model to ask supplementary questions in the ONS's Transformed Labour Force Survey, so that Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) codes can be assigned to survey responses more accurately.
Tier 1 Information
1 - Name
Transformed Labour Force Survey - SIC/SOC LLM Questions
2 - Description
The Transformed Labour Force Survey (TLFS) is a large national survey run by the Office for National Statistics (ONS) that provides a wealth of information from the public, underpinning national statistics used by the government to understand the labour market, such as the rate of unemployment in the country. The survey asks respondents about their job and the industry in which they work (such as a chef in the armed forces) in free text fields. Respondents sometimes do not answer these questions in sufficient detail for their job and industry to be accurately assigned the correct Standard Occupational Classification (SOC) code (chef = 5434) and Standard Industrial Classification (SIC) code (defence = 84220). The algorithmic tool uses a Large Language Model to provide follow-up questions based on a respondent’s input that help to clarify what their job and industry are, and then determines the appropriate SIC and SOC codes for that person’s response. This will improve the granular quality of the responses to the TLFS and thus improve the quality of the statistics produced using the survey.
3 - Website URL
Survey Assist builds on ClassifAI, an ONS-developed experimental text classification pipeline using Retrieval Augmented Generation (RAG). ClassifAI is explained here: https://datasciencecampus.ons.gov.uk/classifai-exploring-the-use-of-large-language-models-llms-to-assign-free-text-to-commonly-used-classifications/. Whilst this is not a full description of Survey Assist, it explains some of the mechanics behind the algorithmic tool.
4 - Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Transformed Labour Force Survey at the Office for National Statistics
1.2 - Team
Social Surveys
1.3 - Senior responsible owner
Deputy Director on behalf of the Director of the Surveys Directorate, Office for National Statistics, and the Director of Digital Services.
1.4 - External supplier involvement
No
Tier 2 - Description and Rationale
2.1 - Detailed description
When a respondent answers the questions in the Transformed Labour Force Survey they have to say what their job is and in what industry they work. This information is collected to assign Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) codes to responses. Survey Assist uses the provided input to determine the most likely SIC codes (and, in time, SOC codes, although this functionality is not yet developed) for the response given. When there is a direct match between the industry and a known SIC (for example, ‘Primary School’ maps directly to a SIC but ‘School’ does not), it directly suggests a SIC. If there is not a direct match, it takes the answers and runs them through a SIC Classification Plugin, ClassifAI, which suggests an initial SIC based on the input. Survey Assist then uses a large language model (LLM) to ask new, relevant, supplementary questions that ‘drill down’ into the initial response and ask the respondent for more clarity around what exactly their job and industry are. We are exploring, through testing, the best model for these drill-down questions, but the current prototype provides one open question and one closed question. The questions will vary between participants and depend upon their answers to standard questions in the TLFS. The answers to these questions and to the initial questions are then used to suggest the respondent’s SIC (and, in time, SOC) classification.
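To make that sequence concrete, here is a minimal sketch of the decision flow in Python. All function names, the toy lookup table and the example questions are hypothetical stand-ins, not the production implementation (the actual code is in the repositories linked under section 4.1.1).

```python
# Illustrative sketch of the Survey Assist decision flow; names and toy
# data are hypothetical, not the production implementation.

DIRECT_MATCHES = {"primary school": "85200"}  # 'School' alone has no entry


def lookup_direct_sic(industry: str) -> str | None:
    """Step 1: a direct match between the stated industry and a known SIC."""
    return DIRECT_MATCHES.get(industry.strip().lower())


def classifai_suggest(job: str, industry: str) -> str:
    """Step 2 (stub): the ClassifAI plugin suggests an initial SIC."""
    return "85XXX"  # placeholder candidate


def generate_followups(initial_sic: str, job: str, industry: str) -> list[str]:
    """Step 3 (stub): the LLM drafts one open and one closed question."""
    return [
        "What does the organisation you work for mainly make or do?",
        "Is the school you work in a primary or a secondary school?",
    ]


def assign_sic(job: str, industry: str) -> str | list[str]:
    """Return a SIC directly, or the follow-up questions to ask first."""
    code = lookup_direct_sic(industry)
    if code:
        return code
    initial = classifai_suggest(job, industry)
    return generate_followups(initial, job, industry)


print(assign_sic("teacher", "Primary School"))  # direct match: '85200'
print(assign_sic("teacher", "School"))          # no match: ask follow-ups
```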
The tool is currently under development, though it will be tested with real people. We are testing its performance using clerically reviewed data to assess whether Survey Assist has made the correct decision. We are also planning to review the quality of questions generated by the LLM. If this testing is successful, the clerical process is not likely to be used, and Survey Assist will operate automatically. Currently, Survey Assist is a standalone product with a graphical user interface (GUI) that can be used for testing, but the intention is that it will be embedded through an application programming interface (API) into the TLFS. Once in production, the goal is that respondents will be informed that the survey may contain AI-generated questions and will then be presented with the questions in the same way as any other question through the survey interface.
At this stage, a simple setup is used where inputs are sent through an API to trigger the LLM’s response and the LLM provides its questions directly to the computer interface.
When given the respondent’s answers, the LLM searches through all SIC and SOC codes (which are publicly available information) to determine which codes best match the responses given, and then provides an output which shows what the original coding would have been without the subsequent questions and what the newly derived codes would be after the LLM questions.
2.2 - Scope
The scope of the tool is limited to: determining whether supplementary questions are needed for a given survey response; compiling the most likely useful supplementary questions; presenting those questions to the respondent in the computerised survey collection system; assessing the responses to assign the most appropriate SIC and SOC codes; and presenting the results to the analyst.
The tool will initially be used with a test sample of people but if successful it will be rolled out as part of the standard survey for the entire TLFS.
The tool is limited to use in the Transformed Labour Force Survey at present.
2.3 - Benefit
Traditionally, this data was captured by face-to-face interviewers through the Labour Force Survey, but falling response rates and a move to online methods of collection led to the development of the Transformed Labour Force Survey, an online-only survey that is still in beta stage and is subject to ongoing review. Whilst the longer-term intention was to move to online collection for the bulk of labour market statistics, the two surveys (online and face-to-face) currently run in parallel whilst the TLFS explores ways of improving response rates and quality of data in its current beta stage. Survey Assist is designed to help with this by attempting to improve data quality at the point of collection, as it is not feasible (due to scale and timings) to go back to respondents after a survey has been submitted to ask for more clarity (for example: ‘what type of school do you work in?’). The tool is able to review a person’s responses, compare them against the entirety of the SIC and SOC coding structures, and devise appropriate follow-up questions aiming to narrow down SIC and SOC options to something that is codable at a better level of granularity (SIC can be coded to 5 digits, and SOC to 4 digits, which are the levels needed by some of the stakeholders of LFS data). If successful, the quality of labour market data will be improved, which will reduce bias, improve representativeness, and improve the statistics dependent on this, including employment, economic and population statistics; this will ultimately lead to better insights for government that could feed into policy decisions.
2.4 - Previous process
Previously a respondent was provided with a standardised question (the same question as is given to every respondent) to which they had to provide an answer. That answer was accepted as-is and then assigned the best-fitting SIC/SOC codes based on the given response, which are often not detailed enough to allow for coding to the level of granularity required. Responses are captured in free text fields, which, given the complexity of the SIC and SOC coding frames, are the most appropriate format (look-up lists, for example, would not be appropriate). If the response was unclear there was no follow-up or clarification process, which meant many respondents’ answers could not be coded.
2.5 - Alternatives considered
N/A - No other relevant tools are available that sufficiently undertake this task (improving data quality at the point of collection in an online survey). We have discussed this with academics working in this area and are aware of other National Statistical Institutes with this problem, but no other solution has been found that does not put considerable extra burden on the respondent (for example, follow-up telephone interviews) or the questioning body.
Tier 2 - Decision making Process
3.1 - Process integration
The output from the tool will be an assigned SIC and SOC code for each response. These are not considered at an individual level, but are instead amalgamated for reporting in varied outputs from the ONS on economic activity. The TLFS is a very large survey, so it is not feasible or appropriate for the decisions made at a granular level to be examined every time. As such the decisions made by the tool will stand (and an evaluation module is being built within the tool to understand its reliability against a set of gold-standard coded responses, thus giving an indication of the ongoing reliability of the tool’s outputs). Instead, the impact comes from later analysis of aggregated information at the population level. Here it is incumbent upon the analyst to ensure that the aggregated statistics are reliable and plausible. While not yet tested (that is to come), the decisions made by the algorithm are likely to result in a net improvement to the statistics compared with the status quo (where decisions occur automatically based on the respondent’s initial response, which is often insufficient to code at the required level of granularity for the best level of statistical outputs). As such the decisions made by the algorithm simply populate a field in the dataset - the decisions that will ultimately affect the public will be made by humans analysing the data or creating the policies to which they relate.
3.2 - Provided information
The tool provides two outputs over three phases. The first phase determines whether a follow-up question is required to better code the response on SIC and SOC. The first major output comes in the second phase where, if a follow-up question is deemed appropriate, one is written by the tool. The third phase produces the core output of the tool: a coded SIC/SOC classification, which takes the form of a written output. In the format used for testing this is collected on-screen, showing what the original assessment indicated to be the correct SIC/SOC codes and then what the revised SIC/SOC codes are, but in the longer term this will be collected as part of survey data collection as a field alongside the many other items of data collected in the survey. The survey then continues as it would have normally. Ultimately each individual response will be amalgamated into a very large dataset of all survey responses, with the SIC/SOC codes containing a mixture of codes that did not require clarification after the respondents’ inputs, and those that were determined by the LLM following the supplementary questions phase. It is likely that a flag will be created alerting analysts to whether the code was created as a result of follow-up questions or not.
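As a hedged illustration of what one such amalgamated record might look like, the sketch below shows the original and revised codes alongside the proposed analyst flag; the field names are assumptions, not the actual TLFS schema.

```python
# Hypothetical shape of a single coded record after the three phases;
# all field names here are illustrative, not the actual TLFS schema.
from dataclasses import dataclass, field


@dataclass
class CodedResponse:
    respondent_id: str
    initial_sic: str                 # code suggested from the original answers
    final_sic: str                   # code after any follow-up questions
    followup_asked: bool             # the flag proposed for analysts
    followup_questions: list[str] = field(default_factory=list)


record = CodedResponse(
    respondent_id="r-0001",
    initial_sic="85XXX",             # ambiguous answer: 'School'
    final_sic="85200",               # resolved after follow-up: primary education
    followup_asked=True,
    followup_questions=["Is the school a primary or a secondary school?"],
)
print(record)
```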
3.3 - Frequency and scale of usage
Initially the tool will be used as a standalone tool with a dedicated user interface for testing. We expect testing in these phases to involve individuals numbering in the tens (for in-person, non-controlled testing, to gauge respondent reactions) and in the thousands (an online controlled test with volunteers, to further gauge public impression and also measure the accuracy of the SIC/SOC coding). Outcomes of these test phases will govern further test stages, but the tool is expected to be incorporated into a flow of questions (possibly a lower-reach social survey) to understand how/if it affects respondent engagement (drop-off rates). Within the TLFS survey, there are several sections where questions on job and industry are asked, but it is envisaged that the tool will initially only be used in one section, to reduce respondent burden (how many questions are asked/how long a survey will take) whilst making the greatest impact in improving the quality of codable data.
If the test is successful the algorithm will then be implemented into the TLFS survey, which covers households across Great Britain (different arrangements are in place in Northern Ireland, although there may be interest in the tool from the Northern Ireland Statistics and Research Agency (NISRA) if it is proven beneficial for survey data collection). The tool could theoretically be used with every respondent, and it is noted that the TLFS reaches around 90,000 households a year, with approximately half of those households sampled for a further four waves. There are four waves a year, so there is a maximum of 90k+45k+45k+45k+45k surveys, four times a year. Within each survey, questions are asked about all household members, so this tool could potentially be used millions of times a year. However, in reality response rates to the TLFS are much lower than the number of households sampled, and not every person will need the LLM to clarify their answers, so the true number of cases where the LLM will be needed will necessarily be smaller than the total sample size of the TLFS (though a more informed proportion will not be known until testing is complete). An illustrative upper bound from these figures is sketched below.
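For illustration only, one reading of the arithmetic implied by the figures above:

```python
# Illustrative upper bound from the figures quoted above; actual usage will
# be lower due to response rates and the share of answers needing the LLM.
per_wave = 90_000 + 4 * 45_000    # one new cohort plus four returning cohorts
per_year = per_wave * 4           # four waves a year
print(per_wave, per_year)         # 270000 1080000 household surveys
# Each survey asks about all household members, hence the theoretical
# potential for millions of uses of the tool per year.
```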
3.4 - Human decisions and review
The implementation of the LLM has been developed by humans, and will have close human analysis throughout the research and development phases, but there is no direct human intervention in the questions that the LLM will ultimately ask, nor in the results that it provides. Clerical review (assessment of whether the LLM was correct or not) will be conducted during the research phases to ensure that the LLM is indeed operating correctly, but when operationalised this will not be possible. TLFS coding is done post-collection and is automated (unlike the original LFS, which has some clerical coding). The clerical work in the test phase is to verify that the LLM is working as expected; that volume of testing will not be possible with the operationalised version due to the volume of data. Consequently, the reduction in the use of clerical resource comes from the use of the TLFS instead of the LFS, rather than from specifically reducing the amount in the TLFS itself. This is a general efficiency of the process. Human input will also be used to judge the quality and respondent-centred nature of the questions generated by the tool during the research phases. Humans will, however, be involved in the analysis and production of statistics based on the data, and it will be for the analyst to ensure that the data are plausible, that their analysis truly reflects the population, and that any limitations or uncertainties are clearly described in their publication. In addition, an evaluation module is being built into the tool which will run against a gold-standard coded dataset and report on a regular basis if there are variations in its performance (for example, it will report on the number of accurate SIC and SOC matches, and the variance in questions).
3.5 - Required training
The tool will operate automatically as part of the usual survey interview process, using the survey platform designed for online survey participation. Respondents will see questions asked by the tool as part of the usual flow of their survey questions. Interviewers will be trained to understand that the questions are presented by a large language model and, to supplement this, when questions are presented a message will appear advising users that a large language model or ‘AI’ is generating those particular questions. In terms of development of the tool, the Survey Team and Digital Services Team are experts with specific training in survey development and in AI and technology respectively, and they will be providing communications to help advise stakeholders of the tool and its use in the wider survey.
3.6 - Appeals and review
The decisions made by the process are predominantly which SIC/SOC codes most likely represent the information provided by the respondent. That decision does not directly affect an individual, and it is not information shared with individuals in the current surveys (the Labour Force Survey, where an interviewer assigns a likely SIC/SOC code, or the current TLFS, where an internal tool assigns likely codes). Individual respondents do not see which SIC or SOC codes they have been classified into at present, nor can they correct or amend those codes. Using the Survey Assist model the difference is that this classification is done in the survey flow, rather than as post-collection processing, but the outcome is no different for the respondents’ ability to intervene in the classification. Respondents do not confirm their final SIC code in Survey Assist; rather, this is automated behind the scenes by the LLM and recorded in the Survey Assist database. This is consistent with the current TLFS process. All TLFS coding for SIC and SOC uses a Classification Index Matching System (CIMS); CIMS matching is done post-collection. Note that the LLM is used where the information provided by the participant may not be clear enough to make an accurate assessment of the SIC/SOC in the first place, so the respondent would not have seen their code assigned either way. At a population level we need to assess whether results are plausible, but again this will not affect an individual.

As such, once the process has been used to create the data it is not possible to review the specific records themselves, and therefore there is no mechanism for the public to review the outputs and check for themselves or correct any errors other than at the time of data collection. Once the data have been compiled it may not be possible to identify errors even with access to the data, because going back to a respondent to verify their answers to a question to check a SIC or SOC code is not possible. There is accordingly no mechanism for appeal or challenge, but neither does there need to be, as the outputs are used for population-level statistics and are not applicable to individual people or groups (indeed, data at such a low level would be statistically suppressed anyway and not used).

Methodologically, the process is reviewed in two main ways: the survey team review the outputs to ensure they have been produced correctly by the algorithms, and the development teams ensure that the LLM is implemented correctly and tested prior to operationalisation, with ongoing evaluation modules that can be monitored by stakeholders in survey analysis.
Tier 2 - Tool Specification
4.1.1 - System architecture
Survey Assist leverages generative AI to determine whether a follow-up question would increase model confidence in the assigned industrial or occupational classification code. The tool is implemented using a Retrieval-Augmented Generation (RAG) model that combines semantic search with Large Language Model (LLM) based selection. It begins by searching a knowledge base of classification descriptions, using embedded vectors to produce candidate classification codes. The LLM then evaluates these candidates against the survey response to identify whether further information is needed. In that instance, the LLM output is used to produce follow-up survey questions, which are then served to the respondent.
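A minimal sketch of the retrieval step is shown below, assuming a generic open-source embedding model via the sentence-transformers library; the embedding model, knowledge base and thresholds actually used by Survey Assist are not specified in this record.

```python
# Illustrative sketch of the semantic search step; model choice and the
# toy knowledge base are assumptions, not the Survey Assist configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Knowledge base of classification descriptions (toy examples only).
kb_texts = ["Primary education", "Secondary education", "Defence activities"]
kb_codes = ["85200", "85310", "84220"]
kb_vecs = model.encode(kb_texts, normalize_embeddings=True)


def shortlist(response: str, k: int = 2) -> list[tuple[str, str]]:
    """Embed the survey response and return the k nearest SIC candidates."""
    q = model.encode([response], normalize_embeddings=True)[0]
    sims = kb_vecs @ q                       # cosine similarity (unit vectors)
    top = np.argsort(sims)[::-1][:k]
    return [(kb_codes[i], kb_texts[i]) for i in top]


# These candidates are then passed to the LLM, which decides whether a
# follow-up question is needed or a code can be assigned directly.
print(shortlist("I teach young children at a school"))
```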
ONS is committed to transparency, so we are working in open GitHub repositories, which are currently:
https://github.com/ONSdigital/survey-assist-ui
https://github.com/ONSdigital/survey-assist-api
https://github.com/ONSdigital/survey-assist-utils
https://github.com/ONSdigital/sic-classification-library
https://github.com/ONSdigital/sic-classification-utils
https://github.com/ONSdigital/soc-classification-library
https://github.com/ONSdigital/soc-classification-utils
4.1.2 - Phase
Pre-deployment
4.1.3 - Maintenance
The tool is still in its pilot phase, but will undergo periodic technical review once operational, and an evaluation module will be built into the tool to provide regular reporting on a fixed dataset, measuring the consistency of the tool’s decision-making. It will also be used to measure the impact of changes when the proprietary LLM is periodically updated. This evaluation process will highlight any issues that arise, allowing swift intervention if changes occur.
4.1.4 - Models
Pre-trained large language models provided by the external vendor (Google). These models have been assured by ONS’s Security Teams.
Tier 2 - Model Specification
4.2.1 - Model name
Gemini 1.5 currently but may move to Gemini 2.0
4.2.2 - Model version
Gemini 1.5 Flash
4.2.3 - Model task
Optimised for cost efficiency and low latency, Gemini is a Large Language Model (LLM) that can receive input as text, code, images, audio or video and respond with text output. In ONS’s use case the model uses a tailored prompt and is only ever provided with text-based input. The model then follows the instructions in the prompt to attempt to classify the input data according to the described guidelines. The output is a list of potential classifiers and a follow-up question (related to the input data) to help refine the classification.
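For illustration, a minimal sketch of calling a Gemini model with a text-only prompt via the Vertex AI Python SDK; the project ID, region and prompt wording are placeholders, not ONS's production configuration.

```python
# Minimal sketch of a text-only Gemini call using the Vertex AI Python SDK;
# the project, location and prompt below are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="europe-west2")
model = GenerativeModel("gemini-1.5-flash")

prompt = (
    "You are an expert survey coder. Given the respondent's answers below, "
    "list the most likely SIC codes and, if the answers are ambiguous, "
    "draft one follow-up question.\n\n"
    "Job title: chef\nIndustry: armed forces"
)
response = model.generate_content(prompt)
print(response.text)  # text output only, as described above
```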
4.2.4 - Model input
Text-based input using classification text from publicly available sources. We may augment this with internal knowledge bases.
4.2.5 - Model output
A list of potential classifier descriptors and codes, along with a text-based follow-up question.
4.2.6 - Model architecture
Gemini is an off-the-shelf, generative pre-trained transformer with large language model capability - see documentation for further detail: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash-lite
4.2.7 - Model performance
We are using the pre-trained Gemini models, without any form of fine-tuning. They are ready to use, out-of-the-box models. The way to enhance these models is to provide them with context (e.g. our SIC and SOC classification data), so the model uses the specialist context (by interacting with a vector store) to classify the text that is sent as input.
To improve the performance and ensure the responses are as high quality and relevant as possible, we have implemented best practice when it comes to prompt engineering (these are instructions given to the model):
- giving the model a role
- clear requirements on the format of the response
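As a hedged illustration of those two practices, a prompt skeleton might look like the following; the wording and JSON schema are assumptions, not the production prompt.

```python
# Illustrative prompt skeleton showing a role plus a strict response format;
# this is an assumption for illustration, not the ONS production prompt.
PROMPT_TEMPLATE = """\
ROLE: You are an expert in the UK Standard Industrial Classification (SIC).

TASK: Using only the candidate codes provided, decide which best match the
respondent's answers. If no single code is clear, draft one follow-up question.

RESPONSE FORMAT: Reply with JSON only, using exactly these keys:
  {{"candidates": [{{"code": "...", "likelihood": 0.0}}],
    "followup": "..."}}

CANDIDATES: {candidates}
RESPONDENT ANSWERS: {answers}
"""

filled = PROMPT_TEMPLATE.format(
    candidates='[{"code": "85200", "title": "Primary education"}]',
    answers="Job: teacher; Industry: school",
)
print(filled)
```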
We are also building an evaluation module that will enable us to compare results for:
- consistency - does the same input get very similar results when the model changes
- accuracy - do the responses retrieved from Gemini align with manual classification
- suitability - do the follow-up questions fit good practice as defined by ONS, are the questions relevant, and are the questions devoid of inappropriate language
- performance - is the rate of response suitable
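For illustration, the first two of those checks could be computed along the lines below; the data structures and toy codes are assumptions rather than the ONS design.

```python
# Sketch of two evaluation-module checks; toy data for illustration only.

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of responses where the tool's code matches the clerical code."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def consistency(run_a: list[str], run_b: list[str]) -> float:
    """Share of identical codes when the same inputs are re-run,
    e.g. before and after a model version change."""
    assert len(run_a) == len(run_b)
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


gold = ["85200", "84220", "56101"]   # clerically coded gold standard
run1 = ["85200", "84220", "56290"]   # tool output, model version A
run2 = ["85200", "84220", "56101"]   # tool output, model version B
print(accuracy(run1, gold))          # ~0.67 accuracy vs clerical coding
print(consistency(run1, run2))       # ~0.67 agreement across versions
```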
4.2.8 - Datasets
The model is not trained on any ONS data. The solution uses the following publicly available datasets to provide context for the LLM workflow and responses:
We are using thousands of clerically coded responses from the TLFS to measure the efficacy of Survey Assist during the development phases. These data are anonymised responses to the core three questions (job/job description/industry).
4.2.9 - Dataset purposes
The publicly available SIC and SOC index datasets provide context for the LLM workflow and responses. The internal clerically coded datasets are used for evaluating the efficacy of Survey Assist and to develop personas to respond to LLM generated questions in the testing phases, to understand the impact of these questions in refining SIC classification.
Tier 2 - Data Specification
4.3.1 - Source data name
This solution uses the following publicly available datasets to provide context for the LLM workflow and responses:
We are using thousands of clerically coded responses from the TLFS to measure the efficacy of Survey Assist during the development phases. These data are anonymised responses to the core three questions (job/job description/industry). These evaluation datasets are bespoke datasets internal to ONS.
4.3.2 - Data modality
Text
4.3.3 - Data description
The publicly available datasets are the descriptions and known examples of activities coded to the Standard Industrial Classification (SIC).
Internal datasets are free-text answers to follow-up questions about people’s occupations and job industries, together with numerically coded categorical variables that assign every job and industry to a 4- or 5-digit code.
The data are stored in a cloud-based, highly-secure ONS data store.
4.3.4 - Data quantities
As discussed previously, the algorithmic tool has the potential to be run on millions of surveys in the future, but in the short term this will be limited to thousands of respondents in the testing phases. These data will be used to validate the model. The tool generates output from the LLM based on inputs, but we are not training an LLM through this process, so no data whatsoever are used to develop the underlying AI model.
4.3.5 - Sensitive attributes
Individuals are asked to detail their job, work and industry. Usually this is not disclosive, but answers may become identifiable if, for example, the job is unique like “Prime Minister” or if listed in an identifying way such as “self-employed investigator” for “J Smith Investigation, NewTown”. The SIC and SOC codes that will be assigned are publicly available information.
4.3.6 - Data completeness and representativeness
As we have not yet developed and fully tested the model, we cannot yet assess the representativeness of the data. However, our evaluation and testing plans use a wide range of TLFS data to cover representation and investigate potential biases. We are specifically looking at disproportionate representation or effectiveness in parts of the SIC coding frame, or for roles known to be held particularly by individuals with protected characteristics.
4.3.7 - Source data URL
This solution uses the following publicly available datasets to provide context for the LLM workflow and responses:
4.3.8 - Data collection
In the process of semantic search we shortlist potential candidates for SIC classification using cosine similarity; these are then passed on to the LLM.
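For reference, cosine similarity between an embedded response and an embedded classification description is the normalised dot product; a toy illustration:

```python
# Cosine similarity between two embedding vectors (toy values only).
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


resp = np.array([0.2, 0.7, 0.1])   # embedded survey response (illustrative)
desc = np.array([0.3, 0.6, 0.2])   # embedded SIC description (illustrative)
print(cosine(resp, desc))          # ~0.97: a strong candidate to shortlist
```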
Evaluation datasets use responses from a wave of the TLFS. Within this dataset, responses that may be personally identifying have been removed.
4.3.9 - Data cleaning
As the SIC and SOC indexes sometimes exceed the readability age for our survey standards, we are developing a Respondent Centred Frame that will be used to translate SIC descriptions to more user-friendly words. Survey Assist will draw on these words when creating follow-up questions.
4.3.10 - Data sharing agreements
N/A as the data are not shared (they are collected by ONS to start with) and they will not be shared outside of the Office for National Statistics.
4.3.11 - Data access and storage
The data are only accessible by ONS with the required stringent clearance, permissions and needs.
Data captured during the evaluation and research phases will be retained in a secure environment for the course of the development project. Once approved for go-live, this evaluation dataset will be stored only for as long as necessary to validate the performance of the tool. The data that will be collected via the tool will become part of the TLFS data which are kept indefinitely by the ONS as part of its legal requirements to have a time series of data on the labour force. The data will be stored in ONS’s secure data infrastructure which can only be accessed by internal staff with appropriate training, access and permissions.
Access to the data is strictly controlled. The data are stored in ONS’s highly secure data infrastructure with well-understood permissions and access restrictions to ensure that only trusted and responsible researchers with appropriate need, training and security clearance have access to the data. The access to personal data is restricted to authorised individuals who have undergone appropriate training as well as being limited only to those who absolutely must have access to the data for processing reasons. Any data that are published on the basis of the collected information are subject to rigorous statistical disclosure control which anonymises outputs. While the data are at rest in ONS they are subject to a suite of protections described by the Five Safes framework, including de-identification, physical and virtual security infrastructure, ethical assessment, security monitoring and so on. For more information see ONS’s website: https://www.ons.gov.uk/aboutus/usingpublicdatatoproducestatistics/keepingdatasafeandprotectingyourpersonalinformation
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
The likelihood of harm from the use of personal data through this LLM is restricted by the fact that the data collected by the LLM are not directly identifiable, or where they may be (if, for example, a free-text answer referred to something personal) they are protected by ONS’s usual data protection and security provisions under the Five Safes framework. A Data Protection Impact Assessment is in place for the Labour Force Survey and TLFS, though as noted the risks around this algorithmic tool are very low. The National Statistician’s Data Ethics Advisory Committee will be reviewing the tool to provide their expert assessment and recommendations to help guide the use of the algorithm.
5.2 - Risks and mitigations
The main risks with this tool are:
1. That it will inaccurately assign SIC and SOC codes to survey responses. This will be mitigated by extensive research and testing phases prior to deployment, and by planned automatic testing which will involve a periodic run-through of a clerically coded dataset to ensure consistent coding. The overall balance of SIC and SOC responses is also analysed as part of our reporting, with over- and under-represented areas investigated prior to publication.
2. That it will generate questions which are not suitable for respondents: they may be unhelpful, irrelevant or offensive. This too will be mitigated by extensive research and testing phases, and by ensuring the prompt is engineered to reflect best practice survey question design (generated within the ONS). We will also subject the tool to profanity filters. Our intention is to work towards the least variance of questions possible to give the best increase in data quality, and we are exploring throughout the course of 2025 the best way to engineer prompts to ensure this happens. We intend that AI-generated questions are identified as such within the survey flow, allowing users to skip questions if they are inappropriate.