Research and analysis

Summary of methodology

Published 30 April 2026

The UK Standard Skills Classification (SSC) was constructed in 3 distinct stages as set out in the Phase 1 report for this project ‘A Skills Classification for the UK: Plans for development and maintenance’. In each stage, Artificial Intelligence (AI) tools, in particular text embedding vector comparison and Large Language Model (LLM) evaluations, were used to validate, deduplicate and standardise multiple input datasets, with additional manual reviews to ensure accuracy, alignment and reliability of outputs.

The 3 main development stages were the creation of:

  • an Occupational Task library for the UK Standard Occupational Classification (SOC) 2020 Sub-Unit Groups (SUGs) (6-digit occupations)
  • a hierarchical classification of Occupational Skills, consisting of 4 levels, with the most detailed level also linked to a set of Core Skills
  • a library of Knowledge concepts

Mappings were then created between these 3 elements, and from them to occupations, potentially related qualifications and other existing classifications.

This section outlines the process followed during each of these 3 main stages and discusses the unexpected issues that led to deviations from the original Phase 1 plan. A more detailed description of the methodology can be found in Appendix B.

Occupational Tasks

Figure 5 outlines the development process of the SSC Occupational Task library, detailing the input libraries used, the data cleaning steps, and the validation against other information sources.

Figure 5: Development of UK SSC Occupational Tasks

This is displayed as a series of processing steps from T1 to T6 in a row across the top of the diagram with each step shown below in a flow diagram. On the left-hand side are the 4 main input libraries which feed into the first processing step: T1.

The input libraries from top to bottom are:

  • the Graduate Futures Institute (GFI) responsibilities
  • Skills England Occupational Standard duties
  • the US Occupational Information Network (O*NET) tasks
  • National Careers Service (NCS) day-to-day tasks

The processing steps show:

  • T1 ‘validate as Task Statements’
  • T2 ‘cluster by SOC SUG’
  • T3 ‘use AI to sub-cluster by meaning’
  • T4 ‘use AI to merge and deduplicate’ which has arrows pointing to 2 steps under T5
  • T5 ‘validate via SOC SUG description’ and ‘validate against job ads’ which both have arrows to step T6
  • T6 ‘Occupational Tasks’

Task statements were collected from the input libraries, then cleaned and standardised using AI tools to ensure clarity, consistency, and UK English usage.

The tasks were refined, deduplicated, and clustered using text embeddings from OpenAI models and hierarchical clustering methods. The resulting clusters of tasks were manually inspected to help merge overlapping clusters. The validation stage involved comparing these tasks to tasks extracted from SOC SUG descriptions and a database of around 8 million UK job vacancies. This ensured tasks were accurately mapped and relevant to real-world job roles.
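
The deduplication step described above can be sketched as a pairwise cosine-similarity comparison over embedding vectors. The helper names, toy 3-dimensional vectors and the 0.9 threshold below are illustrative assumptions, not the production pipeline:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def dedupe(statements, embeddings, threshold=0.9):
    """Greedy near-duplicate removal: keep a statement only if it stays
    below the similarity threshold against every statement kept so far."""
    kept, kept_vecs = [], []
    for text, vec in zip(statements, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept

# Toy vectors standing in for real embedding model output.
tasks = ["Prepare budgets", "Draft budgets", "Repair engines"]
vecs = [[1.0, 0.1, 0.0], [0.98, 0.12, 0.01], [0.0, 0.2, 1.0]]
print(dedupe(tasks, vecs))  # the two budget tasks collapse to one entry
```

In the real process the surviving clusters were still manually inspected, as noted above, rather than accepted automatically.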

The vacancy data used in this analysis is drawn from the Institute for Employment Research (IER) dataset, funded by the Department for Education (DfE) and covering the period from 2019 to 2024, with updates made by the IER in 2025.

Finally, the SSC Occupational Task library was completed, comprising 22,583 tasks linked to SSC Skills, Knowledge concepts, and Occupations.

Occupational Skills and Core Skills

Figure 6: Development of UK SSC Occupational Skills

Figure 6 shows the equivalent process for the construction of the hierarchical classification of SSC Occupational Skills together with a set of 13 Core Skills.

This is displayed as a series of processing steps from S1 to S7 in a row across the top of the diagram with each step shown below in a flow diagram. On the left-hand side are the 6 input libraries which feed into the first processing step: S1.

The input libraries from top to bottom are:

  • European Skills, Competences, Qualifications and Occupations (ESCO) Level 4 skills
  • the National Careers Service (NCS) skills
  • O*NET Detailed Work Activities (DWAs)
  • Skills England Occupational Standard skills
  • Graduate Futures Institute (GFI) skills
  • the Workforce Foresighting Hub, Innovate UK (WFH) skills

The processing steps show:

  • S1 ‘validate as Skills’
  • S2 ‘cluster by meaning’
  • S3 ‘use AI to merge and deduplicate’
  • S4 ‘map against SOC SUGs’ which has arrows pointing to 2 steps under S5
  • S5 ‘validate against Tasks’ and ‘validate against job ads’ which both have arrows to step S6
  • S6 ‘Occupational Skills’
  • S7 ‘Core Skills’

Skill statements were sourced from the input libraries.

They were then standardised (in terms of structure, specificity, capitalisation, spelling and grammar) and quality assured using AI to ensure clarity, consistency, and relevance. These were refined and clustered using OpenAI embeddings and hierarchical models and then AI-generated labels and descriptions were created for each cluster.

These cluster labels were then manually reviewed and, where necessary, modified to use consistent language patterns to create the prototype set of Occupational Skills (the lowest level of the SSC).

These Occupational Skills were then organised into a hierarchy of Skill Groups, Skill Areas, and Skill Domains, with AI prompts used to validate structure and relatedness. These were then mapped to SOC SUGs and validated against SSC Tasks. Further validation against skills extracted from the job vacancy descriptions was carried out to ensure coverage across occupations and relevance, with additional high-quality skills from the vacancy data added where gaps were found.

A separate set of 13 SSC Core Skills was defined, with AI used to create level definitions and assess proficiency across skills and occupations. The final classification includes 3,350 SSC Occupational Skills, structured into 607 Skill Groups, 106 Skill Areas, and 22 Skill Domains, linked to SSC Core Skills, Tasks, Knowledge concepts, Occupations, and Qualifications.

Occupational Knowledge Concepts

The third main stage in the development of the SSC was the development of a library of Knowledge concepts. This process is outlined in Figure 7.

Figure 7: Development of UK SSC Knowledge concepts

Figure 7 illustrates the process to develop the SSC library of Knowledge concepts.

This is displayed as a series of processing steps from K1 to K6 in a row across the top of the diagram with each step shown below in a flow diagram. On the left-hand side are the 6 main input libraries which feed into the first processing step: K1.

The input libraries from top to bottom are:

  • ESCO (European Skills, Competences, Qualifications and Occupations) knowledge concepts
  • Higher Education Coding of Subjects (HECoS)
  • Learn Direct Classification of Subject Codes (LDCSC)
  • O*NET (knowledge, tools used and technology skills)
  • Stack Exchange (topic tags)
  • Wikipedia (article titles)

The processing steps show:

  • K1 ‘validate as Knowledge concepts’
  • K2 ‘cluster by meaning’
  • K3 ‘use AI to merge and deduplicate’ which has arrows pointing to 5 steps under K4
  • K4 ‘validate versus Ofqual’, ‘validate versus Skills England’, ‘validate versus tasks’, ‘validate versus job ads’, and ‘validate versus prototype’ which each have arrows to step K5
  • K5 ‘identify primary concepts’
  • K6 ‘Occupational Knowledge’

The Knowledge concept, subject and topic names were collected from the input libraries.

These were cleaned and filtered using AI tools to retain only concepts evidenced in a UK context. The concepts were then refined and grouped by meaning using embeddings and clustering methods. The validation steps involved mapping these knowledge concepts to external sources including Ofqual, Skills England Occupational Standards, SOC SUGs, SSC Tasks, and UK job vacancy data, ensuring relevance and common usage. Embedding matches were also used to link SSC Skills to Knowledge concepts and assess their importance. Primary concept types and related terms were identified, resulting in a final set of 5,056 SSC Knowledge concepts linked to SSC Tasks, Skills, and subject classifications.

Once these 3 main development stages were complete (SSC Occupational Tasks, SSC Occupational Skills, and SSC Knowledge concepts), AI tools were used to create mappings to other classifications and data sources and to create different groupings of the skills, such as Science, Technology, Engineering, Mathematics, Medicine and Health (STEM-M&H) skills, Green skills and Digital skills.

Unexpected issues

Data inputs

The Specialist Tasks from the Australian Skills Classification (ASC) were planned as an input for the skills library, but the dataset was withdrawn in early 2024 by Jobs and Skills Australia (JSA) as part of a plan to replace the ASC with a National Skills Taxonomy. The JSA cited issues with connectivity to education contexts. Given that the ASC content was originally derived from the O*NET Detailed Work Activity framework which is already included as an input library to the SSC, the decision was taken to exclude ASC Specialist Tasks from the development process.

Data was requested from LinkedIn to supplement other inputs (especially for the Knowledge concept library) but, unfortunately, they were unable to provide access to the level of data required. To improve coverage of newer subjects and Knowledge concepts (such as those related to AI) we included technologies and techniques identified from several other sources such as Innovate UK Workforce Foresighting Hub challenge cycles.

Output validation

The initial design report included a step to validate the classification against a large CV library but licensing this, or an up-to-date equivalent, proved prohibitively complex and expensive.

Artificial Intelligence tool advice and guidance

The speed and scale of SSC content development and validation would not have been possible without the recent AI-driven advances in natural language processing and generation tools. There are however still limitations and nuances in using these tools and the following guidance is therefore offered to help those attempting similar work.

Text-embeddings

These are generated by machine learning models that convert words, sentences, or documents into numerical vectors (embeddings) that capture their semantic meaning. These embeddings enable computers to identify, compare and cluster text based on underlying meaning rather than just exact word matches.

Embedding models

There are several embedding models available (both commercial and non-commercial), but the text-embedding-3-large model from OpenAI was found to be the most useful and reliable for this project.

Short text-string embeddings

Embedding comparison scores (typically cosine similarity) are generally less reliable for short phrases, especially ambiguous ones (such as ‘Interpret communication using NLP’, which could refer to either ‘Natural Language Processing’ or ‘Neuro-linguistic Programming’). Concatenating labels with hyphen-separated descriptions can be a cost-effective way of mitigating this issue, provided that the descriptions are accurate and unambiguous.
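
The label-plus-description concatenation is a trivial transformation, sketched below; the `embedding_input` helper and the example description are hypothetical:

```python
def embedding_input(label, description=""):
    """Build the text sent to the embedding model: a short, possibly
    ambiguous label is concatenated with a hyphen-separated description
    so the intended meaning (e.g. which 'NLP') is explicit."""
    return f"{label} - {description}" if description else label

print(embedding_input("Interpret communication using NLP",
                      "Apply Natural Language Processing to analyse text"))
```

The resulting string, rather than the bare label, is then passed to the embedding model.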

Data cleaning

Input data cleaning and standardisation matter because statements that are inconsistently capitalised, punctuated and structured will not match as reliably. For example, Table 4 shows the embedding vector cosine similarity scores of 3 phrases against the skill label ‘Provide advice on trademarks’.

Table 4: Embedding similarity scores to phrase: ‘Provide advice on trademarks’

ID number | Difference | Comparison phrase | Embedding similarity score
ID 1 | Statement reworded | ‘Consult with clients about trademark issues’ | 0.78
ID 2 | Statement reformatted (capitalisation, grammar and trailing spaces) | ‘consult with clients  about  trademark issues. ’ | 0.65
ID 3 | Statement with different meaning | ‘Provide advice on trade controls’ | 0.67

Statement ID 1 is reworded but is syntactically correct and consistent and therefore has a fairly high similarity score of 0.78.

In contrast, statement ID 2 is formatted differently and includes character anomalies such as trailing spaces, which add noise to the match, resulting in a score of 0.65.

Statement ID 3, which has a different meaning (trade controls being a distinct concept from trademarks), has a match score of 0.67. This means that, without data cleaning, the original statement would be incorrectly evaluated as a closer match to statement ID 3 than to statement ID 2.
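
The kind of normalisation that prevents this mis-ranking can be sketched as below; `clean_statement` is a hypothetical helper and the exact rules shown are illustrative assumptions:

```python
import re

def clean_statement(text):
    """Normalise a task or skill statement before embedding: collapse
    internal whitespace, strip surrounding spaces and a trailing full
    stop, and capitalise the first letter."""
    text = re.sub(r"\s+", " ", text).strip()
    text = text.rstrip(".")
    return text[:1].upper() + text[1:] if text else text

print(clean_statement("consult with clients  about  trademark issues. "))
# → "Consult with clients about trademark issues"
```

Applied to statement ID 2 from Table 4, this removes the formatting noise so that only genuine differences in meaning affect the similarity score.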

Large Language Model (LLM) prompts

These are structured data requests directed at Large Language Model (LLM) APIs to process large volumes of data in a consistent way.

Language model selection

Even over just the 2-year duration of this project, the performance improvements in LLMs have been remarkable. Progress has not, however, been entirely convergent, and different LLMs still perform significantly better at some tasks than others. For the SSC, OpenAI models have tended to perform best when compared against others, especially at tasks requiring the assignment of match ‘scores’ (such as task-to-skill importance or relatedness).

Prompt design and validation

Despite increasing context windows (i.e. the amount of text or data that can be included in an LLM prompt) there is still a balance to be struck with the number and complexity of instructions included. This is because longer lists of rules or criteria seem to increase the likelihood of some instructions being omitted or ignored. To find this balance, practitioners can use a randomised sample to validate output coverage while also including specific examples to test for known difficulties. Rerunning the prompt against the same dataset can also usefully reveal inconsistencies or ambiguities in both inputs and outputs.
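
The sample-and-rerun validation described above can be sketched as follows; `run_prompt` and the stub below are hypothetical stand-ins for a real LLM API call:

```python
import random

def rerun_check(items, run_prompt, sample_size=3, seed=0):
    """Run the same prompt twice over a random sample and return the
    items where the two runs disagree; disagreement usually signals an
    ambiguous instruction or input."""
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    return [item for item in sample
            if run_prompt(item) != run_prompt(item)]

# Deterministic stub: always labels a given statement the same way.
stable = lambda s: "skill" if s.startswith("Use") else "not a skill"
print(rerun_check(["Use spreadsheets", "Teamwork", "Use CAD tools"], stable))
# a deterministic prompt yields no disagreements: []
```

In practice the second run would hit the live API, so any items returned point to instructions (or inputs) worth tightening before a full-scale run.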

Generative inconsistency and bias

LLMs tend to have a bias toward generating US English. This tendency can be mitigated by explicitly instructing the model to use UK English terms and spelling. Using a UK English dictionary to spell-check the output is also recommended.
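
A minimal sketch of the post-hoc spelling pass, assuming a tiny illustrative US-to-UK word list (a real pipeline would use a full UK English dictionary):

```python
import re

US_TO_UK = {"analyze": "analyse", "organization": "organisation",
            "color": "colour"}  # tiny illustrative list, not exhaustive

def ukify(text):
    """Replace known US spellings with UK equivalents, preserving an
    initial capital letter."""
    def repl(match):
        word = match.group(0)
        uk = US_TO_UK.get(word.lower())
        if uk is None:
            return word
        return uk.capitalize() if word[0].isupper() else uk
    return re.sub(r"[A-Za-z]+", repl, text)

print(ukify("Analyze the color of the organization logo"))
# → "Analyse the colour of the organisation logo"
```

This complements, rather than replaces, the explicit UK English instruction in the prompt itself.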

Statement or label categorisation

For content quality or format checks (such as ‘does this statement describe a skill?’) LLMs perform more consistently when asked to apply a pre-defined categorisation framework e.g. ‘Code #3: Too generic - This statement is too generic and isn’t describing a specific skill.’ or ‘Code #4: Invalid - This statement does not describe a skill and is instead a tool, subject, attitude or outcome’. For a full example of this approach, see the first prompt given in Appendix B.
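
Assembling such a prompt from a pre-defined code frame can be sketched as below; codes #1 and #2 are illustrative additions, while #3 and #4 follow the examples above:

```python
SKILL_CODES = {
    1: "Valid - This statement describes a single, specific skill.",
    2: "Too broad - This statement combines multiple distinct skills.",
    3: ("Too generic - This statement is too generic and isn't "
        "describing a specific skill."),
    4: ("Invalid - This statement does not describe a skill and is "
        "instead a tool, subject, attitude or outcome."),
}

def build_prompt(statement):
    """Assemble a categorisation prompt from the pre-defined code frame
    so the LLM applies the same rubric to every statement."""
    rules = "\n".join(f"Code #{num}: {text}"
                      for num, text in SKILL_CODES.items())
    return ("Classify the statement below using exactly one code.\n"
            f"{rules}\n"
            f"Statement: {statement}\n"
            "Answer with the code number only.")

print(build_prompt("Teamwork"))
```

Constraining the answer to a code number also makes the outputs trivial to parse and audit at scale.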

Concept tagging

For mappings (such as tasks to skills) LLMs still struggle to categorise or tag statements from long lists of options. A better approach is to compare a text-embedding of an input (such as a task statement) against text-embeddings for all classification concepts or tags (such as skill labels) to generate a longlist of potential matches. An LLM prompt can then be used to iterate through and evaluate the potential matches one at a time. An example of this approach is provided in the second prompt in Appendix B.
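
The two-stage approach (embedding longlist, then per-candidate LLM evaluation) can be sketched as follows; the vectors, skill labels and `evaluate` stub are hypothetical stand-ins:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def shortlist(task_vec, skill_vecs, top_n=2):
    """Stage 1: rank every skill label by similarity to the task
    embedding and keep only a longlist of the strongest candidates."""
    ranked = sorted(skill_vecs,
                    key=lambda lbl: cosine(task_vec, skill_vecs[lbl]),
                    reverse=True)
    return ranked[:top_n]

def tag_task(task_vec, skill_vecs, evaluate, top_n=2):
    """Stage 2: pass each shortlisted candidate to the LLM one at a
    time; `evaluate` stands in for the per-candidate prompt."""
    return [lbl for lbl in shortlist(task_vec, skill_vecs, top_n)
            if evaluate(lbl)]

# Toy embeddings standing in for real model output.
skills = {"Prepare financial budgets": [1.0, 0.0],
          "Repair vehicle engines": [0.0, 1.0],
          "Manage project budgets": [0.9, 0.1]}
task = [0.95, 0.05]  # e.g. "Draft the annual departmental budget"
print(shortlist(task, skills))
```

Because the LLM only ever sees one candidate at a time, it avoids the option-list length problem while the embedding stage keeps the number of LLM calls manageable.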

In summary, AI tools have proved invaluable in the development of the SSC, but they are not entirely reliable and so (both manual and AI) controls and checks are needed at every stage to ensure that outputs conform to requirements.

Prototype content piloting

To extend validation of prototype content, we collaborated on a number of pilot projects.

Built Environment – Smarter Transformation (BE–ST) – AI PathwaysPro

Overview

This was in the context of a broader project focused on low-carbon upskilling, and specifically Passivhaus, Retrofit and associated digital skills. The pilot evaluated the extent to which the SSC prototype datasets could help map between existing datasets (e.g. NOS and Skills England Occupational Standards) and create or refine related competency-based upskilling profiles and associated pathways. In addition, it assessed how the SSC datasets could inform the development of an ontology of the Retrofit domain as part of the AI PathwaysPro project led by Dynamic Knowledge & Intelartes Ltd.

Key findings and corrective actions

The evaluation found useful links to relevant Passivhaus and Retrofit concepts but also a number of gaps. New knowledge areas such as ‘Building airtightness’ and ‘Mould health issues’ were added to the SSC knowledge classification along with several new tasks related to Retrofit activities.

Workforce Foresighting Hub

Overview

This involved the use of SSC prototype content in the analysis and design of workforce transitions across 12 foresighting cycles, including localised vaccine manufacture, nuclear waste processing and automated welding. The SSC task library, skills and knowledge concepts were all used to help define current and future state job profiles and enable comparisons against current training provision.

Key findings and corrective actions

The skills classification performed well and largely met the user need without custom additions. There were however several knowledge areas that were missing key concepts (e.g. Recombinant DNA and Small Modular Reactors (SMRs)). As a result, these were added to the knowledge classification and associated mappings.

UK Retail Bank

Overview

This was in the context of graduate recruitment and, in particular, an attempt to broaden the educational and gender diversity of applicants into digital roles. The project involved an analysis of graduate role profiles within the bank and matching these via the SSC-HECoS mapping to undergraduate degree programmes at Common Aggregation Hierarchy level 3 (CAH3).

Key findings and corrective actions

The analysis confirmed that there was sufficient alignment between the graduate roles and the SSC to help identify diverse graduate applicant pools. For example, history graduates had many of the skills required by data analyst roles, psychologists were well aligned to the data science stream and marketing graduates were matched to roles around delivery coordination and management.

DWP - Jobs and Careers Service (JCS)

Overview

This was a small pilot to evaluate the potential to use the SSC to help generate standardised skills profiles from CVs. A process was developed to first use LLMs to extract and convert work history information (such as tasks and skills) from a set of example CVs. The extracted information was then used to identify potential matches to SSC Occupational Skills, followed by an AI-based evaluation to quantify the relative strength of these matches and an aggregation process to generate individualised skill profiles.

Key findings and corrective actions

Full output validation was unfortunately not possible, but the face validity of the outputs was high. As one key learning, tasks identified as having potential SSC skill matches should not be evaluated individually by LLMs, but rather as a whole list of tasks against all potential SSC skill matches for a related role. This prevents similar task statements causing skills to be overweighted in the final skills profile.

Creative Industries Policy and Evidence Centre (Creative PEC)

Overview

This was an evaluation of the extent to which the SSC could help capture and codify the skills identified as shortages, gaps, or in need of future upgrading as part of the Creative PEC / Work Advance led Creative Industries Skills Audit.

Key findings and corrective actions

The fully automated (AI-led) match approach yielded some interesting learnings on operationalising the classification (e.g. via employer surveys, data cleaning, AI matching etc) but was not deemed sufficient on its own to evaluate the classification. A manual analysis however concluded that the SSC is generally ‘very well-aligned’ to the skills demands of the Creative Industries. The exercise identified an opportunity to refine the SSC in places, to more fully capture emerging skills areas, sector-specific technologies and highly specialised technical skills, with the Design Council providing detailed feedback, for example, to strengthen the representation of sustainable design within the new classification.

Other prototype content feedback

Detailed feedback, mainly captured using the UK Skills Explorer feedback interface from a range of organisations and professional bodies including Enginuity and The Royal Society of Chemistry, was also used to refine both content definitions and mappings for the SSC. Given that this is an ongoing process, a cut-off date of 17 April 2026 was established for revisions to be incorporated within SSC Version 1.0.