Appendix B: full methodology
Published 27 November 2025
Throughout the process of creating the UK Standard Skills Classification (SSC), we relied heavily on Artificial Intelligence (AI), particularly Large Language Models (LLMs). These were mostly OpenAI models (always the best available model at the time), but we also used open-source Llama models for some of the more data-intensive tasks that would have been too costly using OpenAI models. At each stage, we manually reviewed exceptions and inspected the outputs from the AI models, sometimes having to check large amounts of data.
A central part of our AI approach was the use of text embeddings. These are numerical representations (vectors) of text that allow computers to capture the semantic meaning of a piece of text. Text embeddings are used to cluster text and to compare the meanings of text strings. To understand how related 2 text strings are to each other, we calculate the distance between their vectors using cosine similarity. The larger the score, the closer the 2 text strings are in semantic meaning. This method of comparing text strings was used extensively throughout the project. Early in the project, we experimented with different embedding models and decided to use OpenAI 3-Large embeddings for most purposes.
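As an illustration, cosine similarity between 2 embedding vectors can be computed as follows. This is a minimal sketch using NumPy; the short 4-dimensional vectors are invented stand-ins for real embeddings (OpenAI 3-Large embeddings have 3,072 dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between 2 embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real text embeddings.
v1 = np.array([0.1, 0.3, 0.5, 0.2])
v2 = np.array([0.1, 0.28, 0.52, 0.19])

print(round(cosine_similarity(v1, v2), 3))  # close to 1, so similar in meaning
```

A score near 1 indicates the 2 text strings are close in semantic meaning; unrelated strings score much lower.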
AI was also used in other ways in the project. We used AI prompts at many stages of the project, for example to quality assure skill and task statements and to detect inconsistencies and errors in mappings. Designing and refining prompts often involved several iterations as we learnt the best ways to interact with the AI to get the desired results.
The steps below outline the main stages in the creation of the SSC.
Tasks
Figure 8: Development of SSC Occupational Tasks
Figure 8 outlines the development process of the SSC Occupational Task library, detailing the input libraries used, the data cleaning steps, and the validation against other information sources.
This is displayed as a series of processing steps from T1 to T6 in a row across the top of the diagram with each step shown below in a flow diagram. On the left hand side are the 4 input libraries which feed into the first processing step: T1.
The ‘T’ prefix for each step ID (such as T1) relates to ‘Task’. For similar processes shared in later sections an ‘S’ prefix relates to ‘Skills’ and a ‘K’ prefix to ‘Knowledge’.
The input libraries from top to bottom are:
- the Association of Graduate Careers Advisory Services (AGCAS) responsibilities
- the Institute for Apprenticeships and Technical Education (IfATE) duties
- the US Occupational Information Network (O*NET) tasks
- National Careers Service (NCS) day-to-day tasks
The processing steps show:
- T1 ‘validate as Task Statements’
- T2 ‘cluster by SOC Ext’
- T3 ‘use AI to sub-cluster by meaning’
- T4 ‘use AI to merge and deduplicate’ which has arrows pointing to 2 steps under T5
- T5 ‘validate via SOC Ext description’ and ‘validate against job ads’ which both have arrows to step T6
- T6 ‘Occupational Tasks’
T1: Process and validate inputs
Task statement libraries were obtained from AGCAS (responsibilities), IfATE (duties), O*NET (tasks), and the National Careers Service (day-to-day tasks). These libraries were then cleaned and standardised using AI tools. AI tools were used to quality assure the task statements and correct those that were too generic, too specific, too wordy, incorrectly structured, compound, or not tasks at all. The quality assurance process also converted US spellings and phrasing to UK English.
T2 – T4: Refine, deduplicate and cluster
Text embeddings were generated using 2 models, OpenAI 3-Large and MP-Net (a Bidirectional Encoder Representations from Transformers (BERT) model), and a variety of clustering models were then tested and compared to remove duplicate and similar tasks. OpenAI 3-Large embeddings combined with a hierarchical clustering model produced the best results.
Clustered tasks were sorted by meaning (based on embeddings) to identify overlapping and close clusters and merged through manual inspection. Orphan clusters (those containing only one task) were integrated with multi-task clusters using results from other clustering and embeddings models.
The centroid task statement within each cluster was identified and became the task label. These cluster labels then became the initial version of the SSC Task library.
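The clustering and centroid-labelling steps in T2 to T4 can be sketched as follows. This is a simplified illustration using SciPy's hierarchical clustering: the task statements are invented, 2-dimensional vectors stand in for real embeddings, and the distance threshold is not the value used in the project:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Invented task statements with toy embeddings; similar rows
# represent tasks with similar meanings.
tasks = ["Prepare budgets", "Draft budgets", "Plan budgets",
         "Repair engines", "Service engines", "Overhaul engines"]
emb = np.array([[0.90, 0.10], [0.88, 0.12], [0.91, 0.09],
                [0.10, 0.90], [0.12, 0.88], [0.09, 0.91]])

# Agglomerative (hierarchical) clustering on cosine distance.
labels = fcluster(linkage(emb, method="average", metric="cosine"),
                  t=0.05, criterion="distance")

# Within each cluster, the statement nearest the mean embedding
# (the centroid) becomes the cluster label.
for c in sorted(set(labels)):
    idx = np.where(labels == c)[0]
    centroid = emb[idx].mean(axis=0)
    best = idx[np.argmin(np.linalg.norm(emb[idx] - centroid, axis=1))]
    print(c, tasks[best])
```

With this toy data the 6 statements collapse into 2 clusters, each labelled by its centroid task statement.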
T5: Validate against other sources
SOC SUGs
Tasks were extracted from all SOC SUG descriptions (except n.e.c. groups ending /99) using the Llama3 LLM and then embeddings were created to enable matching to the SSC Tasks. The similarity between the SUG description task embeddings and the SSC Task embeddings was calculated to provide a numerical score representing the degree of similarity. The best matching SSC Task for each SUG description task was identified so that SSC Tasks were assigned to all relevant SOC SUGs. Potential Task to SUG matches were also identified via an analysis of existing job profiles such as those within O*NET where associated task statements appear in clusters used to derive SSC Tasks.
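The best-match step can be sketched as a matrix of cosine similarities between the 2 sets of embeddings, taking the highest-scoring SSC Task per extracted task. This is a minimal illustration with invented low-dimensional vectors in place of real embeddings:

```python
import numpy as np

# Toy embedding matrices: rows of sug_tasks are tasks extracted from
# SUG descriptions; rows of ssc_tasks are SSC Task embeddings.
sug_tasks = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]])
ssc_tasks = np.array([[0.88, 0.12, 0.0], [0.5, 0.5, 0.0], [0.0, 0.25, 0.88]])

def normalise(m):
    # Unit-length rows so that the dot product equals cosine similarity.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity of every SUG task against every SSC Task,
# then the best-scoring SSC Task per extracted task.
sims = normalise(sug_tasks) @ normalise(ssc_tasks).T
best = sims.argmax(axis=1)
print(best)  # index of the closest SSC Task for each extracted task
```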
Further AI prompts were used to check the combined mappings and estimate the relatedness of these tasks to the associated SUGs. Significant discrepancies between a legacy mapping and the AI assessment (such as a task with a high level of importance within an O*NET profile but rejected by the AI analysis) were manually checked and reconciled.
Vacancy data
The Institute for Employment Research (IER) holds a large vacancy database which is coded to SOC SUGs. A sample of distinct vacancy descriptions was created, with a maximum of 200 vacancies per SUG. The sample was selected from vacancies with longer job descriptions and those that were well coded to each SUG.
Llama3 was used to extract tasks from this database of vacancy descriptions and then the tasks were quality assured, clustered, and embeddings created using a similar process to the creation of the task library (T2-T4). These embeddings were then compared to the SSC Task embeddings. Vacancy tasks that were quality assured as being ‘good’ tasks but had a low similarity score to an existing SSC Task were manually inspected to identify any tasks that should be added to the SSC Task library.
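The coverage check can be sketched as flagging vacancy tasks whose best cosine similarity to any SSC Task falls below a threshold. The 0.5 threshold and all vectors here are illustrative; the document does not state the value used:

```python
import numpy as np

# Toy embeddings: rows of vacancy are tasks extracted from vacancy
# descriptions; rows of ssc are existing SSC Task embeddings.
vacancy = np.array([[0.9, 0.1, 0.0], [0.0, 0.1, 0.95]])
ssc = np.array([[0.88, 0.12, 0.0], [0.5, 0.5, 0.1]])

def normalise(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Best similarity of each vacancy task to any SSC Task; tasks below the
# threshold have no close match and are flagged for manual inspection
# as candidate additions to the SSC Task library.
best_sim = (normalise(vacancy) @ normalise(ssc).T).max(axis=1)
flagged = np.where(best_sim < 0.5)[0]
print(flagged)  # vacancy tasks with no close SSC Task match
```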
The database of vacancy tasks was also used to identify additional tasks for SUGs with no or low numbers of associated tasks and similarly for SSC Skills with no linked SSC Tasks.
T6: Final SSC Occupational Tasks
The final list of SSC Tasks consists of 21,963 tasks. These are linked to SSC Skills and Knowledge concepts and Occupations.
Skills
Figure 9: Development of SSC Occupational Skills
Figure 9 shows the equivalent process for the construction of the hierarchical classification of SSC Occupational Skills together with a set of 13 Core Skills.
This is displayed as a series of processing steps from S1 to S7 in a row across the top of the diagram with each step shown below in a flow diagram. On the left hand side are the 6 input libraries which feed into the first processing step: S1.
S1 to S7 refer to each processing step in the creation of the Occupational Skills library.
The input libraries from top to bottom are:
- European Skills, Competences, Qualifications and Occupations (ESCO) Level 4 skills
- the National Careers Service (NCS) skills
- O*NET Detailed Work Activities (DWAs)
- IfATE skills
- AGCAS skills
- the Workforce Foresighting Hub, Innovate UK (WFH)
The processing steps show:
- S1 ‘validate as Skills’
- S2 ‘cluster by meaning’
- S3 ‘use AI to merge and deduplicate’
- S4 ‘map against SOC Ext SUGs’ which has arrows pointing to 2 steps under S5
- S5 ‘validate against Tasks’ and ‘validate against job ads’ which both have arrows to step S6
- S6 ‘Occupational Skills’
- S7 ‘Core Skills’
S1: Process and validate inputs
Skills statement libraries were obtained from AGCAS (Skills), ESCO (Level 4 skills), IfATE (skills), Innovate UK Workforce Foresighting Hub (skills), O*NET (Detailed Work Activities) and the National Careers Service (skills). These libraries were cleaned and standardised using AI tools. AI tools were again used to quality assure the skill statements and correct any that were too generic, incorrectly structured, compound, invalid, elementary, ambiguous, traversal, or too specific.
The text below shows an example of a prompt used to quality assure skill statements:
prompt_text="""
A good occupational skill label complies with all of the following criteria:
1. It describes a skill that requires significant training and practice to acquire.
2. It describes a skill and not an attitude or outcome. For example, ‘maintaining a positive outlook’ or ‘Ensuring customer satisfaction’ would therefore not qualify as occupational skills.
3. It describes a skill that is developed and not innate. For example, ‘a good sense of smell’ is not a skill although “Smelling foods and ingredients to evaluate quality” is.
4. It begins with an action-based verb followed by a specific noun (i.e. describes something being actively done to an object).
5. It is no more than nine words long (and ideally between three and six).
6. It is unambiguous (meaning it describes a specific skill and couldn’t be misinterpreted as something else)
7. It describes a specialist skill and therefore is only relevant to a subset of jobs. For example, “supervise workers” is too broad
8. It describes a skill that is broad enough to be relevant to or transferable between multiple jobs but not overly generic
Examples of good occupational skill labels include:
1. Install heat pumps
2. Administer standardised psychological tests
3. Manage software development projects
4. Read musical scores
5. Inspect aircraft to check airworthiness
6. Design relational database schemas
Quality Evaluation Category Codes, Category Names and Rewriting Guidance:
For evaluation and, where necessary, editing, occupational skill labels can be classified into one or more of the following categories:
- Good - This label meets all the criteria
- Compound - This describes multiple skills. It needs to be split into multiple skill labels, one per different skill.
- Too Generic - This is too generic and isn’t describing a specific skill.
- Invalid - This does not describe a skill and is instead a tool, subject, attitude or outcome. It needs to be removed.
- Too Complex – The vocabulary used to define the skill is verbose and unnecessarily difficult to read. It needs to be simplified.
- Disordered – This label does not follow the verb-noun sequential format. It needs to be rewritten to present the information in this order.
- Elementary - This is an unskilled or very low-skilled activity
- Ambiguous - This label could represent two totally different skills
- Traversal - This is a skill that is very broad and is required in a wide variety of unrelated job roles
- Too Specific - This is a skill that is too specialised and only relevant to a specific part of one job
Quality Evaluation Category Examples:
Examples of skill labels assigned to the various evaluation categories (some examples may belong to more than one category)
- Good – “Administer standardised psychological tests.”
- Compound – “Design, administer & interpret standardised psychological tests.”
- Too Generic – “Analyse data.”
- Invalid – “Stay positive.”
- Too Complex – “Apply research ethics and scientific integrity principles in research activities.”
- Disordered – “Safe working Practices: Meet legal, industry and organisational requirements.”
- Elementary - “Fill kettle with water” or “Pass dental instruments.”
- Ambiguous - “Conduct pipeline analysis” (this is ambiguous as it could refer to an oil or data pipeline)
- Traversal - “Think analytically”
- Too Specific - “Repair vehicles with fuel-injection problems”
With this context, please evaluate the occupational skill labels in the provided list of tuples (containing the statement_id and statement_text) and assign each one to one or more of the Evaluation Category codes.
Next step:
Rewrite each statement by applying the rewriting guidance for all of its category codes as well as using the original criteria for good occupational skill labels and examples of good skill labels provided.
For example, a code 2 (Compound) statement should be split into two distinct skill labels.
If the original statement does not contain enough information to apply the guidance properly then instead assign a label “Insufficient content to rewrite”.
Finally, return a json list of dictionaries (one dictionary per record) containing (in the following order):
1) Statement_id:
2) Statement_text:
3) Evaluation_categories: A comma separated list of the Evaluation Category codes and their corresponding names
4) Statement_refined: The rewritten statement or statements or the label “Insufficient content” (*If there is more than one statement, these should be separated by the “#” character.)
"""
Please note that this prompt was developed in May 2024 and used with the LLM model OpenAI gpt-4o. Current LLMs are significantly more capable and the prompt could be improved (quite possibly by an LLM) to produce better results. Use of this exact prompt is therefore not recommended.
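For illustration, a response in the format the prompt requests could be parsed as follows. The response shown is hypothetical, not actual model output; the field names match those specified in the prompt:

```python
import json

# Hypothetical LLM response following the requested output format.
response = '''[
  {"Statement_id": 101,
   "Statement_text": "Design, administer & interpret standardised psychological tests",
   "Evaluation_categories": "Compound",
   "Statement_refined": "Design standardised psychological tests#Administer standardised psychological tests#Interpret standardised psychological tests"}
]'''

records = json.loads(response)
for r in records:
    # Compound statements are split into one refined label per skill,
    # separated by the "#" character as the prompt requests.
    refined = r["Statement_refined"].split("#")
    print(r["Statement_id"], refined)
```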
S2 - S3: Refine, deduplicate and cluster
OpenAI 3-Large embeddings were created and a hierarchical clustering model was used to deduplicate and refine the library of skills.
Skill clusters were then sorted by meaning to identify overlapping clusters and these were manually inspected for inclusion or deletion. AI prompts were used to analyse the consistency of the skill clusters and generate a new skill label to best describe the cluster of skills (rather than using the centroid skill as the label).
The verbs in the skill labels were standardised and became the SSC Skills.
AI tools were used to write a description of the SSC Skill label and then a further prompt identified any ambiguous skills labels and descriptions which were rewritten.
Create Skill Groups, Areas and Domains
The SSC Skills were clustered to create Skill Groups and parent or child overlaps were manually checked. An AI prompt was used to check the SSC Skills within each Skill Group and identify any overlapping Skill Groups.
The Skill Groups were then clustered to create Skill Areas and the language of the Skill Groups and Skill Areas was standardised. An AI prompt was used to check the skills in each Skill Area and return a skill relatedness score.
The Skill Areas were then mapped to Skill Domains and an AI prompt used to check SSC skills within Skill Domains.
S4: Map against SOC SUGs
A mapping was created from SOC SUGs to SSC Skills based primarily on the occupational mappings in the input skill libraries.
AI Prompts were run to review the SSC Skills associated with each SUG.
Example prompt:
"""
You are a skills analyst and need to check whether a list of skills have been correctly matched to a specific job
To do this you will be given a list object that contains:
1) A job_id
2) A job title and description (hyphen separated)
3) A list of ; separated tuples containing a skill_id and skill_label
For example:
[1132/02,’Sales directors - Sales directors are responsible for overseeing all sales operations for an organisation or business’,(12553;Supervise sales staff);(30110;Analyse sales data);(968;Set sales targets);(5798;Direct sales activities)]
For each of these job_skill lists please therefore:
1) Evaluate each skill_label and assign a % probability that the skill it describes is required to perform the job that it has been associated with.
2) Return a json list of dictionaries (one dictionary per job-skill pair) containing (in the following order):
1) job_id:
2) skill_id:
3) probability: (a % value as an integer e.g. 70)
"""
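For illustration, the returned probabilities could be parsed and filtered as follows. Both the response and the 50% cut-off are hypothetical; the document does not state the threshold applied at this step:

```python
import json

# Hypothetical response in the format the prompt requests: one dictionary
# per job-skill pair with a % probability that the skill applies.
response = '''[
  {"job_id": "1132/02", "skill_id": 12553, "probability": 95},
  {"job_id": "1132/02", "skill_id": 30110, "probability": 88},
  {"job_id": "1132/02", "skill_id": 968, "probability": 30}
]'''

# Illustrative 50% cut-off: keep only the skills the model judges
# likely to be required for the job.
kept = [r for r in json.loads(response) if r["probability"] >= 50]
print([r["skill_id"] for r in kept])  # skills retained for this SUG
```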
Weighted frequencies were then calculated to show how SUGs relate to SSC Skill Groups and SSC Skill Areas.
S5: Validate against other sources
SSC Tasks
The SSC Skills embeddings were compared to SSC Tasks embeddings to identify links between them. This mapping was then checked using an AI prompt and a further prompt defined the importance score of the SSC Skill to the SSC Task.
Vacancy data
Following a similar process to the validation of tasks using vacancy data, skills were extracted from a sample of vacancy descriptions using Llama3. These were quality assured using AI, embeddings were created, and the vacancy skills were clustered first within each SUG and then across all SUGs. The centroid embedding within each cluster became the vacancy skill label. These labels were then compared to the SSC Skills embeddings to check coverage, and any vacancy skills quality assured as being of good quality but with a low similarity score to the SSC Skills were inspected for inclusion.
S6: Final SSC Occupational Skills
The set of SSC Skills consists of a hierarchy of 3,343 Occupational Skills, 606 Skill Groups, 106 Skill Areas and 22 Skill Domains. These are linked to SSC Core Skills, SSC Tasks and Knowledge concepts, occupations and courses.
S7: Core Skills
The Skills Builder Partnership’s essential skill concepts were considered and a list of 13 SSC Core Skills, with definitions, was then drawn up.
AI prompts were used to help create definitions for each of the 5 skill levels of each SSC Core Skill and then to evaluate the level of Core Skill proficiency in each SSC Skill and each SOC SUG. Several AI models were used in this step to try to attain the best and most consistent results.
Knowledge
Figure 10: Development of SSC Knowledge concepts
Figure 10 illustrates the process to develop the SSC library of Knowledge concepts.
K1 to K6 refer to each processing step in the creation of the Occupational Knowledge library.
The input libraries from top to bottom are:
- ESCO (European Skills, Competences, Qualifications and Occupations) Knowledge concepts
- Higher Education Coding of Subjects (HECoS)
- Learn Direct Classification of Subject Codes (LDCSC)
- O*NET (knowledge, tools used and technology skills)
- Stack Exchange (topic tags)
- Wikipedia (article titles)
The processing steps show:
- K1 ‘validate as Knowledge concepts’
- K2 ‘cluster by meaning’
- K3 ‘use AI to merge and deduplicate’ which has arrows pointing to 5 steps under K4
- K4 ‘validate versus Ofqual’, ‘validate versus IfATE’, ‘validate versus tasks’, ‘validate versus job ads’, and ‘validate versus prototype’ which each have arrows to step K5
- K5 ‘identify primary concepts’
- K6 ‘Occupational Knowledge’
The Knowledge concept, subject and topic names were collected from the input libraries.
K1: Process and validate inputs
Knowledge libraries were obtained from ESCO (Knowledge), HECoS (Higher Education Coding of Subjects), LDCSC (Learn Direct Classification of Subject Codes), O*NET (Knowledge, Tools Used & Technology Skills), Stack Exchange (Topic Tags) and Wikipedia (Article Titles). These were cleaned and standardised using AI tools. The list of Knowledge concepts was checked for any matching or equivalent terms and then filtered to only include concepts that were evident within a UK context.
K2 – K3: Refine, deduplicate and cluster
Knowledge concepts were clustered by meaning using embeddings and further deduplicated using clustering methods.
K4: Validate against other sources
Ofqual
Up to 50 potential matches per qualification were identified by comparing a text embedding vector of a concatenated text string of each qualification title and its associated qualification units against a text embedding for each SSC Knowledge concept label.
Text embedding vectors were generated using the OpenAI 3-Large Model with a cosine-similarity match threshold of 0.3 being applied. Matches above this threshold were then evaluated by prompting an LLM (gpt-4.1-2025-04-14) with a simplified text string for each qualification (its simplified title and up to 5 example qualification units) to validate each match and also, where appropriate, assign a percentage probability that “a significant amount of knowledge in that area would be learnt by achieving that qualification”. Following a sample inspection, matches assigned a match probability score below 50% were rejected.
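This matching step can be sketched as follows. Random vectors stand in for real embeddings here; only the 0.3 cosine-similarity threshold and the cap of up to 50 matches come from the text:

```python
import numpy as np

# Stand-in embeddings: one row per SSC Knowledge concept, plus one
# vector for a qualification (title concatenated with its units).
rng = np.random.default_rng(0)
concepts = rng.normal(size=(200, 8))
qual = rng.normal(size=8)

def normalise(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Cosine similarity of the qualification against every concept.
sims = normalise(concepts) @ normalise(qual)

# Keep up to 50 candidates, best first, then apply the 0.3 threshold;
# survivors would go forward to the LLM validation step.
candidates = np.argsort(sims)[::-1][:50]
candidates = [int(i) for i in candidates if sims[i] >= 0.3]
print(len(candidates), "concepts passed the 0.3 threshold")
```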
The closest Sector Subject Areas were identified using embedding matches and validated using an LLM prompt and manual inspection.
IfATE and Skills England
Up to 10 potential matches per Occupational Standard Knowledge statement were identified by comparing a text embedding vector of a concatenated text string of each statement and its associated Occupational Standard against a text embedding for each SSC Knowledge concept label. Text embedding vectors were generated using the OpenAI 3-Large Model with a cosine-similarity match threshold of 0.3 being applied. Matches above this threshold were then evaluated by prompting an LLM (gpt-4.1-2025-04-14) to validate each match and, where appropriate, assign a percentage importance of the knowledge to that statement. Following a sample inspection, matches assigned a probability score below 50% were rejected.
SSC Tasks
Embeddings matches were also used to assign SSC Knowledge concepts to SSC Tasks and then an AI prompt checked whether the Knowledge concepts had been correctly assigned to Tasks. An AI prompt was then used to define the importance score of the Knowledge to the Task.
Vacancy data
The sample of vacancy descriptions was searched for the SSC Knowledge concepts to check that they are all terms in common usage.
K5: Primary concepts
The primary concept type and potentially related concepts were identified using embeddings matches and checked using LLM prompts.
K6: Final SSC Occupational Knowledge concepts
The final set of SSC Knowledge concepts consists of 4,926 concepts linked to SSC Tasks, SSC Skills and subjects.
Secondary mappings
Secondary mappings to existing classifications of Skills, Tasks and Knowledge concepts were created using embeddings matches. The full list of secondary mappings available can be found in Appendix A.
Skill categorisations
Numeracy skills and Digital skills
These classifications were created using the SSC Skills that were rated as requiring an expert level of proficiency in the SSC Core Skills of Numeracy and Digital Literacy.
Green skills
An AI prompt was used to score how related (directly or indirectly) each SSC Skill is to the UK’s net zero emissions target and other environmental goals. Using previous work to define the Green SOC (see Warhurst, C., Harris, J., Cardenas-Rubio, J. C., & Anderson, P. (2025), ‘A just transition? Green jobs, good jobs and labour market inclusivity in Scotland’, European Journal of Workplace Innovation, 9(1-2), 63-79), the skills mapped to green SUGs were also identified. A manual inspection of the skills with high AI green scores and of those mapped to green SUGs was then carried out to identify a list of Green and Green-enabling skills.
STEM-M&H skills
An AI prompt was used to assign HECoS subjects to each of the STEM-M&H categories or ‘Not STEM-M&H’ and then, using the mapping from HECoS subjects to SSC Skills, a list of the STEM-M&H types of the subjects linked to each skill was created.
Secondly, using previous work to define SUGs as STEM-M&H and the SUG to SSC Skill mapping, a list of the STEM-M&H types of the SUGs linked to each skill was created.
Thirdly, an AI prompt was used to categorise each of the SSC Skills into one of the STEM-M&H categories or ‘Not STEM-M&H’. Where all three sources of information agreed, the skill was assigned to that category. A manual inspection of the remaining skills was carried out to define the STEM-M&H category.