Policy paper

Evaluation and performance analysis strategy and playbook 2024 to 2028

Published 10 June 2025

Foreword

The Civil Service is experiencing unprecedented change with an ambitious programme of modernisation detailed in the government’s blueprint for modern digital government. At the heart of this change is a ‘digital first’ mindset for a productive and agile state; a belief in the power of internet age tools and techniques to improve our public services and economic competitiveness.

For the Digital, Data and Technology (DDaT) directorate, our role is to equip staff with the digital tools they need to work more efficiently and innovatively, to upskill colleagues in embracing new digital techniques, and to streamline and simplify operational processes.

In terms of public-facing services, DBT’s context is different. Unlike many other departments, users of our digital products and services usually number in the thousands rather than millions each year. Despite this lower number, they can have a significant impact on the economy by securing more inward investment and supporting businesses to grow and export.

We assess the impact of our digital offering in terms of catalysing sustainable economic growth, creating jobs and promoting inward investment to adapt and grow emerging and established sectors. This requires a cost-effective, proactive, personalised single digital experience for business users, fast-tracking investment decisions and seizing new market opportunities.

DDaT’s monitoring and evaluation (M&E) team identifies appropriate performance indicators, measures success and assesses value for money. This informs challenging decisions around which digital opportunities DBT invests in and whether those investments continue to deliver against defined outcomes.

DDaT’s ‘fail fast’ principle means that, where M&E evidence shows insufficient impact or value from investments, those activities will either be stopped or reformed to quickly identify and address failures. This is how we hold ourselves and other DBT teams to account and ensure that every hard-earned pound of taxpayers’ money is invested efficiently to improve the public services taxpayers benefit from.

It is vital we bring the right skills and rigour to this work. Without this we cannot test the assumptions which underlie much of the debate around digital public services. Are they faster, better or cheaper? Are there unintended effects? Only through careful whole-life analysis can we truly answer these questions and better inform future investment choices.

In this strategy we set out the role of M&E for digital services. In particular, the Government Service Standard has provided inspiration for what ‘good’ looks like in the realm of digital products and services. We are proud to be the first government department to embed M&E into our DDaT product teams, baking our approach in right from the outset to enable us to hold ourselves to account in delivering those outcomes and understanding what works, what does not work and why.

This strategy gives us the path towards greater insight and understanding into where our future focus should be as government puts ever more emphasis on digital.

Jason Kitcat

Director of Digital, Data and Technology, Department for Business and Trade

Glossary of acronyms

  • AI: artificial intelligence
  • CDDO: Central Digital and Data Office
  • DBT: Department for Business and Trade
  • DDaT: Digital, Data and Technology
  • GA: Google Analytics
  • GA4: Google Analytics 4
  • GDS: Government Digital Service
  • GES: Government Economic Service
  • GSR: Government Social Research
  • GSS: Government Statistical Service
  • GTM: Google Tag Manager
  • HCSAT: harmonised customer satisfaction
  • HMT: His Majesty’s Treasury
  • IPA: Infrastructure and Projects Authority
  • KPI: key performance indicator
  • LLM: large language model
  • M&E: monitoring and evaluation
  • MVP: minimum viable product
  • NAO: National Audit Office
  • PA: performance analysis
  • RCTs: randomised controlled trials
  • ROAMEF: rationale, objectives, appraisal, monitoring, evaluation, feedback
  • SMT: Senior Management Team
  • SQL: Structured Query Language
  • SS10: Service Standard 10
  • ToC: theory of change
  • UCD: user centred design
  • UR: user research
  • VfM: value for money

DBT’s DDaT evaluation and performance analysis strategy

Introduction

Digital, Data and Technology (DDaT) is a strategic enabler for the Department for Business and Trade (DBT). DBT has a pivotal role to play in rebalancing the UK economy and driving economic growth. DDaT contributes to this by developing and supporting a large range of digital products and services, covering all key policy areas in the department, including:

  • free trade agreements
  • exports
  • business regulation
  • investment support

These areas are crucial for the UK’s economic prospects. With digitally empowered transformation an important government priority, continuous improvement of the digital services that provide businesses with critical support, including access to finance, growth advice and export support, is paramount. In addition, DDaT provides many of the internal services that enable DBT staff to do their work, empowering colleagues across the department to take advantage of our suite of digital tools through upskilling.

Spending decisions inevitably involve trade-offs. Departments need to provide good evidence on which products work best and why, and on how to continually improve their services and digital offer. Establishing a rigorous approach to M&E is a crucial input into evidence-based policy.

It is critical in allowing DBT to deliver the most impact with the available resources and in establishing whether digital is indeed delivering at a much lower cost. This evidence is important in holding ourselves to account, in enhancing current decisions and in supporting those that will be made in the future.

DBT is the first department to have embedded M&E into DDaT product development teams[footnote 1], adhering to the principles set out in the department’s M&E strategy[footnote 2]. This is part of a systematic drive in DDaT DBT to ensure that all our products and programmes can demonstrate not only cost efficiency, but also, wherever possible, their social and economic impact on the UK.

We are moving away from simply replacing administrative tasks towards the ‘delivery of public services digitally’. This is a much more comprehensive approach which requires an understanding of the social impact of these tools and services beyond the simple GDS indicators (for example, completion rate and cost per transaction).

While it remains important to look at data monitoring on service performance, it is crucial to evaluate what is happening outside of the service, to secure accountability and learning by establishing whether:

  • the service is doing what it originally set out to do
  • there are any unintended consequences
  • we are causing harm
  • we are leaving anyone behind
  • it is better than ‘doing nothing’
  • our approach to evaluating AI is proportionate
  • the expected cost-efficiency savings are actually realised, which cannot be demonstrated without a structured evaluation process in place
  • this is the most cost-effective way to deliver the service

Supporting agile principles, our team is structured into performance analysis, evaluation and value for money (VfM). Performance analysis focuses on monitoring and usability, while evaluation and VfM assess the design, implementation and impact of DBT digital services on users and the UK economy.

DBT recognises that more must be done to support accountability and learning. This strategy and playbook, therefore, lay the foundations for the fulfilment of DDaT DBT’s vision for M&E, by applying best practice standards set by GDS and HMT, improving the coverage and use of evaluations, working in agile, and utilising emerging methodologies.

Our vision for M&E in DDaT

Our vision is to:

  • understand and create the conditions which will allow robust M&E
  • consistently use evaluation
  • support understanding and decision-making regarding which approaches and services work best and why, and how to improve our products
  • secure accountability

Ultimately, our role is to enable DDaT DBT to develop and implement the most effective policies, services, and programmes to achieve the government’s objectives.

DBT’s DDaT delivering the strategy: playbook

There are 4 key elements that will help deliver this vision and these are set out in the following chapters of the playbook.

  1. Ensure digital services development standards meet His Majesty’s Treasury’s (HMT) needs and DBT’s M&E strategy principles.
  2. Ensure all products and services are covered by comprehensive M&E, including the measurement of public/social VfM.
  3. Continue to support evidence led decisions by embedding M&E within agile practices.
  4. Develop best practice M&E methodologies, building on best principles and aligning to GDS Service Manual, Magenta Book, and the Green Book.

Section 1: ensure digital services development standards meet HMT’s needs

1.1 Introduction

The M&E team in DDaT DBT is formed of a multi-disciplinary team of statisticians (GSS), performance analysts (PA), social researchers (GSR) and economists (GES).

Our role is to:

  • deliver a comprehensive evaluation and performance analysis programme in DDaT to understand what works, what does not work and why, and to secure accountability and learning
  • set up the framework for key performance indicators (KPIs), bespoke holistic evaluations, GDS key indicators and the measurement of HMT compliant VfM to support fiscal events

This is integral to understanding how resources are being used, estimating economic impact, supporting spending review decisions and National Audit Office (NAO) reviews, and securing continuous learning.

1.2 Why M&E is important

Primarily, M&E is important because it contributes to learning and accountability.

1.2.1 Learning

Evaluations can provide evidence which can manage risks and uncertainty. Especially in areas that are innovative or breaking new ground – such as large language models (LLMs) – there is a need for evidence to illustrate whether an intervention is working as intended.

Evaluations can provide evidence to inform decisions on whether to continue a policy, how to improve it, minimise risk or whether to stop and invest elsewhere. Therefore, good monitoring of data is essential for performance management by project boards to keep track of the delivery and implement any necessary changes.

1.2.2 Accountability

Evaluation plays a crucial role in public accountability, providing information about the outcomes and value of the initiatives government has put in place.

Evidence of policy effectiveness is also required for spending reviews and in response to scrutiny and challenge from bodies such as:

  • NAO
  • Public Accounts Committee
  • Select Committees
  • Infrastructure and Projects Authority (IPA)
  • Better Regulation Executive/Regulatory Policy Committee
  • International Development Committee

1.3 GDS Service Standard 10 (SS10)

Government digital products are governed by GDS Service Standards set out in the Service Manual[footnote 3] to help teams to create and run great public services.

SS10, ‘define what success looks like and publish performance data’, stipulates that departments need to measure and publish performance by working out what success looks like for each service and identifying metrics to measure what is working and what can be improved, as well as undertaking user research (UR).

The current service standards do not specifically ask service teams to monitor VfM. However, in order to comply with HMT’s Green Book guidance, it is important to also measure this by undertaking appropriate economic appraisals at the discovery stage, and considering factors beyond cost-efficiency to measure the social value of products and services.

As digital teams are moving away from just replacing administrative tasks to the ‘delivery of public services digitally’, we need to understand the social impact of the deployment of these tools and services which requires a more comprehensive and thorough approach. GDS indicators alone, such as completion rate and cost per transaction, are not sufficient.

While it remains important to look at data monitoring on service performance, it is crucial to also evaluate what is happening outside of the service, to secure accountability and learning by establishing whether:

  • the service is doing what it originally set out to do
  • there are any unintended consequences
  • we are causing harm
  • we are leaving anyone behind
  • it is better than ‘doing nothing’
  • our approach to evaluating AI is proportionate
  • the expected cost-efficiency savings are actually realised, which cannot be demonstrated without a structured evaluation process in place
  • this is the most cost-effective way to deliver the service

Therefore, DBT’s approach has been to tailor SS10 to reflect HMT’s needs as follows.

SS10 in DBT: ‘evaluate, monitor performance and measure value for money’

Embed the product with the elements needed to monitor and evaluate effectively, including the measurement of Green Book-compliant public/social value for money (VfM).

Work out what success looks like for your service and identify metrics which will tell you what’s working and what can be improved, combined with UR.

Ensure the performance data collected allows for comprehensive M&E including the measurement of public/social VfM.

Iterate and improve your metrics and data collection practices to balance the user needs with the business needs, for internal and external accountability and measurement of Green Book-compliant M&E and VfM.

Collect and use performance data from all channels, online and offline.

Why it is important to measure value for money

Government is committed to continued improvement in the delivery of public services. A major part of this is ensuring that public funds are spent on activities that provide the greatest benefits to society, and that they are spent in the most efficient way.

All new policies, programmes and projects should be subject to comprehensive evaluation. Evaluation examines the output of a project against what was expected and is designed to ensure that the lessons learned are fed back into the decision-making process.

Evaluation helps government make decisions about whether products and services are working to meet the original aims, how they can be improved and whether they are VfM for the UK economy.

Defining what ‘good’ looks like and identifying appropriate metrics means that you’ll know whether the service is solving the problem it’s meant to solve.

Collecting the right performance data means you will be alerted to potential problems with your service. When you make a change to the service, you will be able to tell whether it had the effect you expected.

Publishing performance data means that you are being transparent about the success of services funded by public money. People can make comparisons between different government services.

What applying the revised SS10 means in practice

Service teams should:

  • work alongside analysts and economists to identify the data collection needed in order to secure the information necessary for evaluation – much of this will be an extension of the mandatory KPIs
  • identify and draw up the theory of change (ToC) for the product
  • develop an M&E proposal including the measurement of VfM to meet Green Book principles and, where possible, to demonstrate or isolate the impact of the product or ‘causality’ that the change observed is because of the service and not because of another possible intervening reason
  • ensure that findings of evaluation are discussed, and the transmission of evidence forms the basis of continuous improvements to the service
  • identify metrics which will indicate how well the service is solving the problem it is meant to solve, and track performance against them
  • use performance data to make decisions about how to fix problems and improve the service

Central government services must publish the evaluation reports as per Evaluation Task Force guidelines.

Central government services must also publish data on the mandatory KPIs.

Digital take-up

Digital take-up is the percentage of people using government services online in relation to other channels, for example paper or telephone.

Completion rate

The number of digital transactions that your users complete as a percentage of all digital transactions that your users start.

User satisfaction

Users’ satisfaction with the service at various stages of using it.

Cost-efficiency (cost per transaction)

The total cost of providing the service – including assisted digital support costs – through all channels divided by the total number of completed transactions.
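To make these definitions concrete, here is a minimal illustrative sketch (in Python, using hypothetical figures rather than real DBT data) of how the four mandatory indicators could be calculated:

```python
# Illustrative sketch only, not DBT production code: computing the mandatory
# GDS KPIs described above from hypothetical monthly service figures.

def digital_take_up(digital_transactions: int, all_transactions: int) -> float:
    """Share of transactions completed through the online channel."""
    return digital_transactions / all_transactions

def completion_rate(completed: int, started: int) -> float:
    """Digital transactions completed as a share of those started."""
    return completed / started

def cost_per_transaction(total_cost: float, completed_all_channels: int) -> float:
    """Total cost of the service (all channels, including assisted digital)
    divided by the total number of completed transactions."""
    return total_cost / completed_all_channels

# Hypothetical example figures
started_online, completed_online = 5_000, 4_100
completed_offline = 900
total_service_cost = 250_000.0  # £, across all channels

print(f"Digital take-up: {digital_take_up(completed_online, completed_online + completed_offline):.1%}")
print(f"Completion rate: {completion_rate(completed_online, started_online):.1%}")
print(f"Cost per transaction: £{cost_per_transaction(total_service_cost, completed_online + completed_offline):.2f}")
```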

1.4 DBT’s M&E strategy 2023 to 2026 principles

DBT believes M&E will help achieve our vision. Holding ourselves to account, reflecting on our experiences and learning from our mistakes are the department’s core values. Following from this, having a compelling evidence base mitigates the risk of spending public money on activities that do not work as intended.

M&E supports policy and operational delivery, informs our interventions and guides decisions on whether to continue, stop or amend our interventions. This ensures DBT is focusing its resources where they add most value, maximising the impact and cost-effectiveness of our investment of taxpayers’ money.

In order to deliver the DDaT M&E strategy, we will adhere to the principles set out in DBT’s M&E strategy 2023 to 2026[footnote 4]:

  • Principle 1 – use of M&E evidence to make decisions
  • Principle 2 – strong governance for M&E
  • Principle 3 – capability building in M&E for all staff
  • Principle 4 – transparency and accessibility of evaluation evidence
  • Principle 5 – proportionate coverage for M&E
  • Principle 6 – high quality standards for M&E

Section 2: ensure all products and services are covered by comprehensive M&E

2.1 Introduction

To comply with HMT and GDS SS10, our objective is to ensure that all products and services are covered by robust and comprehensive M&E to support each phase of development. Traditional approaches to evaluation may not always be effective as digital products pose new challenges and offer opportunities.

Digital platforms are ideal for delivering ‘smart policy making’ that is constantly self-evaluating. While digital products change rapidly, making it difficult to align with the time it takes to conduct some evaluations, the speed of digital can also enable a faster rate of evaluations.

The regulatory framework for digital products is evolving, with new methods of evaluation being developed and tested. This creates uncertainty around what current best practices are, but also allows for testing new methods and new approaches.

DBT currently operates more than 40 digital products supporting all key policy areas in the department. These cover a range of policy areas, such as business growth, free trade agreements, exports, business regulation and investment support, and include corporate services. The products are at various stages of development (see Annex A for the current list of DDaT products and services).

To guarantee comprehensive coverage across such an extensive portfolio, we need to adopt a proportionate approach tailored to each product and its characteristics. The decision of what data to collect and what analytical approach to apply is taken on a product-by-product basis; it depends on feasibility, and must remain in accordance with our prioritisation framework, agreed by our portfolio chiefs.

2.2 Broad evaluation types

There is a wide range of evaluation methods. Figure 1 offers one possible categorisation of evaluation types which are used in DDaT DBT. These are not mutually exclusive; all may be present in a comprehensive evaluation as each addresses different evaluation questions, leading to more confident policy making.

Process evaluations

Process evaluations look at the process of implementation, allow for service adjustments and aim to answer questions such as: how did it work, for whom and why? Were there any unforeseen effects?

Monitoring or outcome-based evaluations

Monitoring or outcome-based evaluations provide evidence on how far the intended aims, objectives and/or savings are being achieved, based on a ToC which outlines the key indicators to track over time, ideally against a comparison group and against a baseline.

By tracking data, monitoring aims to answer the question of whether the scheme achieved what it was originally set up to achieve. Note that monitoring does not explain whether a change in the data is due to the intervention. Monitoring does not measure additionality but can provide an indication of the direction of travel.

Impact evaluations

Impact evaluations provide important evidence on whether the service worked and whether the initiative was better than doing nothing. These methods typically rely on very good data collection in both a treatment and a comparison group before and after the intervention. Impact evaluations deploy econometric, experimental and quasi-experimental designs to try to answer the question: ‘Is the change that I see in the monitoring data because of the intervention?’
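As an illustration of the kind of quasi-experimental design this can involve, the sketch below applies a simple difference-in-differences model to hypothetical data; the variable names and figures are invented for the example and do not come from a DBT evaluation.

```python
# Minimal difference-in-differences sketch (hypothetical data): is the change
# seen in the monitoring data because of the intervention, or would it have
# happened anyway?
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    # 1 = businesses using the new digital service, 0 = comparison group
    "treated": [1, 1, 1, 1, 0, 0, 0, 0] * 25,
    # 0 = before launch (baseline period), 1 = after launch
    "post":    [0, 1, 0, 1, 0, 1, 0, 1] * 25,
    # outcome of interest, e.g. export enquiries completed per quarter
    "outcome": [10, 14, 11, 15, 10, 11, 9, 10] * 25,
})

# The coefficient on the treated:post interaction is the difference-in-differences
# estimate of the additional change attributable to the intervention.
model = smf.ols("outcome ~ treated * post", data=df).fit()
print(model.summary().tables[1])
```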

Economic evaluations

Economic evaluations aim to establish how far the cost of the intervention was justified by the benefits achieved, ideally incorporating both cost-benefit and cost-effectiveness analysis.

Economic evaluations take the information from all evaluation types and apply Green Book principles to measure VfM: What is the return to the economy for each £1 spent on the intervention? They try to establish whether an intervention constituted VfM, independently from whether or not it achieved a positive effect or a desired outcome.
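As a minimal illustration of this ‘return per £1’ framing, the sketch below computes a benefit-cost ratio from hypothetical, already-discounted benefit and cost totals; the figures are invented for the example.

```python
# Hedged illustration (hypothetical figures): a Green Book-style benefit-cost
# ratio answering "what is the return to the economy for each £1 spent?"

def benefit_cost_ratio(present_value_benefits: float, present_value_costs: float) -> float:
    """Return to the economy per £1 of cost, using discounted (present value) totals."""
    return present_value_benefits / present_value_costs

pv_benefits = 3_200_000.0  # e.g. monetised time savings and additional exports, £
pv_costs = 1_000_000.0     # build and run costs over the appraisal period, £

bcr = benefit_cost_ratio(pv_benefits, pv_costs)
print(f"Every £1 spent returns £{bcr:.2f} of benefits (BCR = {bcr:.2f})")
print(f"Net present social value: £{pv_benefits - pv_costs:,.0f}")
```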

Figure 1: main evaluation types[footnote 5]

2.3 How we work in agile

To support evidence-based decision-making, we aim to embed a culture of M&E within agile practices, delivering these evaluation methods at different stages of the agile life cycle. Agile’s values, ideals and goals match naturally with M&E principles and best practice. Agile’s popularity is due to its ability to move software development solutions to market more quickly.

Agile’s principles include early delivery, continual development, adaptive planning, and being flexible regarding response to changes. Teams working in agile develop and iterate their products, producing minimum viable products more easily. Gathering feedback and responding to it is part of this dynamic process.

Figure 2: ROAMEF cycle (Magenta Book) – evaluation has a role at all stages in the policy lifecycle

As outlined in the Magenta Book, the ROAMEF cycle (rationale, objectives, appraisal, monitoring, evaluation, feedback) demonstrates that evaluation plays an important role at all stages of the policy cycle.

Figure 3: agile cycle process – for each iteration, the team plans, develops, reviews, and deploys updates to the product functionality

The agile cycle outlines the development stages of each iteration of a digital product. It includes a repeated cycle of planning, design, development, testing, deployment, reviewing, and launching updates to the product functionality.

From this, it is easy to see that the standard ROAMEF evaluation cycle already includes the values of the flexible agile process. In the application of the ROAMEF cycle each stage of the intervention development, from its design to its implementation, is reviewed and tested. Data and information collected at each stage are fed back to improve the policy and move it to the next stage.

By embedding the ROAMEF cycle into the digital product development process, we ensure that:

  • enough information is available to product teams to inform decision making on product development
  • the feedback gathered at each stage is relevant
  • data collection is embedded in the product to allow for its evaluation, and that
  • progress against objectives is monitored

Continuous and robust monitoring also allows us to pivot and change products in a timely way when needed, minimising rework and the risk of developing ineffective products.

Table 1: we merged the ROAMEF cycle into the different stages of the agile product delivery process

Product life cycle | ROAMEF cycle
Discovery | Rationale, objectives and appraisal
Alpha | Monitoring
Private beta/Pre-live | Evaluation and feedback
Live | Continuous feedback

2.4 Evaluation and performance analysis service offer in DBT

The monitoring, evaluation and performance analysis approach is bespoke for each DDaT product and delivery phase, and can range from a minimum viable product (MVP) to a comprehensive product.

The M&E MVP consists of the minimum data collection and evaluation required to meet HMT and GDS compliance requirements as far as possible within product or data constraints.

The comprehensive product provides the most robust evaluation type feasible. Although we strive to achieve the activity set out in the framework, note that this is an ‘ideal’ framework and activities will, at times, happen before or after the stage they are signposted under in the following sections.

2.4.1 Discovery

In this phase, the product teams focus on understanding the problems to solve and the user needs. The initial scope of the project is defined alongside possible solutions. The discovery phase normally involves different professions, depending on the type of product.

Product owners initiate discussions with key stakeholders with the support of user researchers, user designers and other relevant professions (business analysts, content designers, etc). At the end of this stage the team has a clear understanding of user needs and pain points, the scope and requirements for the project are documented, and key stakeholders approve the initial plan of work.

The added value of M&E and PA at this stage comes from setting out the data needed to evidence the project’s performance and success. Performance analysts set up the data collection goals, ensuring alignment with KPIs as set by the Government Digital Service (GDS).

M&E agree the plan to collect data for VfM measurements, identify baseline measures (or collect baseline data if feasible) and run appraisals. Appraisals should ensure there is a market failure and align with the solutions explored by the product team, identifying the most cost-effective way to deliver the project.

Table 2: M&E activities in the Discovery stage

Discovery analysis | Description | Minimal viable product | Comprehensive product
Appraisal and rationale for intervention | Analysis aimed at identifying what the market failure is and whether the intervention is needed and a good use of public resources. This is followed by an assessment of the costs and benefits of all the alternative solutions that could deliver government objectives. This might include baselining the current way of delivering a service. | Offered | Offered
Scoping M&E plan | The M&E team assesses what analysis and resources will be required to embed robust M&E in the new product, depending on its characteristics. Baselining and the identification of a comparison group might take place at this stage. Includes planning for experimentation. | Offered | Offered

2.4.2 Alpha

In this phase, the product team focuses on validating concepts and assumptions, developing prototypes and gathering early user feedback to inform the development of the MVP. This phase should see the iteration of various prototypes, the analysis and review of initial user feedback and the definition of the scope of the MVP for the following stage of development. M&E and PA support the activities needed to achieve these objectives.

Evaluators devise the evaluation plan, conduct ToC workshops, develop logic models and scope the evaluation options, including options for impact evaluation. The M&E team will also set up an evaluation advisory group at this stage, with representatives from the product, policy and marketing teams. The data needs identified in the earlier stage should be embedded, and the measures for VfM should be set up at this stage.

Baselining also takes place if it was not already undertaken in Discovery. This might include surveying the population/users of the current way of delivering a service on a series of topics uncovered by the ToC to include measures such as satisfaction and System Usability Scale if we are replacing an existing digital service.

Performance analysts lead workshops with all relevant teams (product teams, policy, marketing and other key stakeholders) to identify KPIs for the indicators identified in the ToC and determine what tracking and measuring needs to be in place. The findings of the workshop are used to create a Performance Framework, listing the KPIs and how these will be collected.

At this stage, PAs will set up Google Analytics (GA) and Google Tag Manager (GTM) properties and tag the website with the tagging and tracking required to collect data on relevant KPIs and standard metrics. The KPI framework makes clear which metrics will be collected via web analytics and which via back-end data for the service. The minimum KPIs that we aim to collect for each product align with SS10.
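As a purely illustrative sketch of what a performance framework entry might capture, the example below records, for each hypothetical KPI, its definition, the ToC indicator it supports and whether it is sourced from web analytics or back-end data; all names and fields are invented for the illustration.

```python
# Illustrative sketch of how a performance framework entry might be recorded
# (field names and metrics are hypothetical, not DBT's actual framework).
from dataclasses import dataclass

@dataclass
class KpiEntry:
    name: str
    definition: str
    toc_indicator: str
    source: str            # "web analytics (GA4)" or "back-end service data"
    reporting_frequency: str

performance_framework = [
    KpiEntry(
        name="Completion rate",
        definition="Completed digital transactions / started digital transactions",
        toc_indicator="Users can complete the service unaided",
        source="web analytics (GA4)",
        reporting_frequency="monthly",
    ),
    KpiEntry(
        name="Applications approved",
        definition="Number of applications approved by caseworkers",
        toc_indicator="Businesses receive the support they applied for",
        source="back-end service data",
        reporting_frequency="monthly",
    ),
]

for kpi in performance_framework:
    print(f"{kpi.name}: {kpi.source}, reported {kpi.reporting_frequency}")
```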

Table 3: M&E activities in the Alpha stage

Alpha analysis | Description | Minimal viable product | Comprehensive product
Theory of change | A visual representation of the relationship between resources, activities, outputs, outcomes, impact (additional impact) and economic impact. This is instrumental to understanding how the service is expected to achieve the desired results. | Partially offered (basic version only) | Offered
Set up the M&E advisory group | An advisory group with representatives from product, policy and marketing is set up to steer and inform the evaluation tools. | Not offered | Offered
Key performance indicators (KPIs) workshop and performance framework | A session aimed at understanding what success looks like and identifying the indicators that will enable progress to be measured against the service’s objectives. KPIs are then established in a performance framework. | Partially offered (product team only) | Offered (all stakeholders)
Baselining | Only collecting data prior to the intervention enables the effectiveness of the service to be measured robustly. Where possible, this may include the identification of a control group. | Offered | Offered (including control group)
Finalise M&E plan | A detailed plan of the analysis that will be carried out to robustly monitor the service’s progress against its objectives and, where possible, to quantify its impact. | Offered | Offered
Cookie banner, GA and GTM implementation | Ensuring compliant cookie banners and cookie pages are implemented. Setting up Google Tag Manager (GTM) and Google Analytics (GA) to capture GDS indicators and KPIs identified in the performance framework. | Offered | Offered

2.4.3 Private beta

At this stage, the MVP is being tested by a restricted group of users and feedback from these users is used to refine the product. The product team focus on identifying critical bugs and usability issues and work to resolve these issues. User feedback in this phase is critical to refine the product features and prioritise implementation.

Performance analysts use the data collected to develop a product performance dashboard. This dashboard monitors the KPIs and other critical metrics identified in the previous stage. A combination of GA and back-end data for the service is used to deliver the metrics from the performance framework. Product teams are responsible for pipelining the back-end data into Data Workspace. For the analytics data, PA work with data engineers to set up pipelines from Google BigQuery to Data Workspace.
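A minimal sketch of what one step of such a pipeline might look like is shown below, assuming a GA4 BigQuery export and the google-cloud-bigquery Python client; the project, dataset and query are hypothetical, and the real pipelines are built with DBT’s data engineers.

```python
# Illustrative sketch only: pulling GA4 export data from BigQuery into a
# dataframe for onward processing (for example, loading into Data Workspace
# or a dashboard). Project, dataset and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-ga4-project")  # hypothetical project ID

query = """
    SELECT
        event_date,
        event_name,
        COUNT(*) AS events
    FROM `my-ga4-project.analytics_123456.events_*`  -- hypothetical GA4 export table
    WHERE event_name IN ('page_view', 'form_submit')
    GROUP BY event_date, event_name
    ORDER BY event_date
"""

# Run the query and load the results into a pandas DataFrame.
df = client.query(query).to_dataframe()
print(df.head())
```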

Where applicable, evaluators and economists may have already set up and run a baseline survey, or it may still be in the field. Once the ToC is fully agreed, evaluators focus on process evaluation and set up the impact evaluation based on the feasibility options identified in the previous stage.

Results of the baseline data analysis will inform product development and feature prioritisation, and will be linked with user feedback. At this stage M&E and UR should work closely together to ensure quality data and user feedback are fed into product development.

Table 4: M&E activities in the private beta stage

Private beta analysis | Description | Minimal viable product | Comprehensive product
SS10 GDS indicators tracking | These are the mandatory indicators that GDS recommends for measuring a service’s performance: completion rate, customer satisfaction, cost per transaction and digital take-up. | Offered | Offered
H-CSAT | The harmonised customer satisfaction survey is a 5-star exit survey that measures satisfaction consistently across all DDaT products. | Offered | Offered
Performance dashboard | A visual tool that displays the service’s KPIs to monitor performance in real time. | Partially offered (SS10 indicators only) | Offered (all KPIs)
Set up the M&E plan | Continue with the data collection tools, such as reporting on the baseline survey data, qualitative interview guides for process evaluation and sampling strategies for process evaluation. | Not offered | Offered
Regular performance reporting and understanding the user journey | Performance analysts work closely with the product and user centred design (UCD) teams so that analytics insights are used to inform the product design. | Not offered | Offered

2.4.4 Pre-live/Public beta

In this phase, the product team prepare the product for public release, ensuring it meets the performance and quality standards. The product performance is optimised, and stakeholders approve the public release. While the product undergoes final testing and is prepared for deployment, the team work with marketing colleagues to develop a communication plan.

PA use the dashboard to engage product teams and key stakeholders, ensuring continuous learning through monitoring and developing additional measures where necessary. PA confirm that the service is collecting all relevant analytics.

Evaluators and economists will report (and publish) these analytics where agreed with key stakeholders. M&E and PA provide recommendations based on the analysis and ensure that these are tested and included in the backlog to ensure continuous improvement of the product.

Table 5: M&E activities in the Pre-live/Public beta stage

Pre-live analysis | Description | Minimal viable product | Comprehensive product
Undertake the M&E plan | This may involve a post-implementation survey. Process evaluation is used to further understand how the service is working and why, and to look at unintended consequences. | Not offered | Offered
Interim evaluation report (including VfM update) | Report that summarises all evidence to date on the performance and effectiveness of the product, aimed at informing decisions on roll-out and future improvements. | Partially offered (cost efficiency only) | Offered (full VfM analysis)
Publication of mandatory GDS indicators | Publication of mandatory GDS indicators, adhering as closely as possible to the statistical code. | Offered | Offered

2.4.5 Live

In this phase, the product team launches the product to the public. Once the product is live, its performance is monitored continually, user feedback is reviewed regularly, and improvements to the product are made accordingly. The user satisfaction measure, alongside other GDS KPIs, will demonstrate the success of the product.

Performance analysts ensure that dashboards are constantly up to date and discussed at all relevant levels.

In this phase, evaluators run data collection to compare KPIs and key measures against the baseline. When relevant, evaluators run impact evaluation (econometrics exercise, quasi-experimental analysis etc.) and estimate VfM. At the appropriate time, M&E and PA bring together all of the evidence into a holistic deep dive evaluation.

Table 6: M&E activities in the Live stage

Live analysis | Description | Minimal viable product | Comprehensive product
Deep dive every 4 years | A holistic assessment of all evidence to date on the performance of the service. Depending on feasibility, this can include all types of evaluation, such as: process evaluation, outcome evaluation, impact evaluation, VfM evaluation, bespoke research questions and UR findings. | Not offered | Offered
Deep dive only with data available | This approach gathers all evidence available on the performance of the product (performance analysis, UR, national surveys etc.) but does not deploy new primary data collection. | Offered | Offered
Dashboard refinement | Iterate and improve dashboard(s) based on stakeholder feedback. | Not offered | Offered
Publication | Publication of the deep dive. | Not offered | Offered
Performance board reporting | The M&E team regularly reports insights on the performance of the product to the governance board to inform decision making. | Offered | Offered

2.4.6 Throughout the life cycle

In addition to the analysis undertaken within the life cycle stages, there are some M&E and PA contributions that will take place throughout the full life cycle, or at various times depending on service requirements.

Government digital services need to adhere to the Service Standard throughout the life cycle. The Service Standard helps teams create and operate good public services. Services must be assessed if they are the responsibility of a central government department and either of the following apply:

  • getting assessed is a condition of Cabinet Office spend approval
  • it is a transactional service that is new or being rebuilt – spend approval will say whether it counts as a rebuild

M&E and PA analysis can support teams with Service Assessments, with a particular focus on SS10 in DBT: ‘evaluate, monitor performance and measure value for money’ (see Section 1). Many of the activities throughout the life cycle will feed into the requirements of SS10, such as creating a ToC, running a KPI workshop and building performance dashboards. The M&E and PA teams can then bring together this evidence to present to Service Assessment Panels.

Additionally, A/B testing may take place at various stages, most likely later in the life cycle. This is the method of comparing 2 versions of an app or webpage to identify the better performer. The Performance Analysis team are in the process of acquiring an A/B tool and building capacity for A/B testing. Further information on A/B testing will be provided at the end of this workstream.
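As an illustration of the kind of analysis an A/B test produces, the sketch below compares completion rates for two hypothetical page versions with a two-proportion z-test; the figures are invented and this does not describe the specific tool being procured.

```python
# Hedged sketch (hypothetical numbers): analysing a simple A/B test of two
# page versions with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

# Completions and visitors for version A and version B
completions = [430, 475]
visitors = [5_000, 5_000]

z_stat, p_value = proportions_ztest(count=completions, nobs=visitors)

rate_a, rate_b = (c / n for c, n in zip(completions, visitors))
print(f"Version A completion rate: {rate_a:.1%}")
print(f"Version B completion rate: {rate_b:.1%}")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
else:
    print("No statistically significant difference detected.")
```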

Table 7: M&E activities that run throughout the life cycle

Analysis | Description | Minimal viable product | Comprehensive product
Support with Service Assessments | PA/M&E can support with Service Assessments throughout the life cycle, with a focus on SS10. | Offered | Offered
A/B testing | When needed, A/B testing experiments can be carried out to determine which alternative features of the product perform better. A/B testing capacity is currently being built, so is not yet on offer. | Not offered | Offered

2.5 Prioritisation criteria

M&E resources are allocated to DDaT products and projects according to a number of prioritisation criteria.

1. Cost

This could be the cost of the service (the higher the public resources used to deliver a service, the greater the need for robust evaluation) or the potential cost-effectiveness or cost-efficiency gains of delivering digitally.

2. Ministerial and departmental priorities

Analytical resources are allocated in alignment with the department’s priorities, as endorsed by DDaT’s Senior Management Team (SMT).

3. Legislation requirements

Strict legislative requirements might limit the possibility of influencing how a service is run, its priorities or its resourcing, whereas the absence of such requirements allows for greater flexibility and influence, enhancing the impact of analytical insights.

4. Magenta Book[footnote 6] prioritisation criteria

Government guidance recommends prioritising the evaluation of interventions that are high profile, involve significant risks (to individuals or businesses), raise important ethical issues or uncertainties, and have high learning potential.

5. M&E impact and evaluability

M&E resources are allocated proportionally to their potential impact on the service outcomes considering, for example:

  • the service’s scale
  • whether it is external or internal facing
  • its value for money
  • the extent to which M&E has been neglected
  • data availability
  • the feasibility of alternative analytical approaches or evaluability

Section 3: support the development of best practice methodologies

3.1 Introduction

M&E in DBT is a multi-disciplinary team comprising statisticians, performance analysts, social researchers and economists. Each of the 4 professions brings its own unique capabilities, methods and approaches. The diverse methodologies employed by each profession are complementary; joined together, they provide a comprehensive overview of product performance.

Performance analysts focus on assessing the performance of digital products, using Google Analytics and other website data, often derived through the KPIs, and can support benchmarks to measure outcomes against targets.

Statisticians bring rigour through advanced data analysis techniques such as inferential statistics and survey methods, ensuring accuracy and validity in data collection and its interpretation.

Economists contribute through assessment of the broader economic impact and cost-effectiveness of our products, using models and economic theories to predict and evaluate long-term impacts on the UK economy.

Social researchers provide in-depth insight into societal implications and impacts of the products through methods such as interviews, surveys and case studies.

Each profession works to its own professional standards.

In our work, we also adhere to the following guidance.

  1. Magenta Book – which outlines the principles which should be followed when designing an evaluation.
  2. Green Book – which outlines the thinking models and methods to support the provision of advice to clarify the social – or public – welfare costs, benefits, and trade-offs of alternative implementation options for the delivery of policy objectives.
  3. GDS Service Manual – which sets out how to use data to improve digital services, including the guidance on the data which must be collected and published.

Together, these 4 professions integrate quantitative and qualitative methods and data, enabling thorough evaluation of our products. This allows for assessment of unforeseen consequences and ensures that decision-makers have a holistic understanding of the performance of our products and can take evidence-based actions. Overall, this leads to more effective resource allocation across the department.

3.2 M&E as a DDaT role in DBT

As we have seen in the previous chapters, comprehensive evaluation of digital products is paramount to securing accountability and learning. To this effect we have developed a DDaT capability framework for M&E.

Demand for M&E support for DDaT products has grown rapidly since we started, providing the necessary information and data to support the Public Accounts Committee, NAO reviews, spending reviews, and product and policy boards within the department. The skills the M&E team offers DDaT are unique: as government moves to deliver public services digitally, the importance of evaluating their social and economic impact becomes ever more pronounced.

In 2024, the GDS capability framework design council approved our team’s proposal to add a digital evaluator role to the DDaT capability framework, meaning that digital evaluators can be embedded in digital teams and deployed across government to secure accountability and continuous learning.

3.3 Capability building

We have a comprehensive capability building offer for our team which covers the Learning and Development (L&D) needs all the way from the onboarding stage to knowledge transfer and handover.

We have nominated L&D leads in the team who frequently organise training courses. These cover skills used in our day-to-day work, such as survey methods, economics, or Google Tag Manager courses. We also attend courses that upskill us on broader topic areas, such as ways of working in agile, as well as line management courses.

We share our learnings and skills with other members of the team in regular knowledge sharing meetings, which contributes to capability building across our professions. We also have peer mentoring programmes, through which members of the team learn new skills, such as coding in the R programming language, under the guidance of more experienced colleagues.

3.4 Best practice methodologies

Our DDaT M&E team uses a variety of methodologies to ensure a comprehensive performance overview of our digital products. Our duty is to keep up with new methodologies and identify new opportunities to spearhead M&E developments.

3.4.1 Technical developments in evaluation analytical techniques

The rapidly changing research and analysis tools landscape, including advancements in Artificial Intelligence (AI), impacts how and what we do. We want to make sure we continue to innovate, so keeping up with those developments is paramount.

Our M&E team keeps up with the latest methodological developments by arranging bespoke training sessions, regularly attending conferences and webinars, and maintaining cross-government relationships with other M&E professionals to learn about cutting-edge techniques in data collection and analysis. Collaboration with cross-government professionals and with professional networks and associations, such as the Evaluation Society and the Social Research Association, also provides access to new knowledge and innovative techniques.

Access to the right tools and having the skills to use them sit at the heart of this. As a team, we regularly review the way we do analysis and discuss the challenges and recent developments in the field and implications this has for our work.

AI provides opportunities for quicker research and evaluation design, including evidence reviews and data analysis, as well as easier and potentially improved use of evidence from evaluation and research. The chart from Government Social Research (GSR) illustrates how AI may contribute to key analytical areas in government. We are committed to exploring the ways in which AI might affect how we do evaluation, as well as pioneering the AI evaluation guidance approaches.

3.4.2 Performance analysis techniques in light of technological advancements

The rapidly changing technology landscape, which includes advancements in AI and changes in search engine algorithms and privacy policies, has implications for the work we, as the PA team, do and requires us to:

  1. Stay abreast of technology, industry and PA developments. We need to understand how these developments impact the digital products we analyse. For example, understanding the impact of the phasing-out of cookies from some browsers or the increased need to measure the performance of chatbots.

  2. Have access to the right tools. We need to look for ways to source new tools and techniques to enhance our data collection and analysis. We are currently in the process of procuring a new A/B testing tool to support product teams in setting up more robust experiments.

  3. Have the skills to use the tools and techniques. We need to ensure that all team members have access to sound training, both in technical performance analysis and in broader analytical techniques, including surveys or statistical methods.

To do this effectively, alongside day-to-day work, the PA team has plans to specialise in a number of technical performance analysis domains. This follows the approach taken by some government departments, such as the Cabinet Office (still tbc):

  • best practice in GA4/GTM
  • coding (SQL, Python, HTML)
  • A/B testing and statistics

3.5 Evaluation of AI

As we have seen, AI tools will be important for evaluators as they will provide new analytical possibilities and approaches. However, AI tools and in particular generative LLMs will increasingly be deployed to deliver public services and will also need to be evaluated. AI offers unique opportunities and challenges which may result in unintended consequences.

Robust M&E is, therefore, vital to capture those and understand how to improve delivery. As the only embedded M&E team in DDaT, our team is uniquely placed to identify and propose ways to robustly assess the performance of AI-type policies. This is why we developed the DBT AI M&E framework. The guidance has economic appraisal at its heart and sets out the process of how to robustly measure wider societal and economic impacts of such policies (See Annex B).

In addition, we are working with HMT and the Cabinet Office on the development of supplementary guidance to the Magenta Book on evaluating the impact of AI tools. We also supported the GDS GOV.UK chatbot experiment team with the performance assessment of the tool.

We will continue to identify new opportunities to work with others to ensure effective evaluation of AI tools in government. We will also continue to share this knowledge outside of DBT through various channels.

Section 4: concluding remarks

DDaT DBT takes M&E seriously. We recognise that robust M&E will help us to learn and to develop services which have the best chance of delivering the most impact and VfM. We also understand that, while M&E can take up scarce resource and can have a lower priority for ministers and officials engaged in day-to-day delivery, foregoing the evidence and knowledge from robust M&E can pose an even higher risk in the long term.

Appropriate and proportionate M&E is also necessary for accountability. In many respects, DDaT DBT is leading the way on the evaluation of digital products. We are the first in Whitehall to have an embedded team of analysts looking at compliance with HMT and ensuring delivery of a comprehensive programme of work.

Having laid the groundwork for a strong culture of accountability and learning, we recognise that it will take time to deliver the strategy in full and for all the benefits to be realised. To make this a success, we will need to make a concerted effort at every level.

This will involve consistent demand and expectation for evaluation evidence from senior managers and decision makers, adequate resourcing and evaluation capability within DDaT, and effective collaboration with product teams, marketing, policy and analysis colleagues across DBT, as well as effective collaboration with GDS and HMT.

Annex A: DBT’s service list

The service list was accurate as of 2024.

Service name | Delivery phase
Apply for an export certificate / Apply for an import licence (ICMS) | Public beta
Check barriers to trading and investing abroad | Public beta
Check how to export goods | Public beta
Data Hub | Public beta
Data Workspace | Public beta
Delegates Management System / UKDSE Events | Public beta
Digital market access service (DMAS) | Public beta
Digital Workspace and People Finder | Public beta
Employment Agency Standards case management system | Not applicable
Enquiry Case Management System (ECM) | Public beta
Enquiry Management Tool (Ready to Trade) | Public beta
Export support service | Public beta
Finance Forecast Tool (FFT) | Public beta
FTA SharePoint site | Live
great.gov.uk: Digital Answers | Alpha
great.gov.uk: Expand your business to the UK | Private beta
great.gov.uk: Export Academy | Public beta
great.gov.uk: Export opportunities | Public beta
great.gov.uk: Export planner | Public beta
great.gov.uk: Find a buyer | Public beta
great.gov.uk: Find a UK supplier | Public beta
great.gov.uk: Investment support directory | Public beta
great.gov.uk: Learn to export | Public beta
great.gov.uk: Market Guides | Public beta
great.gov.uk: Report a trade barrier | Public beta
great.gov.uk: Where to export | Public beta
Information Risk Assurance Process (IRAP) tool | Private beta
King’s Awards for Enterprise | Public beta
Leaving DBT Service | Public beta
Licensing for international trade and enterprise (LITE) | Private beta
Open Regulation Platform | Private beta
Primary Authority Register (PAR) | Live
Product Safety Database (PSD) | Public beta
Prompt Payments (‘Check when large businesses pay their suppliers’) | Public beta
Regulated Professions Register | Public beta
Sanctions | Alpha
Search for UK subsidies | Public beta
Shared parental leave tools | Public beta
SPIRE | Live
Submit Cosmetic Product Notification (SCPN) | Live
Tariff application platform (TAP) | Public beta
Trade Disputes SharePoint site | Live
Trade remedies service (TRS) | Public beta
UK Market Conformity Assessment Bodies (UKMCAB) | Public beta

Annex B: M&E framework for generative AI in DBT

By DDaT Evaluation and Performance Analysis Strategy and Delivery Team (as of 2024)

Background

Generative artificial intelligence is a form of AI – a broad field which aims to use computers to emulate the products of human intelligence or to build capabilities which go beyond human intelligence. Unlike previous forms of AI, generative AI produces new content, such as images, text or music. It is this capability, particularly the ability to generate language, which has captured the public imagination, and creates potential applications within government.

GDS has recently published the Generative AI Framework for UK government. Generative AI has the potential to unlock significant productivity benefits. The framework aims to guide anyone building generative AI solutions, laying out what must be taken into account to use generative AI safely and responsibly. It is based on a set of 10 principles which should be borne in mind in all generative AI projects.

Principle 10 states: “You use these principles alongside your organisation’s policies and have the right assurance in place”. Within DBT, an AI governance mechanism has been established. GDS states that departments need to understand, monitor and mitigate the risks that using a generative AI tool can bring, as well as connect with the right assurance teams in their organisation early in the project lifecycle for the generative AI tool.

Robust M&E to support these AI tools is therefore vital to provide that continuous assurance, support the flow of feedback and learning, and monitor the extent to which the original objectives and the 10 key principles are met. To this effect, this paper outlines the issues to consider in order to set up a proportionate evaluation plan for AI projects.

Monitoring and evaluation

Monitoring is the collection of data about the policy or programme. It is used to keep track of the key performance metrics including costs. Data acquired through monitoring is used for evaluation.

Evaluation is the systematic assessment of a policy’s design, implementation, outcomes and impacts using the data collected. This includes, where possible, the measurement of unforeseen effects and of economic impact.

Why M&E is important

The two primary reasons are learning and accountability.

Learning

Evaluations can provide evidence which can manage risks and uncertainty. This is especially the case in areas that are innovative or breaking new ground, where there is a need for evidence to illustrate whether an intervention, such as a large language model (LLM), is working as intended.

Evaluations can provide evidence to inform decisions on whether to continue a policy, how to improve it, minimise risk or whether to stop and invest elsewhere.

Good monitoring data is also essential for performance management, enabling project boards to keep track of delivery and implement any necessary changes.

Accountability

Evaluation plays a crucial role in public accountability, providing information about the outcomes and value of the initiatives government has put in place.

Evidence of policy effectiveness is also required for Spending Reviews and in response to scrutiny and challenge from bodies such as:

  • National Audit Office (NAO)
  • Public Accounts Committee
  • Select Committees
  • Infrastructure and Projects Authority (IPA)
  • Better Regulation Executive/Regulatory Policy Committee
  • International Development Committee

What is involved in M&E

Before the intervention

Economic appraisal

Any government intervention, including the introduction of AI or LLMs, should be preceded by an economic appraisal. This is the process of assessing the costs, benefits and risks of alternative ways to meet government objectives. A robust appraisal ensures that the intervention is the most effective solution to meet the intended outcomes.

An appraisal is a two-stage process:

  • long list appraisal – all possible options to meet the policy objectives are assessed against a set of critical success factors. The most effective and viable options are shortlisted

  • short list appraisal – social cost-benefit analysis allows us to identify the preferred option among the shortlisted ones

The Green Book outlines the principles that must be followed when conducting an appraisal. For some AI projects, the intervention may propose savings to the public purse, for which a clear baseline of the current cost of delivery will be imperative. The evaluation plan can then follow up progress against the chosen delivery option and assess the extent to which the original aims in the business case are met.
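To make the short-list stage concrete, the sketch below compares hypothetical shortlisted options using a simple net present value (NPV) calculation. The option names, cash flows and time horizon are invented for illustration only and are not drawn from any DBT business case; real appraisals must follow the Green Book’s discounting and optimism bias guidance.

```python
"""Illustrative only: NPV comparison of hypothetical shortlisted options.
All figures are invented; real appraisals follow Green Book guidance."""

DISCOUNT_RATE = 0.035  # Green Book social time preference rate


def npv(net_benefits, rate=DISCOUNT_RATE):
    """Net present value of a stream of annual net benefits (year 0 first)."""
    return sum(value / (1 + rate) ** year for year, value in enumerate(net_benefits))


# Hypothetical net benefit profiles (benefits minus costs, £ thousands per year)
options = {
    "Do minimum": [0, 50, 50, 50, 50],
    "Rules-based triage tool": [-400, 150, 200, 200, 200],
    "LLM-assisted triage tool": [-700, 250, 300, 350, 350],
}

for name, profile in options.items():
    print(f"{name}: NPV = £{npv(profile):,.0f}k")
```

Comparing options on a consistent, discounted basis like this is what allows the preferred option to be identified and later revisited by the evaluation.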

Simultaneously, policy and product teams should work with M&E analysts to begin scoping the evaluation design. This should happen in parallel with the intervention design to ensure the right data are collected and the evaluation can deliver useful findings, is of appropriate quality and is proportionate. This is discussed further in the ‘When to engage’ section.

During and after the intervention

There are broadly 3 evaluation types that support each other as key elements of the accountability and learning of any policy, programme or project:

  • process evaluation
  • impact evaluation
  • value for money evaluation[footnote 7]

In government, evaluation must follow standard guidance, which is set by the Magenta Book.

In an ideal world, the intervention will have a business case and an economic appraisal. The evaluation then allows us to assess whether the expected effects are indeed achieved by the intervention.

Process evaluation

A process evaluation:

  • seeks to answer the question: ‘What can be learned from how the intervention was delivered?’
  • examines activities involved in an intervention’s implementation and the pathways by which the policy was delivered
  • is useful to identify any issues with the implementation early in the process
  • on its own, is not enough to systematically assess the performance of a given policy and so is typically used in conjunction with impact and value for money evaluations

Monitoring

Monitoring:

  • tracks the data before, during and after the intervention to form the basis of the impact evaluation
  • typically integrates process evaluation

Digital products will typically be supported by performance analysts who look at the performance of products against GDS indicators, such as digital take-up, completion rate (accurate completions), cost per transaction and user satisfaction.
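As an illustration of how these indicators might be computed from monitoring data, the sketch below derives the four GDS measures from hypothetical monthly figures. The variable names and numbers are placeholders; in practice the inputs would come from analytics and finance sources agreed with performance analysts.

```python
"""Illustrative only: GDS-style service indicators from hypothetical monitoring data."""

# Hypothetical monthly figures for a digital service (invented for illustration)
digital_transactions = 1_800   # transactions completed via the digital channel
offline_transactions = 200     # transactions completed via assisted or offline channels
started_digital = 2_400        # digital transactions started
completed_accurately = 1_650   # digital completions passing validation checks
total_service_cost = 36_000.0  # £ across all channels for the period
satisfied_responses = 410      # users reporting satisfied or very satisfied
survey_responses = 500

digital_take_up = digital_transactions / (digital_transactions + offline_transactions)
completion_rate = completed_accurately / started_digital
cost_per_transaction = total_service_cost / (digital_transactions + offline_transactions)
user_satisfaction = satisfied_responses / survey_responses

print(f"Digital take-up:      {digital_take_up:.1%}")
print(f"Completion rate:      {completion_rate:.1%}")
print(f"Cost per transaction: £{cost_per_transaction:.2f}")
print(f"User satisfaction:    {user_satisfaction:.1%}")
```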

AI tools need a performance analysis framework that, in the first instance, looks at performance during the development of the tool. For LLMs, performance analysis (monitoring) benchmarks (indicators) should better reflect real-world use cases and test models from different angles. These will include accuracy, but also issues such as speed, toxicity, fairness, security, reputational risk and the cost of errors.
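A minimal sketch of an offline benchmark harness covering a few of these indicators (accuracy against reference answers, response latency and a crude toxicity flag) is shown below. The stand-in model, test cases and blocklist are hypothetical; a real framework would use agreed benchmark sets and properly validated toxicity, fairness and security measures.

```python
"""Illustrative only: a tiny offline benchmark harness for an LLM-based tool.
The model, test cases and toxicity check are placeholders."""
import time

BLOCKLIST = {"stupid", "useless"}  # crude stand-in for a real toxicity classifier


def fake_model(prompt: str) -> str:
    """Stand-in for the LLM under test; a real harness would call the deployed model."""
    return "You may need an export licence for dual-use goods."


def run_benchmark(test_cases):
    """Score the model on accuracy, latency and a crude toxicity flag."""
    results = []
    for case in test_cases:
        start = time.perf_counter()
        answer = fake_model(case["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "correct": case["expected_keyword"].lower() in answer.lower(),
            "latency_s": latency,
            "toxic": any(word in answer.lower() for word in BLOCKLIST),
        })
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "mean_latency_s": sum(r["latency_s"] for r in results) / n,
        "toxicity_rate": sum(r["toxic"] for r in results) / n,
    }


test_cases = [
    {"prompt": "Do I need a licence to export encryption software?",
     "expected_keyword": "licence"},
    {"prompt": "Which documents do I need to export food to the EU?",
     "expected_keyword": "licence"},  # placeholder expected keyword
]

print(run_benchmark(test_cases))
```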

Impact evaluation

An impact evaluation:

  • seeks to answer the question: ‘What difference has an intervention made?’
  • focuses on the changes caused by an intervention; measurable achievements which either are themselves, or contribute to, the objectives of the intervention[footnote 8]

There are different types of impact evaluation that can be conducted, and the chosen method will depend on the intervention type and the data available. The most robust methods involve some form of experimental or quasi-experimental design with comparison groups to establish whether observed impacts would have occurred in the absence of the intervention. Where this is not possible, before/after approaches are used.

Baseline data is required to enable impact evaluation (see the section on monitoring). However, on its own, impact evaluation cannot assess whether the benefits of those changes (outcomes) to the UK economy outweigh their costs, and so VfM evaluation is required.

Randomised controlled trials (RCTs) might not always be feasible; however, LLM-based tools are particularly well suited to deployment as small RCTs, which randomly allocate some users to the LLM-based tool and others to a ‘control’ or ‘comparison’ group.

RCTs are best suited to measuring what would have happened in the absence of an intervention, as well as the impact of any unforeseen consequences, particularly if supported by qualitative process evaluation. All trials will need to collect robust baseline data before the tool is made available to users, monitor what is happening throughout, and return to users to learn more after the trial has closed.
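A minimal sketch of the allocation and analysis steps for such a trial is shown below, under the assumption that the outcome of interest is the time taken to resolve an enquiry. The sample size, outcome data and 50:50 split are invented; a real trial would be properly powered, agreed with M&E analysts in advance and analysed alongside qualitative process evidence.

```python
"""Illustrative only: random allocation and a simple difference-in-means
for a hypothetical LLM-tool trial. All data are invented."""
import random
import statistics

random.seed(42)  # reproducible allocation for the sketch

# Hypothetical pool of service users
user_ids = [f"user_{i:03d}" for i in range(200)]
random.shuffle(user_ids)

treatment = set(user_ids[:100])  # offered the LLM-based tool
control = set(user_ids[100:])    # existing service only

# Hypothetical outcome: minutes taken to resolve an enquiry
outcomes = {
    uid: random.gauss(32, 6) if uid in treatment else random.gauss(38, 6)
    for uid in user_ids
}

treatment_mean = statistics.mean(outcomes[u] for u in treatment)
control_mean = statistics.mean(outcomes[u] for u in control)

print(f"Treatment mean:   {treatment_mean:.1f} minutes")
print(f"Control mean:     {control_mean:.1f} minutes")
print(f"Estimated effect: {treatment_mean - control_mean:+.1f} minutes")
```

The comparison group is what allows the estimated effect to be read as the difference the tool made, rather than a change that would have happened anyway.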

Value for money (VfM)

Value for money (VfM):

  • seeks to answer the question: ‘Is this intervention a good use of resources?’
  • assesses whether the benefits of the policy outweigh the costs, and whether the intervention remains the most effective use of resources[footnote 9]
  • can be conducted in the form of cost-efficiency, cost-benefit analysis and the valuation of non-market impacts

Tracking of the costs associated with the policy is required to enable VfM evaluation. VfM assessment needs to follow Green Book principles.
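As an illustration, the sketch below computes a benefit-cost ratio and a cost per outcome from hypothetical monitored costs and monetised benefits. All figures are invented; real VfM analysis must follow Green Book discounting and valuation guidance.

```python
"""Illustrative only: benefit-cost ratio and cost per outcome from
hypothetical monitoring data. Real VfM analysis follows the Green Book."""

DISCOUNT_RATE = 0.035  # Green Book social time preference rate


def present_value(values, rate=DISCOUNT_RATE):
    """Present value of a stream of annual values (year 0 first)."""
    return sum(v / (1 + rate) ** year for year, v in enumerate(values))


# Hypothetical annual figures, year 0 first
monitored_costs = [700, 120, 120, 120]         # £ thousands, from cost tracking
monetised_benefits = [0, 300, 380, 420]        # £ thousands, from impact evaluation
enquiries_resolved = [0, 4_000, 5_200, 5_600]  # outcome counts, from monitoring

pv_costs = present_value(monitored_costs)
pv_benefits = present_value(monetised_benefits)

print(f"Benefit-cost ratio: {pv_benefits / pv_costs:.2f}")
print(f"Cost per enquiry resolved: £{1_000 * pv_costs / sum(enquiries_resolved):,.2f}")
```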

Figure 4: main evaluation types

Required data

Where policies form part of a wider offer (for example, where multiple digital and in-person services make up the DBT ‘export promotion offer’), it is recommended that consistent data is collected across all services to allow for complete M&E. See the example data recommendations for export promotion products and services in Annex A.

Data should be collected on a treatment and a comparison (control) group before the service or tool is made available to users; monitoring should continue throughout, and evaluators should return to users and the control group after the treatment to measure impact.

Where in the user journey the data would be captured also needs to be assessed.

The data may be captured during the initial registration process or during the completion of the export plan or other platform services.

Stages of evaluation

1. Evaluation scoping

This involves developing a theory of change (ToC), which facilitates a thorough understanding of the intervention and how it is expected to achieve the desired outcomes. It is important here to capture the aims and objectives set out in the original business case.

2. Evaluation design

This involves identifying the right evaluation approaches that will meet the learning goals.

3. Evaluation methods

This involves deciding on data collection and analysis methods; complementary methods should be used where possible (a mix of quantitative and qualitative data).

4. Conduct the evaluation

This involves executing the evaluation and modifying it in response to learning, policy changes and stakeholder needs.

5. Disseminate, use and learn

This involves preparing the final evaluation outputs, and disseminating and actioning the findings.

AI project life cycle

There are many proposed life cycle options covered in the literature. The majority cover the following themes: problem definition; data acquisition and preparation; model development and training; model evaluation and refinement; deployment; and machine learning operations.

The key distinction relevant for evaluation lies in the:

  • evaluation of the development phase: consider here what the right performance indicators are to monitor and evaluate. Accuracy is important, but issues such as speed, toxicity, fairness, security, reputational risk and the cost of errors also need to be considered. In the testing phase, consider how to better reflect real-world use cases and test models from different angles. Baselining is relevant at this stage, including the current cost of delivering the service

  • evaluation of the deployment phase: here evaluators will want to set up an evaluation plan to understand the social and economic impact of the tool. Questions will include: Is it better than doing nothing? Can the AI tool be deployed as an RCT so that it can be truly compared against the current model of delivery? What will be the effect on the population overall? What is the error rate? What are the unforeseen effects?

Table 8: the AI life cycle

Stages of the AI life cycle and their activities:

  • Business and use case development – the problem or improvement is defined and the use of AI is proposed; outline how the development phase will be tested
  • Design phase – the business case is turned into design requirements
  • Training and test data procurement
  • Building – the AI application is built and tested (evaluation of the development phase)
  • Deployment – the AI system goes live; the M&E team works with the deployment team to assess the deployment options that best secure robust evaluation
  • Monitoring and evaluation – the performance of the system is assessed against key indicators, against the original business case and for unintended effects

When to engage

M&E Analysts should be engaged early in the policy development cycle.

The HM Treasury Green Book on what to consider when designing evaluations[footnote 10] suggests that building evaluation into an intervention’s design is a critical way to ensure that:

  • the evaluation delivers useful findings to those designing and implementing policies
  • the evaluation is of appropriate quality for its intended use
  • the right data are collected in the most cost-effective and efficient way possible

The relationship between an intervention’s design and evaluation usefulness is crucial. Small changes in an intervention’s design can make the difference between a high-quality, useful evaluation and one that is not able to answer the key questions under consideration (does it work? to what extent? for whom? and why?)[footnote 11].

Where to begin

Before policy implementation, ensure that:

  • you know the outcome of the case you put forward to the DBT AI governance committee
  • a budget has been allocated for M&E
  • you have an analyst to lead and set up the framework
  • you have an agreed framework that meets the needs of your project
  • you have established baseline data and/or a counterfactual comparison group (to answer the question ‘what would have happened in the absence of the intervention?’)

‘Good’ AI evaluation

A good evaluation is fit for purpose and proportionate, and reflects the needs of decision-makers and those scrutinising the policy. Proportionality will be guided by a range of factors including likely cost-efficiency gains, unintended consequences, ethical issues, strategic importance, cost and other risks. Evaluations should be useful, credible, robust and proportionate.

While a general M&E framework can provide some useful principles on what good evaluations should involve, the approach will generally be decided on a case-by-case basis, depending on the intervention’s design and objectives.

The same principles will apply to evaluations of policies which use AI.

The information provided in this paper is intended to be used as a guide. The full evaluation approach should be discussed and agreed with key stakeholders to ensure it meets the key learning objectives.

Attention should be given to identifying essential evaluation data, including the data collected at the user registration stage and the costs associated with policy delivery, as this could determine the robustness of the evaluation that can be carried out.

The section on data collection in this paper should be used as a guide on what type of data might be relevant for the evaluation of business support policies utilising AI. The decision on how and when in the user journey this data is collected should be decided at the policy development stage in a discussion with M&E analysts due to evaluation implications.

Example of M&E data needs for export support products and services

In general, in order to have a complete view of platform users, be able to segment and personalise the service, triage enquiries and fully inform VfM calculations, the following data should be captured (see Annex A for details):

Business characteristics

  • business name – check against Companies House data
  • business size (turnover and number of staff)
  • business region (Head Office and Business Unit) – need to understand both for levelling up
  • countries currently export to

Triaging information

  • sector
  • segmentation – the 6 golden questions based on the segmentation of exporting businesses[footnote 12]
  • what are the themes – compliance with rules, documentation, costs?
  • which markets are causing most issues?
  • what is the value of exports being enquired about?
  • understand the context of the enquiry (What is their need?)
  • what is the specific task they are trying to do (What is their intent?)
  • what phase of the export journey does the enquiry pertain to (regardless of whether the business has exported before)

Input measures

Total cost associated with the delivery of the intervention (incl. staff and non-staff cost)

Output measures

  • DBT services accessed
  • which products were used, or which interaction team dealt with the enquiry
  • type of interaction/SD/product
  • outcome (was the enquiry successfully resolved and if so, how?)

Data coming from other sources

  • user satisfaction and reported impact

Note that while these are recommended to be captured, the actual data to be captured should be agreed on a case-by-case basis, to ensure that the data collected is aligned with the evaluation’s learning objectives.

  1. As set out in the DDaT Strategy 2021 to 2026, see Priority 4, Governance and Working. 

  2. DBT (2022) DBT’s monitoring and evaluation strategy

  3. GOV.UK Service Standard

  4. Department for Business and Trade (2022) Driving our performance: DBT’s monitoring and evaluation strategy 2023 to 2026

  5. Note these are not mutually exclusive and a holistic evaluation would normally deploy all 4 types of evaluation to triangulate. Where possible, we will also build upon emerging methodologies such as contribution analysis techniques, theory-based evaluations, meta-evaluation techniques, which can take elements of all or some of the evaluation types in the chart. 

  6. The Magenta Book is the UK government guidance on what to consider when designing an evaluation for a public intervention. 

  7. HM Treasury (2011) Magenta Book

  8. HM Treasury (2011) Magenta Book

  9. HM Treasury (2011) Magenta Book

  10. HM Treasury (2022) The Green Book

  11. HM Treasury (2011) Magenta Book

  12. Department for International Trade (2020) Segmentation of UK businesses research