Measuring the legacy and impact of major events: a toolkit

Question 1

Aim of the guide

Accepted Answer

This Guide sets out the suggested approaches to the legacy impact assessment of major events with the aim of supporting practitioners in planning and delivering their evaluation while also providing background for those commissioning or utilising the findings from evaluations. In short, legacy can be defined as impacts that persist a significant period after the event. There is no one way to evaluate the legacy impacts of a major event and the choice of approach will depend on many factors including the characteristics and objectives of the major event as well as the data and resources available to the evaluator. Legacy impacts of major events are difficult to prove in practice for a range of reasons, with limited examples of legacy being shown with robust evidence in the previous literature. This guide will help navigate the considerations, risks and actions to design and deliver a robust assessment of the legacy impacts of major events. Emphasis is given to the practical considerations and will offer a view of best practice acknowledging the limitations and constraints that event organisers and event evaluators face.

Event evaluators must compare different methodologies and select the ones that are most appropriate for their event. This point implies (and we wholeheartedly agree) the meaningful legacy must be bespoke to specific events, which therefore requires clearly stated objectives; which in turn requires a programme Theory of Change (ToC) to underpin it all.

Although major events are evaluated regularly in the UK and around the world, there are relatively few examples which take a long term view and evaluate the legacy of major events more than a year or two after the event has closed. Even fewer of these examples use counterfactual methods and achieve SMS or NESTA level 3 which require methodologies such as difference in difference. This guide therefore aims to provide support to future evaluations to capture these longer term impacts with robust methods.

Major sport and cultural events are neither inherently good nor bad. They do however create a platform from which desired positive impacts can be achieved, so long as they are planned for and leveraged correctly. It is overly simplistic to assume that events deliver desirable outcomes simply by virtue of being staged. Therefore, it is important to be clear about what outcomes the staging of an event is designed to deliver. Good practice indicates a five staged approach as shown in the bullet points below.

State the objectives that are relevant to the stakeholders to whom you expect to report.
Develop a ToC to identify the inputs and activities which will enable your desired outcomes.
Identify the target markets and segments of the affected populations on whom the event is intended to make an impact. This stage is necessary to enable data gathering to be specific to the relevant targets.
Choose the appropriate Key Performance Indicators (KPIs) from the menu of KPIs provided and where necessary devise locally bespoke measures that are consistent with the objectives
Implement relevant data gathering before, during and after the event to ensure that what is collected enables effective analysis of the extent to which objectives were achieved following an evaluation plan and data strategy.

Question 2

1. How to use the guide

Accepted Answer

The guide is designed to be used by a range of users to help support the design and implementation of evaluation methodologies to measure the effectiveness of major events in meeting objectives with particular focus on legacy. This includes outlining the decisions that need to be made and with consideration of both the practical and methodological limitations. The guide is split into sections where you can find the information you need:

Section 1: What is legacy?
Provides an overview of how legacy has been defined including the types of outcomes included in Legacy using the Preuss Legacy cube^{[footnote 1]}. The section also includes an overview of the reasons why legacy is difficult to measure.

Section 2: Designing the evaluation plan
A step by step guide for designing an evaluation plan from defining the objectives of a major event through to selecting the methodologies most appropriate to measure the impacts of a major event based on the specific characteristics of the event.

Section 3: Designing the data collection plan
Every evaluation requires data. This section provides an overview of the data needed depending on the objectives of the major event. Suggestions are also provided of commonly used data sources.

Section 4: Specific issues in evaluating legacy impacts
Measuring legacy comes with a number of challenges including confounding effects, changing administrative boundaries amongst others. This section builds on section 1 to set out these challenges in more detail, what their impacts are and how they can be overcome in the design of the evaluation.

Section 5: Toolkits
Building on the overviews of the different methodologies in section 3 the toolkits set out in more detail how the methods (Difference-in Difference approaches, propensity score matching and synthetic controls) can be incorporated in an evaluation. The methods identified are those that would reach level 3 on the Maryland SMS scale which requires a comparison of impacts before and after an event utilising a counterfactual. Each toolkit answers the questions:

What is the technique?
What does it set out to achieve?
What data do you need and where do you get it from?
How do you implement the technique?
Where and when has it been used in our field previously?
What are the pros and cons?
When is it appropriate to use it?

Figure 1: Event Evolution. [Source: Gratton & Preuss 2008]

1.1 What are the legacy outcomes of major events

Legacy is a relatively recent concept used in both sporting and cultural events that garnered popularity following the 2008 and 2012 Olympics and the European and UK City of Culture initiative. A widely accepted overarching definition of event legacy coined by Preuss (2007) considers the planned and unplanned, positive and negative, tangible and intangible structures created for and by an event that remain longer than the event itself. The diagram below illustrates that the three dimensions of legacy - the degree of planned structure, the degree of positive structure and the degree of quantifiable structure - form a ‘Legacy Cube’. The Legacy Cube consists of eight sub-cubes and a holistic evaluation of a major event would be needed to identify all legacies.

Figure 2: Legacy Cube. Source: Adapted from Preuss (2007).

There are existing frameworks and toolkits available in the public domain that provide guidance on measuring a range of impacts and legacies associated with major events. Notable examples include eventIMPACTS.com (endorsed by DCMS, Welsh Government, UK Sport, EventScotland and Tourism NI); the Association of Summer Olympic International Federations’ (ASOIF) Common Indicators for Measuring the Impacts of Events ; and, two guides produced by the Organisation for Economic Cooperation and Development (OECD): (1) How to measure the impact of culture, sports and business events and (2) Impact indicators for culture, sports and business events . As demonstrated in the table below, there is considerable overlap in the broad themes that are covered, which includes a mix of short-term impact indicators and long-term legacy outcomes.

Table 1: Themes covered by existing frameworks and toolkits

eventIMPACTS	ASOIF	OECD
Attendance	Economic (4 areas)	Economic (6 areas)
Economic	Image (4 areas)	Social (5 areas)
Social	Social (6 areas)	Environment (5 areas)
Media	Sport (2 areas)
Environmental	Environmental (6 areas)

1.2 Why are legacy outcomes difficult to attribute to major events?

Evaluating the legacy of major events poses several methodological challenges due to the complexity of measuring their impact on the various facets of legacy. In this guidance we set out how to overcome these challenges in your evaluation to estimate legacy impact as accurately as possible. It is important to note that in practice, it is nearly impossible to remove these issues but we should instead aim to reduce their impact on our estimates. The major challenges are detailed below:

Longevity

Determining the true impact of a major sporting event often requires assessing its effects over an extended time period, sometimes spanning years or even decades. Tracking long term changes in dimensions such as infrastructure, tourism, economy, and societal attitudes is challenging. Furthermore, impacts can occur at various points in time and are said to be ‘lagged’, which complicates their measurement. Some effects, like short-term economic spikes or improvements to infrastructure, might be more immediate; while others, like shifts in societal attitudes (e.g. attitudes towards people with a disability) or long-term economic impacts can take many years to occur. The majority of the existing research relates to periods during and immediately after events, with few papers focusing on either the preparation phase or the long term.

Attribution

Untangling the effects of a major event from other concurrent social, economic, or political changes (known as confounders) is difficult. With the passage of time, it becomes even more problematic to establish a direct cause and effect relationship between hosting an event and legacy outcomes as more confounding effects occur. Major events are often purposefully timed to coincide with other policies and initiatives to enhance the legacy of the events and should therefore be identified as part of the evaluation.

Data

Collecting and collating robust data before and after an event is essential to understand the direction and magnitude of any changes over time. However, obtaining accurate data that is fit for purpose is often problematic because resources for evaluations are limited and often are not commissioned in sufficient time to establish appropriate baselines. All too often the assessment of legacy is ‘bolted on’ to an event rather than ‘bolted in’, which results in compromises being made on methodological rigour and the scope of what can be achieved. This guide therefore suggests, as far as possible, the used of available secondary sources. Administrative data such as data from ticket sales could also provide a useful source of information.

Defining a Target Area

The area over which impacts of the event are felt is difficult to define. Events are likely to attract visitors from outside the immediate town, city or country holding the event. Impacts on participation are also likely to be spatially spread when events are televised. Similarly, major events are likely to impact businesses outside the local area, particularly where there is infrastructure investment. Care also needs to be taken when events have a predominant centre but also smaller events happening in other locations for example sailing events held in Weymouth during London 2012 Olympics and Paralympics.

Subjectivity

Measuring legacy can rely on subjective inputs, such as public opinion, short term attitudinal changes and perceptions. Quantifying these subjective aspects in a standardised manner, for example social value measurements, is subjective and technically demanding and there are no ‘off the shelf’ solutions.

Metrics

Deriving and agreeing appropriate metrics for impact evaluation is essential to capture ‘what changes’ and ‘to what extent’. Ideally, metrics should be based on objectives (i.e. planned outcomes) and hence may differ from event to event depending on the nature of the objectives. Determining which metrics are most relevant and how to measure them consistently remains a challenge. Despite the existence of various event impact evaluation frameworks, there is no consensus on metrics. Often research designs lack the sophistication necessary to do justice to the claims being made. For example, the dominant survey data collection techniques tend to be repeated cross sectional designs rather than longitudinal (panel) studies, which in turn limits the opportunity to determine causality.

Noise

As events take place in open systems, ‘noise’ such as economic conditions, geo-politics, or pandemics, can materially affect an event’s legacy, making it challenging to attribute the occurrence of specific outcomes entirely to an event.

Bias

It is common for evaluations (notably within the grey literature) to overestimate the positive impact of major events and to understate or even ignore the negative impacts. This point is particularly pertinent in the pre-event stages when advocates are bidding for funding to stage the event and tend to create a degree of optimism bias.

Question 3

2. Designing the Evaluation Plan

Accepted Answer

2.1 When to start the evaluation design

As detailed in the HM Treasury Magenta Book, evaluation should inform thinking throughout the life of the event, including before, during and after^{[footnote 2]}:

Before

Evaluation can be used to inform the design and implementation of the event, drawing on existing evaluation evidence and aligning the delivery to best achieve the desired outcome and impact (utilising a ToC, see Section 2.2). It is also important to start planning your major event evaluation with enough time to enable the measurement of impact through the event cycle. Once an event has been announced, impacts can already start taking place which will be missed if baseline data collection is not in place. It is suggested that any data collection starts before the event is announced which could mean beginning to plan your data collection 6 to 12 in advance of the event. If the event is part of a competitive process, data collection would ideally start before the event bid has been announced however in many cases this will be infeasible given that the event is not certain at this point and therefore funding for evaluation will not be in place. We therefore suggest to start any data collection as soon as possible and acknowledge the potential for missed impact within the evaluation report. Data could also be added to existing processes such as adding questions to regular surveys to reduce burden and costs.

During

During the event is where evaluation can present the greatest opportunity to influence the delivery of the event and help ensure that intended benefits are realised. Evaluations undertaken during an event would likely seek to understand how efficiently the event is being delivered/ is the event being delivered as intended, can the design or deliverability be improved (either during delivery or lessons learnt for future delivery), are there any signs of emerging outcomes?

After

Once the event has finished, it is possible to understand the impact of the event in terms of the benefits delivered, and the extent to which the benefits of the event exceeded the costs. This could include an overall appraisal of the success of the event against its objectives, contribution (as well as effect size and significance) the event made towards its intended outcomes, unintended and/or negative impacts, whether the event affected demographic groups differently and whether the event represented value-for-money.

Whilst the accompanying case studies used to illustrate this Toolkit in practice are estimating impacts and potential legacy effects after the event, it is recommended that future publicly funded events held in the UK evaluate across the entire lifetime of the event, following the principles set out in the HM Treasury Magenta Book, to ensure best practice evaluation.

2.2 Theory of Change

A Theory of Change (ToC) can be used to synthesise the inputs, outputs and anticipated outcomes of the event, modelling how an event is expected to achieve desired outcomes and benefits. This not only helps in understanding how an event is intended to work but also how an event should be measured and evaluated.

The HM Treasury Magenta book provides guidance as to how to develop and utilise a ToC. a ToC typically involves considering the proposed inputs and the causal chain that leads from these inputs through to the expected outputs and outcomes. The diagram below sets out a framework to follow when setting out the causal mechanisms by which an intervention is expected to achieve its outcomes.

Figure 3: Theory of Change (ToC) framework

Developing a ToC will typically involve the stakeholders involved in designing and executing the intervention. This can be through workshops or consultations. Alongside this, research methods, including evidence synthesis, focus groups, and expert panels, can be used to gather and synthesise evidence to use in its development. To help, we have provided a set of ToCs which can be used as a basis for events to develop their own ToCs. Each event is likely to have its own specific objectives, however the ToCs included within this Toolkit generalise these and therefore include the objectives, inputs, activities, outputs, outcomes and impacts that we would expect, and most commonly see, from major events. It is anticipated that practitioners would tailor the logic model included within this Toolkit around the major event they are evaluating.

When considering the outcomes in the ToC it’s important to remember that outcomes will arise, build up and potentially decay over different time periods. Additionally, outcomes will be felt by different groups and spatial areas.

Instead of developing one overarching ToC we have produced five theories of change across different outcome types, though it should be noted that activities can create impacts, outcomes and spillovers that span multiple outcome types. The groups were decided upon by synthesising previous evaluation framework indicators and asking Large Language Models to cluster and group them.

The diagrammatic ToC for each of the five categories below can be found in Appendix 1:

Cultural and social Impact
Economic and Employment Impact
Reputation
Health and Wellbeing
Environmental Responsibility and Accessibility

2.3 Choosing indicators

Whilst the ToC identifies the anticipated outcomes and impacts of an event, indicators provide a means of measuring the outcome and impacts – and therefore provide a basis for evaluating the extent to which an event has achieved its objectives.

This is a key initial step as the choice of econometric methodology to be used to establish causality is dependent on the availability of data. Event objectives may change through event planning, delivery and even post-event and therefore indicators may well need to be revisited and the evaluation plan updated throughout the evaluation.

As identified in Section 1.2, there are three frameworks which provide guidance on indicators which can be used to evaluate the outcomes of major events:

Event Impacts^{[footnote 3]}
The Association of Summer Olympic International Federations’ Common Indicators for Measuring the Impact of Events^{[footnote 4]}
The Organisation for Economic and Co-operation and Development guide to Impact indicators for culture, sports and business events^{[footnote 5]}

Section 2.4 below sets out the scoping of data sources within the context of the evaluation plan, whilst Section 3 offers a discussion on primary data collection and availability of secondary data sources.

2.4 Scoping data sources

At the start of the planning stage a decision on the specific datasets is not required but it is worth assessing the available data, in particular the level of aggregation, spatial coverage and variables as this will impact the choice of counterfactual design, research design and estimation method. The annex includes a bank of commonly used datasets with key information including where the data is available from, and Section 3 includes a discussion on primary and secondary data sources. It is worth considering that to achieve a difference in difference approach you will need to have access to data in both intervention and comparator areas. Therefore, the more you are reliant on primary data collection to measure your outcomes of interest, the more expensive it will be to undertake counterfactual analysis as you will need to collect data in both intervention and comparator areas. We therefore suggest using secondary data where possible and to consider the trade-off between a less specific existing dataset that could enable more thorough analysis against a more specific primary data collection that might limit the analysis that can be undertaken both in terms of counterfactual and timeframe.

2.5 Choosing the target area(s)

In order to assess the legacy and impact of major events, we must understand the people, places and area(s) which are likely to be affected by the event.

As set out in Section 1, major events can take place within a town or city, at a regional level or nationally across multiple areas – we refer to this as the event area. However, the area where the impacts of the events are realised may be much larger than this. For example, events may attract visitors from outside of the specific city or region creating wider inspiration effects on participation than the event area. Alternatively, the event may attract media coverage further increasing the reach of the event, and therefore should be considered when defining the target area of the event.

It is also likely that the target area may be different between indicators, for example economic, employment and tourism impacts may be more concentrated on the area around the event, whereas changes to participation and a sense of pride may occur nationally.^{[footnote 6]} Practitioners should consider the extent to which different indicators require different counterfactuals to be identified and econometric methods to be used.

Caution should be taken when establishing a target area as impacts will likely become more diluted the further you move from the event area which will impact the ability to establish a causal relationship between the event and chosen indicators. This dilution of effects is further exacerbated when assessing legacy impacts over the long term as the causal relationship becomes even harder to establish. We therefore suggest being conservative in selecting target areas for indicators based on the anticipated magnitude of changes in indicators.

Example from the Literature

The absence of a well-defined target area is a common challenge in major event evaluation. Whilst we were unable to find applicable examples specific to the evaluation of major sporting and cultural events, examples can be found from evaluations of other spatially targeted interventions where well-defined boundaries are not available.

Nathan (2019) evaluated the Tech City Programme – a programme which aimed to grow a localised group of technology companies (of approximately 2,800 firms) centred on the Shoreditch and Old St roundabout. The Tech City Programme included a range of interventions including place branding and marketing, business support for targeted firms, tax breaks for early-stage investors and attempts to improve firm to firm co-ordination.

Given that there was no clearly defined target area or boundaries in which beneficiaries of the Tech City Programme were expected to lie, Nathan (2019) defined the target area as Lower Super Output Areas (LSOAs) within a 1km ring around the Shoreditch and Old St roundabout (see figure below from Nathan (2019)). Nathan (2019) further divided the target area up into ‘distance rings’ increasing in increments of 250m from the Shoreditch and Old St roundabout. In order to calculate the distances and define the target area, Nathan (2019) used the centroid of the LSOA as the basis for estimating the linear distance from the Shoreditch and Old St roundabout.

Figure 4: Diagram showing use of target areas. Source: Figure 1 from Nathan, M. (2019) Does Light Touch Cluster Policy Work? Evaluating the Tech City Programme

2.6 Defining a Counterfactual

To understand how a major event has created impact (either in the local area or on a more national scale), the people, places and area(s) affected by the programme need to be compared against the scenario where the major event was not held – the counterfactual scenario.

Major events don’t happen in a vacuum. Numerous other factors including economic trends, political changes and social shifts are always at play. A counterfactual helps us disentangle the specific influence of the event from these external factors, revealing the true impact of the event. If outcomes were measured before and after the event without a counterfactual we would not be able to understand whether any change in the outcomes was because of the event or any other confounding effects.

Given that the major event was held, the counterfactual of the event not being held is not observable – it is a hypothetical scenario that does not exist. To overcome this, there are a range of potential methods that can be utilised to estimate what would have happened in the absence of the event. We therefore want to select a counterfactual which is similar to the target area in all aspects other than holding the event. Similarity could be based on many variables, most notably size, economic output, population demographics and indicators or other social welfare indicators (e.g. engagement in sporting or cultural activities).

The remainder of this section explores different approaches to define counterfactual areas, as well as considerations to ensure that the appropriate counterfactual is defined.

2.6.1 Considerations in defining a counterfactual

Selecting an appropriate counterfactual is an important step in limiting the problems associated with measuring the legacy and impact of major events identified in section 1.3. At the forefront of these challenges is attribution where social, economic, or political changes affect our indicators of interest. It’s therefore important to identify a counterfactual which will be similarly affected by these events. The best way to do this is identifying counterfactuals with similar characteristics such as similar population demographics and geography. There may also be discrete changes such as other planned major events that are known when selecting a counterfactual, unless we can build them into our evaluation (see section 2.6.3 explaining the use of an event series) it’s best to avoid using their areas as counterfactuals.

2.6.2 Type of counterfactual area

We want to identify a counterfactual area (or combination of areas) with similar characteristics to our target area. For example, if we are evaluating an event where the target area is a city, the best counterfactual area would be another similarly sized city with similar characteristics.

Another consideration when choosing the type of counterfactual is data availability. Domestic counterfactuals are likely to pose less of a challenge as national datasets can be used to track indicators more consistently. There are a range of openly available datasets that can be used across the UK which include typical indicators used in event evaluation. However, there are far fewer in an international context and therefore the indicators you can use may be limited or primary data will be needed.

2.6.3 Methods for counterfactual matching/selection

There may be a significant number of potential counterfactual options. Below includes a number of methods for selecting an appropriate counterfactual with more detail provided in section 5 on how these methods can be applied in practice:

Synthetic control: A weighted average of all possible control cases – i.e. a synthetic counterfactual. The synthetic control uses data to describe the evolution of the outcome variable of interest before and after the treatment (or event) and a set of control variables describing other features that are thought to be predictors of the outcomes of interest. This is a better option when longitudinal data is available for multiple years before the event and the event is a one-off single event (or small number of cases) receiving the treatment of interest – e.g. a single city hosting an event.

Propensity score matching: Geographical unit from the treatment group are ‘matched’ with members of the comparison group that share a similar estimated probability of being within close proximity to the event host area (i.e. the propensity score). Where data is available at a more disaggregated level (e.g LSOA) the treatment and control area would be broken into geographical units (informed by available data and expected impacts). Where data is available at the event area level (e.g city or region), propensity score matching can be used to identify the most relevant counterfactual cities or regions. Propensity score matching is an appropriate option when using a domestic counterfactual and necessary data is available for relevant variables across a sufficient number of matched geographical units.

Manual selection: Instead of using statistical methods, counterfactuals can be selected based on knowledge of the areas and their similarities. We only suggest this is used where above methods are not possible due to unavailable secondary data, this may be particularly apparent when using international case studies. Equally, the evaluator can use discretion to manually remove counterfactual options where data has identified a match but in reality there are reasons why they are unsuitable e.g., political sensitivities or local policy knowledge. This is an appropriate option when data is unavailable to use statistical methods.

Unselected bids from a competitive process: Restrict comparisons only to unselected cities (or areas). It could be argued that unobserved characteristics (e.g. similar levels of motivation, rationale for applying, ability to deliver, etc.) are less likely to differ between successful and unsuccessful bidders compared to a successful bidder and non-bidder – reducing the level of potential bias and therefore producing more robust results. This could also be used in conjunction with the matching models described above to control for pre-intervention characteristics and increase the robustness of the analysis. However, unselected sites may also get benefits from going through the competitive process^{[footnote 7]}, for example businesses may start anticipating a successful bid and expand in the area or the award process itself may give benefits to unselected areas (which would therefore underestimate the impact of the event). There may also be political sensitivities in including an unselected bid as a counterfactual that should be considered.

Using an event series: In situations where major events move between host geographies within the UK year on year, a pipeline design could be deployed to exploit the long timeframes over which the events are held. In this set up, those that host the event later would serve as a comparison group to those that hosted the event earlier. As comparisons are restricted to cities (or areas) that host the same major event, estimates should be robust to problems caused by systematic differences between cities (or areas) that host an event and those that do not. Other counterfactual approaches that rely on using non-treated areas may exhibit biases as unobserved characteristics between the treated and comparison groups cannot be controlled for. However, within a pipeline design, there is greater confidence that these unobserved biases are reduced given all units were eventually treated. For example, the UK City of Culture moves around different host cities within the UK. Small geographical areas around Kingston upon Hull, Coventry and Bradford would serve as controls for small geographical units around Derry^{[footnote 8]}. Comparisons could then be made between treatment and comparison units, where it would be expected that there would be minimal systematic differences between all host cities, Derry in 2013, Kingston upon Hull in 2018, Coventry in 2021 and Bradford in 2025.

Natural Experiments: There may be occasions where an opportunity arises to use a natural experiment to understand the impacts of major events. A natural experiment refers to the idea of exploiting naturally occurring factors beyond the control of the researchers to explore key variables of interest. Natural experiments (to some extent) mimic a controlled experiment in that one group is affected by a change, and another is not. A comparison of the two affected groups provides the basis for understanding the impact of the naturally occurring factors.

One example of a natural experiment could be the decision the English FA made to use the Millennium Stadium in Cardiff from 2001 to 2006 whilst Wembley underwent refurbishment. This may allow for the exploitation of time series data to test the impacts of hosting recurring sporting or cultural events through shift in hosting responsibilities from one stadium/ venue through temporary closure, to another stadium in a different city. This could allow the causal effects to be isolated, compared to contemporaneous investment which may not have the same start/end date. Furthermore, the persistence of these effects could further be tested.

2.6.4 Alternatives to spatially defined counterfactuals

In some cases it may be appropriate to explore alternatives to a spatial counterfactual, either because:

The treatment effect involved is expected to arise partly or primarily through the national media coverage of the event rather than through more localised channels, and/or:
There is particular interest in the wider impacts of the event on particular demographic groups.

For example, the impact of the European Women’s Football Championship 2022 would likely want to explore the extent to which the event had:

Generated wider interest in the women’s game and perhaps in women’s sport as a whole - as indicated, for example, in increased attendance at professional matches, viewing figures for televised matches and associated positive effects on the finances and viability of the professional game;
Generated increased participation in the amateur game by women and girls at all levels – with issues of how far this was associated with overall increases in participation in physical activity or involved displacement of activity from other sports.

Untreated populations or demographic groups would therefore provide the counterfactual for example participation in other women’s sports or participation in men’s football might provide a useful potential comparator.

2.6.5 Choosing a counterfactual approach

As shown, there are several options to choose from when selecting a counterfactual approach. The following diagrams provides an outline of the main considerations and how they can lead you to choosing an approach.

Figure 5: Decision map for selecting a counterfactual approach

2.7 Model Specification

This section sets out suggested options for the model specification. We suggest using Difference in Differences (DiD) which examines the difference in the outcome over time in the comparison group, and subtracts this from the difference in outcome over time in the treatment group (i.e. an area local to a major event). Information to help choose the type of DiD approach is set out below, further details on how to apply the methods in practice in section 5:

2.7.1 Staggered Difference-in-Difference

In the case where an event moves between different cities over time or around a host city (creating several ‘treated units’ at different points in time) a staggered DiD could be deployed. This set up is robust to potential biases arising from comparisons between later and earlier treatment groups, and from heterogeneous treatment effects across groups. A major benefit of staggered difference-in-difference is that the method uses future events or events areas as a counterfactual which strengthens its applicability as a control.

2.7.2 Fixed Effects Difference-in-Difference

There is a risk that unobservable differences could bias comparisons between the treatment and comparison group. While matching methods can be used to develop a comparator area that closely matches the observable characteristics of areas close to the cultural or sporting event venues, the matching methods are unable to control unobservable differences between the two groups.

The analysis will be driven by longitudinal data providing annual observations of the outcomes of interest at a small area level before and after the cultural or sporting event. This will support the application of econometric methods that can mitigate some of these issues. For example, it will be possible to apply ‘fixed effects’ regression models that are robust to unobserved differences between areas that do not change with time and allow for unobserved time specific shocks or trends affecting all areas (e.g. disruption caused by the introduction of covid-19).

2.8 Applying Spatial Analysis

As set out in section 2.5, major events do not typically have a well-defined target area in which benefits are expected to be realised. An econometric assessment of the impacts of legacy events would therefore need to consider the extent to which impacts vary with distance from the host city – potentially an important addition to the analysis if there is expected to be heterogeneity in the impact of hosting a major event between areas that are closer and further away from the host city. This would also support an understanding of any issues of displacement or crowding out driven by the events. For example, if the events encouraged local clustering around the venues, then this would be visible in positive impacts close to the venues and negative effects further away.

Spatial analysis would involve making comparisons between areas that are closer and further away from the host city, in both treatment and comparison areas, to help reveal the impacts of the major event on the local populations. A common strategy for implementing this approach is to allocate respondents to ‘buffer zones’ of increasing distance (e.g., 0km to 5km, 5km to 10km, etc.) from the centroid of the host area. Comparisons would then be made between outcomes for areas less than 5km away from an area centroid with those less than 5km away from a comparison area centroid, and so on at increasing levels of distance. However, as previously set out, there are no well-defined target areas for major events. This would necessitate additional scoping work to derive suitable ‘buffer zones’ to include in the analysis.

This measure of distance can then be included in a regression framework that compares change in the relevant outcomes in areas closer to and further from the host area centroid, before and after implementation of the scheme. This type of approach is known as a ‘distance-decay’ model.

2.9 Lighter touch Evaluation Approaches

A conventional approach to impact evaluation is typically based on individual-level outcome data collected before, during, and after the event in question, often over many time periods. It also concerns itself with creating a credible counterfactual and various robustness checks that make the case for its validity. By far the most important example of such an approach are difference-in-differences (DiD), with either a natural, matched, or synthetic control group. While being best practice for evidence based policy-making, such conventional approaches may take time, effort, and are also otherwise resource-intensive. This section explore ‘lighter-touch’ approaches or alternatives that can still be used to discern long-term impacts of events in a quicker, less data intensive or cheaper way.

Simplifications

Various simplifications of conventional approaches are possible that individually or jointly can save time and effort, making these approaches lighter-touch:

Aggregated time series could be used instead of individual-level outcome data towards, which reduces the need to access often restricted, secure-access micro data. Such time series often exist for many outcomes at various levels of spatial aggregation, and are readily accessible for free from the ONS or other public sources (e.g. GDP at the Local Authority District (LAD) level, regional wellbeing indicators). If necessary, spatial units can also be constructed using population-weighted averages of time series.

The number of time points at which outcomes are measured could be reduced, for example to two before (to check for common trends), none during, and only few after the event has taken place. This could be particularly helpful if the conventional impact evaluation involves the collection of own primary data.

Modification of the counterfactual. Typically, analysts strive for counterfactuals that are as comparable as possible to the treated, to be shown using balancing checks in pre-treatment observables and similar levels in common trends. Yet, counterfactuals do not need to be perfectly comparable to the treated. In fact, they can be quite different. For them to be valid, it is sufficient to “only” follow the common trend, even on very different levels of pre-treatment outcomes. Such a “balancedness in bias” greatly reduces the requirements imposed on counterfactuals and enables analysts to seek out counterfactuals even amongst very different units, for example from aggregate national trends. If the conventional impact evaluation involves the collection of own primary data, a cost-effective workaround to sampling the counterfactual could be to create a matched observational counterfactual from existing secondary data – an entirely different data source.

Drop the counterfactual altogether. While far from being best practice of evidenced-based policy making, before-after comparisons are often used to get a quick glimpse at impact. In certain cases, such as when time elapsed between the start and end of an intervention is relatively short (say, only a few weeks), a before-after comparison can get quite close to the ‘true’ impact of an intervention. For longer-term impacts, however, this approach is less advisable as time trends become more likely to influence outcomes, perhaps more so than the actual intervention did.

Reduce the number of outcomes. Often, the desire to measure a wide range of outcomes imposes constraints on which secondary data can be used, or logistically and financially if own primary data collection is involved. Here, a focus on outcomes that readily lend themselves as (ideally monetisable) net welfare measures, such as life-satisfaction points (or WELLBYs), may be advisable. The idea behind net welfare measures is that they capture all possible mechanisms/channels through which an intervention may affect individuals, positively and negatively, and so measure the overall net impact of what has happened to them. If analysts are not interested in showing these mechanisms/channels, a focus on one overarching measure can greatly ease logistical and financial constraints.

Qualitative Alternatives

Besides simplifications to conventional approaches, which are almost entirely quantitative in nature, qualitative alternatives exist that, by definition, do not require a sophisticated causal research design (though each alternative has their own methodology and requires expert knowledge and careful implementation). Importantly, they can be standalone or complementary quantitative approaches, to shed more light on mechanisms/channels where these cannot be fully (or at all) captured by quantitative data:

Interviews and Focus Group: This alternative engages various stakeholders through interviews and focus groups (interviews of specific groups) to understand the temporal, longer-term impact of past mega events on themselves and their communities. Thereby, stakeholders can be diverse, ranging from private persons to public officials to business or charitable organisations. This can entail various outcomes, from social cohesion to general wellbeing as a perceived or felt consequences of the event. In terms of feasibility considerations, analysts need to consider how to ensure representative and ethical participation, and whether findings can be directly linked to the event. This necessitates skilled moderators, consent procedures, and relevant support systems. So, this alternative can provide in-depth insights but may also be subject to selection bias and limited external validity, if not properly conducted. Hence, significant planning, recruitment, and facilitation expenses may be involved, while the analysis of qualitative data can be time-intensive. Interviews and focus groups can be valuable for a nuanced understanding of longer-term impacts, either standalone or complementing conventional, quantitative approaches.

Case Studies: Case studies are more focussed than single interviews of specific focus groups, zooming in on specific individuals, groups, or organisations. They are in-depth explorations of experiences of the event and pathways of adaptation over time, focussing on specific outcomes. As with interviews and focus groups, self-selection and external validity are challenges, and case studies similarly require consent and ethics approval. Unlike interviews and focus groups, how-ever, they can involve a range of data, from interviews to diaries, and inputs from third parties. In terms of robustness, case studies offer rich insights, though at the expense of generalisability. Yet, case studies can provide valuable narratives of potential longer-term impacts, either standalone or complementing conventional, quantitative approaches.

Utilising Theory Based Methods

Gathering both qualitative and quantitative information can support a theory-based approach which provides a structured way of understanding an event using hypotheses of how and why the event is producing desired outcomes and testing them. This approach can help identify what is working within the event delivery and where outcomes could be increased, or legacy could be better leveraged. This approach is therefore a useful addition to an impact evaluation to help understand how the event has created legacy and impacts. The following methods can make up a theory-based evaluation with more detail available in the HM Treasury Magenta Book:

Realist evaluation- Specific, hypothesised causal ‘mechanisms’ for an ‘outcome’ are articulated in ‘context’ and evidence gathered for each. The ‘mechanism’ explains why participants may take advantage of an opportunity or not depending on the ‘context’, and their understanding is key to causal inference.
Contribution analysis- Step-by-step process used to examine if an intervention has contributed to an observed outcome by exploring a range of evidence for the ToC. It gives an evidenced line of reasoning rather than definitive proof.
Process tracing- A structured method examining a single case of change to test whether a hypothesised causal mechanism, such as that proposed by the ToC, explains the outcome.
Contribution tracing- Participatory mixed-method to establish the validity of contribution claims with explicit criteria to guide evaluators in data collection and Bayesian updating to quantify the level of confidence in a claim. Includes a contribution ‘trial’ with all stakeholders to establish what will prove/disprove the claim.

2.9.1 Choosing a model specification

As with the choice of counterfactual there are several model specifications to choose from. The diagram below provides an indication of what context is needed to make this decision and how this enables you to choose the most appropriate model specification.

Figure 6: Decision map for selecting a model specification

2.10 Analytical software required

An evaluation of the long-term impact and legacy of major events most likely require the use of both statistical software and Geographical Information System (GIS) software.

The potential counterfactual designs set out in Section 2.6.3 will likely require an area to be spatially defined according to administrative boundaries. GIS software can be used to map the administrative boundaries and can also be used to inform distances from the epicentre of the event (for distance-decay approaches). Common GIS software includes R and QGIS which are all free to use, and ArcGIS can be used for free providing it is for non-commercial use.

Using the econometric methods set out in the later chapters of this report requires the use of specialised statistical software. The most common programmes used within the field of quantitative impact evaluation are Stata (which requires a paid licence), R and Python (both are free to use).

Question 4

3. Designing the data collection plan

Accepted Answer

The availability of appropriate data will dictate the feasibility of evaluating the major event using the desired counterfactual approach. This section sets out a non-exhaustive list of secondary data sources that may be useful to evaluate the legacy impact of major events and considers how primary data collection could be used to underpin an evaluation.

3.1 Primary Data Collection

To improve consistency across major event evaluations and to avoid additional costs we suggest using secondary data where possible. Where secondary data is not available, either because the outcomes of interest are not included in available datasets or do not have the geographical coverage required, primary research can be used. An appropriate sampling technique should be used to ensure data collection is representative of target audiences and a sufficiently high quantity of surveys is used to be statistically significant. Adding questions to existing data collections could provide a cost effective and reliable way of collecting data.

3.2 Secondary Data

Secondary data sources could be used to measure the legacy effect of major events (in addition to, or instead of, primary data). Secondary data sources can be grouped into two broad categories: National surveys and administrative datasets. The table below presents a non-exhaustive list of data sources, identifying which ToC they are relevant to. For more detailed information on each dataset, please see Appendix 2.

Table.2: Suggested secondary data sources

National Data	Cultural and social	Economic and Employment	Health and Wellbeing	Environmental Responsibility and Accessibility	Reputation
Continuous Household Survey (NI)	✓	✓	✓
International Passenger Survey		✓

Regional Data	Cultural and social	Economic and Employment	Health and Wellbeing	Environmental Responsibility and Accessibility	Reputation
Taking Part Survey (now the Participation Survey)	✓		✓
Community Life Survey	✓	✓	✓
HMRC Regional Trade Statistics		✓			✓
GB Tourism Survey					✓
Health Board (Scotland)				✓
Scottish Health Survey			✓
Labour Force Survey		✓
Great Britain Data Survey	✓	✓
ONS Business Register and Employment Survey		✓
National Travel survey	✓	✓

Local Authority Data	Cultural and social	Economic and Employment	Health and Wellbeing
Scottish Household Survey	✓	✓	✓
National Survey for Wales	✓	✓	✓
Active Lives Survey/ Active People Survey	✓		✓
VOA Ratings List		✓
DLUHC Permissions, Starts and Completes		✓
National Pupil Database	✓

LSOA Data	Cultural and social	Economic and Employment	Health and Wellbeing	Environmental Responsibility and Accessibility	Reputation
ONS GVA Estimates		✓
Indices of Deprivation		✓	✓

Output Area Data	Cultural and social	Economic and Employment	Health and Wellbeing	Environmental Responsibility and Accessibility
DWP Benefits Database		✓
Understanding Society (including BHS)	✓	✓	✓
Business Structure Database		✓
National Travel Survey				✓
Census Data		✓		✓

Postcode Data	Cultural and social	Economic and Employment	Health and Wellbeing	Reputation
Business Register and Employment Survey		✓
Annual Survey of Hours and Earnings		✓
Annual Population Survey	✓	✓	✓
Annual Business Survey		✓		✓

Household Data	Cultural and social	Economic and Employment	Health and Wellbeing	Environmental Responsibility and Accessibility	Reputation
Land Registry HP Data		✓	✓

3.3 Data Accessibility

The ONS Secure Research Service (SRS) may be used to access many of the datasets at low levels of geography (i.e. LSOA level and below). The ONS SRS can be accessed by Accredited Researchers, where each project must be approved to utilise the SRS data. Given the breath of outcomes that a major event is likely to span, there are several practical considerations that need to be made when submitting an SRS application to access data^{[footnote 9]}:

It is currently not possible to combine some datasets, namely the Annual Population Survey (APS) and Business Structure Database (BSD). For example, the APS is not available as a postcode-free version, and the BSD cannot be combined with postcode level data due to potential disclosure risks.
Given how far the impacts of a major event are expected to reach (both economic and social) – a magnitude of data sources are required to explore the full array of impacts. This poses a practical challenge as all external datasets need to be ingested by the ONS which can take time. We therefore suggest that critical datasets are ingested into the SRS but where possible, analysis is undertaken outside the SRS.

Question 5

4. Specific issues in evaluating legacy impacts

Accepted Answer

Evaluating the long-run (or legacy) impacts of major events is challenging for a number of factors, and these factors become ever more important the further past the event the evaluation attempts to measure outcomes. Essentially, these factors are the same as when conducting a short-run impact evaluation, but they become relatively more of an issue the longer the evaluation period is.

4.1 Choice of Counterfactual

The choice of counterfactual becomes increasingly important the longer the evaluation extends the event. We typically choose a counterfactual location that is as similar as possible (in terms of observables, and thereby, unobservables) to the treatment location. This often results in the counterfactual location having the same time trend in outcomes as the treatment location, though in most cases of short-run impact evaluations we do not require both to be on the same level (i.e. we allow for bias as long as that bias follows the same time trend between treatment and control location). This is mostly good enough for evaluating the short-run impacts of major events but may lead to a deviation in time trends the further events are in the past.

For example, when evaluating the short-run impacts of London 2012, Berlin may not be on the same level in outcomes as London but may follow the same time trend over a short period of time, and hence may lend itself as a suitable short-run counterfactual. However, over longer evaluation periods, differences in levels at the beginning are likely to lead to deviations from the common trend over time.

To evaluate the long-run (or legacy) impacts of major events, we need to, ideally, strive to find counterfactual locations that do not only have the same time trend in outcomes as the treatment location but also be on the same level in outcomes pre-treatment. This will reduce the likelihood that counterfactual locations deviate from common trends over longer evaluation periods. In addition, we may wish to control for potential confounders over longer evaluation periods (such as demographics or economic conditions, e.g. local population level or GDP per capita) to enforce a common trend conditionally.

4.2 Selection

The issue of selection becomes more important the further past the event we attempt to measure impact. There are two types of selection: within-sample selection (residential sorting) and out-of-sample selection (attrition):

Within-Sample Selection (residential sorting)

First, within-sample selection occurs if individuals in the treatment location switch to the counterfactual location, and vice versa, which becomes relatively more likely the further events are in the past.

For example, individuals living in the counterfactual location may move to the treatment location (which may become relatively more attractive in case of positive long-term impacts). However, individuals in the treatment location may also move to the counterfactual location. Ultimately, such residential sorting is a function of both preferences, endowments, and the sign and size of potential long-term impacts.

To account for within-sample selection, we need to, ideally, use panel data and restrict our estimation sample to non-movers. If these treatment dummies turn out insignificant, we may be somewhat less concerned about residential sorting. We may also want to replicate this exercise for the direction of the move, using a dummy for moving towards the treatment location and a dummy for moving towards the counterfactual location, respectively. Finally, depending on the direction of the move, we may be able to make a bounding argument, arguing that any remaining within-sample selection may result in a lower-bound estimate of long-run (legacy) impacts.

Out-of-Sample Selection (Attrition)

Second, out-of-sample selection occurs if individuals in either treatment or counterfactual location drop out of our estimation sample. Such selection becomes an issue only insofar as dropping out is correlated with the outcomes, in either location.

For example, individuals living in the counterfactual location may drop out because they feel neglected relative to their peers in the treatment location and may therefore refuse to take part in any follow-up survey. However, individuals living in the treatment location may also drop out because their life circumstances may have changed due to treatment, i.e. the major event.

To account for out-of-sample selection, we need to, ideally, use panel data and restrict our estimation sample to a balanced panel. As an alternative, we may use weighting and weigh our observations by the inverse of the probability of remaining within the estimation sample. Moreover, we may want to test whether our mega event as treatment predicts subsequent attrition, by regressing a dummy that equals one if an individual has dropped out of the sample on treatment dummies, one for each time period in the past. If these treatment dummies turn out insignificant, we may be somewhat less concerned about attrition. Finally, we may be able to make a bounding argument, arguing that any remaining attrition may result in a lower-bound estimate of long-run (legacy) impacts.

It should be noted that it may be an objective of the event to bring new people to an area with new housing and regeneration. In this scenario it would therefore be important to capture the changing local population within the evaluation rather than using the same treatment group of individuals.

4.3 Changes in Administrative Boundaries

Over time, administrative boundaries change. This becomes more likely the longer the evaluation period is, yet is only an issue if areas originally part of the treatment location become part of the counterfactual location, and vice versa. Thereby, most likely, smaller administrative boundaries are consolidated into larger ones, as evidenced in various locations in the recent past.

To account for changes in administrative boundaries over time, especially bias arising in our treatment effects due to such changes, we should always consider those administrative boundaries at the end of our long-run evaluation period, and backtrack them towards the beginning, thereby keeping consistent administrative boundaries throughout.

4.4 Confounding Events

The longer the evaluation period, the more likely it is that there will be a confounding event, i.e. a major event or other form of local investment that occurs in either the treatment or counterfactual location during the years after the original event. Such confounding events can be either related to the original event or completely unrelated (exogenous). Whether the confounding event is related or not is difficult to ascertain empirically, which is why additional, most likely qualitative evidence is required. In any case, the presence of the confounding event in the treatment location renders that location being multiple treated, whereas the presence in the counterfactual location renders that location treated. It will then be difficult if not impossible to disentangle the short-run impacts of the confounding event with the long-run impacts of the original event.

To account for confounding events over time and mitigate risks at least somewhat, it is advisable to sample more than one counterfactual location, tentatively when the evaluation period exceeds five years.

Question 6

5. Toolkits

Accepted Answer

The five toolkits below provide guidance provide detail specific methods.

5.1 DiD Toolkit

5.1.1 What is the technique?

Difference in Differences (DiD) is a research design which seeks to understand the (causal) impact of a programme/intervention on the affected areas or individuals (i.e. areas which hosted a major event or those living in an area likely to feel the impacts of the local event). The DiD achieves this by comparing changes in outcomes over time between areas (or individuals) that were local to the major event and those that weren’t.

In its simplest form, DiD examines the difference in the outcome over time in the comparison group, and subtracts this from the difference in outcome over time in the treatment group (i.e. an area local to a major event).

The key identifying assumption under which DiD produces robust estimates of the treatment effects is parallel trends. This assumption states that, in the absence of the event, differences in the outcome between treatment and comparison groups would have remained constant during the post-event period.

5.1.2 What does is set out to achieve?

The DiD design seeks to identify the causal effect of an intervention (or an event) on its intended beneficiaries. Using the DiD design, observed impacts can (under certain assumptions) be attributed to event that has taken place (at SMS Level 2, or higher with matching, see next toolkit).

There are further extensions of the DiD:

To understand the relationship between impact and the intensity of exposure (i.e. dose-response)
To understand the relationship between impact and the distance from the event (i.e. distance-decay)
Where applicable exploit the timing of the rollout of the intervention through a a staggered DiD. This can be a powerful design as it is in principle robust against selection bias. This design could be used where the event moves around a host city over time (creating several ‘treated units’ at different points in time)
The estimation of the dynamic effects of the event – i.e. can identify a different treatment effect for every year post-event.

5.1.3 What data do you need and where do you get it from?

DiD analysis needs data collected over multiple time points in order to compare changes over time. DiD analysis is best conducted with longitudinal data, meaning data collected over multiple time points for the same sample to identify changes in outcome levels within individuals (using fixed effects models to increase the causal inference of the results). However, longitudinal data can be difficult to obtain. It is also possible to compare average outcome levels between individuals using cross-sectional data which is collected consistently over time from the same population (or area) but with a different sample.

Some suggested data are included below:

The Active Lives (Adult Survey) and its predecessor Active People Survey is a repeated cross-sectional survey of physical activity levels. This data has been successfully used for a variety of evaluations to assess the impact of policies and government interventions on participation in physical activity (see Owen 2023).
The Annual Business Survey (ABS) is an annual survey of businesses, covering the production, construction, distribution and services industries. Longitudinal measures are available for large firms with a limited longitudinal element for SMEs with resampling occurring every few years. The ABS can be used within a DiD analysis to assess how an event has increased local output and productivity.

A non-exhaustive list of other potential data sources can be seen in Section 3 and Appendix 2.

5.1.4 How do you implement the technique?

Organise the Data: In a standard DiD approach, you need to have a clear distinction between the treated and control groups across the (at least 2) time periods: before and after the intervention. The dataset should be organized to capture these two time points adequately. Ensure all relevant variables are included to enable a comprehensive analysis of the changes observed pre- and post-intervention.
Set Up Control and Treatment Groups: Identify and categorise your data based on control and treatment groups. This classification is essential for evaluating the relative impact of the treatment. Ensure that the groups are comparable and no cross-over occurs during the study period. Utilise synthetic controls or propensity score matching as a priority as shown in later toolkits. If these methods are not possible you can use unsuccessful bidders, other event winners or manual selection to identify a relevant counterfactual area. The metrics for these areas can be averaged or weighted to create a sensible counterfactual.
Select Analytical Software and Tools: Standard DiD analysis can be performed using statistical software such as R, Stata, or Python. Packages like lm in R or regress in Stata are sufficient for running basic DiD analyses. These software tools offer various functions to help conduct regression analysis proficiently.
Run Diagnostic Checks: Verify the parallel trends assumption, a critical requirement in standard DiD analysis. It suggests that, absent treatment, the average change in the outcome for the treatment group would have been the same as for the control group. Consider pre-treatment trends and visual inspections to ensure validity.
Estimating the Model: Formulate a regression model that includes interaction terms to evaluate the treatment effect. The coefficient of the interaction term will reflect the treatment impact. It’s essential to ensure your model is adequately specified and includes potential confounders.
Interpretation of Results: Interpret the coefficient of the interaction term to understand the difference in change over the two time periods between treated and control groups. This effect represents the causal impact of the intervention. Clearly articulate the findings and potential limitations in your reports.

5.1.5 Where and when has it been used in our field previously?

Staggered DiD: Callaway and Sant’Anna (2021) propose a non-parametric staggered DiD estimator, designed to compute different treatment effects for each group of units that started treatment (or rather held the event) at different start dates using a pipeline design. The group-specific treatment effects can be averaged to estimate an overall treatment effect.

Fixed Effects DiD: Srakar and Vecco (2017) explore the economic effects of the 2012 European Capital of Culture (ECoC) in Maribor. They utilise a DiD framework (estimated by fixed effects, as well as random effects models) to estimate the impact the ECoC on tourism through the number of visits and overnight stays and the effects employment through the number of new jobs created.

5.1.6 What are the pros and cons?

A major benefit of using DiD is the relative simplicity in comparison to other methods. Due to its relative technical simplicity it is easier to implement and also more intuitive to explain. This makes it a more accessible method to implement and a suitable method to use in research for non-technical audiences.

However, DiD relies on the assumption that the treatment and control groups would have followed the same trends in the absence of the major event (parallel trends). If this assumption does not hold, the estimated impacts do not represent causal estimates of impact.

5.1.7 When is it appropriate to use it?

Difference in difference is a flexible and well-used method, suitable for a range of circumstances with either cross-sectional or longitudinal data. There are conditions where DiD becomes difficult and therefore a different approach may be more appropriate:

If using longitudinal data to measure the impact on individuals or households, it is important to check if there has been significant population change post event. If the population within the target area is particularly unstable, this risks missing the impact on people who are new to the area and including populations who no longer live in the target area.

5.1.8 How was it used in the London 2012 Games Case study?

The treatment and comparison areas were used within a difference-in-difference (DiD) framework. This sought to make comparisons between treatment and control areas, before and after the Games.

The DiD design was estimated using fixed effect regressions. We used both a binary treatment variable and an event study. The binary treatment specification estimated an aggregate measure of impact after the Games. The event study interacts the treatment variable and a binary year variable to estimate year specific effects. The benefit of the event study is that it shows how the magnitude and significance or the impacts change over time, and accounts for temporal variability in the impacts.

Validity of the Design: There are diagnostic practices that were used to assess the quality of the match and validity of the DiD (i.e. the parallel trends assumption). These checks should be undertaken for each means of constructing a counterfactual (i.e. PSM or synthetic control), prior to estimating the DiD:

Assessing the quality of the match: The quality of the match generated through the matching algorithm was assessed using standardised mean differences (SMDs). SMDs are a measure of the size of the difference between two groups; and are calculated by dividing the difference in means of the two groups by the pooled standard deviation of the groups. Following best practice, an SMD greater than 0.1 denotes meaningful imbalance in the baseline covariates. Balance tests undertaken for this case study indicate that a good quality match was achieved, suggesting that pre-treatment and comparison areas share similar levels of pre-intervention matching variables.
Assessing parallel trends: The event study plots the extent to which the treatment and comparison areas differed prior to the Games. Across all the event study regressions, the vast majority of pre-intervention periods are statistically insignificant, indicating that prior to the Games, the outcome indicators in treatment and comparison areas shared similar trends. This cannot directly confirm that in the absence of the Games the treatment areas would continue to trend in the same way as the matched control areas. However, it does provide confidence that the treatment areas would have likely trended in this way in the absence of the Games.

5.2 Event Series Counterfactual Toolkit

5.2.1. What is the technique?

The event series model uses future events that occur in a series to expand the areas used in the counterfactual. Areas that have won or were unsuccessful applicants in the future provide a useful comparison as they are likely to have similar characteristics to the event of interest but have not yet received the treatment.

5.2.2 What does is set out to achieve?

The primary aim is to expand the number of counterfactual areas used. Increasing this sample improves the robustness of the approach.

5.2.3 What data do you need and where do you get it from?

This approach only applies where the event is part of a series which changes location (unsuccessful applicants can only be used in the counterfactual if part of a competitive process). As with the other counterfactual approaches. Consistent data is need across all areas and the timeline.

5.2.4 How do you implement the technique?

Prepare your data. Ensure that your dataset includes all control areas identified.
Construct the counterfactual. For each time period calculate the average value of your outcome variable across all control areas. This gives you a single “average control” for each period. When a control area becomes a “treated” area i.e it is announced as an event winner. It should be removed from that time period and any future time period. You could consider using the start of the event itself as the start of the treatment, but this is not the most robust approach.
Estimate Differences in outcomes between the treatment and control group. This follows the approach set out in the Difference in difference toolkit.

5.2.5 What are the pros and cons?

Provides increased robustness above using only the unsuccessful applicant areas from the event year. However, when you start to plan your evaluation it is likely you will not know the future winners and therefore it is not clear if the data will be available. We therefore suggest using secondary data available nationally to be sure the data is available for all future control areas.

5.3 Propensity Score Matching (PSM) Toolkit

5.3.1 What is the technique?

Propensity score matching (PSM) is the process of matching treated and non-treated units based on the estimated probability that each unit would have received the treatment.

In the context of evaluating major events, treatment and control area would be broken into geographical units (informed by available data and expected impacts). Each geographical unit from the treatment group is then ‘matched’ with members of the comparison group that share a similar estimated probability of being closely aligned to the event host city (i.e. the propensity score). The propensity score can be estimated using either a probit or logit model.

A comparison group can then be established by matching each treated geographical unit with one (or more) untreated geographical units that have similar values in terms of their propensity score. Units that do not return a match are not included in the analysis.

The outcomes of the treatment and matched comparison units can then be compared over time.

5.3.2 What does is set out to achieve?

PSM seeks to ensure that the observable characteristics of the treatment group are as similar as possible to the observable characteristics of the comparison group, by using the observable characteristics are used to estimate the propensity score.

It is often difficult to find a control area that is comparable to the target area. By matching areas with similar propensity scores, PSM creates groups that are more comparable in terms of these observed characteristics. This helps to reduce the bias that can arise from differences in these characteristics between the groups.

Having comparable treatment and control areas helps reduce the effect of confounders (when a variable is related to both the treatment and the outcome) by ensuring that the treatment and control groups are similar on the observed confounding variables.

PSM can also be a valuable tool to enhance the reliability of DiD analysis by helping to create more comparable treatment and controls (pushing the DiD up to SMS Level 3).

5.3.3 What data do you need and where do you get it from?

For PSM to be successful, a variety of variables which provide insight on the similarity between areas is needed. Not using enough variables or using variables which do not provide helpful information to inform similarity between places in the context of major events may result in areas being matched which are not necessarily comparable.

The matching variables used is likely to differ according to the outcome that are being evaluated, the researcher should therefore consult the relevant literature and use their own intuition and experience to identify variables which are likely to influence the outcomes of interest.

5.3.4 How do you implement the technique?

PSM can be implemented using the following approach:

Prepare your data: Ensure that your dataset includes all necessary covariates that influence the selection into treatment. This step involves data cleaning and variable selection to model the propensity scores accurately.
Estimate a propensity score for each ‘unit’. This is the probability that the unit would be in the treatment group, conditional on observed characteristics. Either a logit or probit model can be used. You can use logistic regression or other appropriate models in software like R (using the MatchIt package) or STATA (using the psmatch2 command)
Establish a comparison group. Match each treated unit with one (or more) untreated units that have similar values of the propensity score in Step 1. Untreated units that are not matched are excluded from the analysis. Various matching techniques like nearest neighbour, caliper, or exact matching can be employed to create comparable groups. Neighbour matching provides the simplest approach which finds the nearest match with no additional constraints.
Assess the quality of the match by comparing SMDs across matching variables (covariates). There are two different approaches that can be taken once the matched sample is identified and corresponding weights estimated:
Estimate Differences in outcomes between treatment and matched comparison group. Use the weights for each unit that are estimated as part of the matching algorithm. This acts as a ‘weighted difference’.
Combine PSM with DiD to increase the robustness of the analysis by controlling for observable and unobservable confounding effects. PSM alone can only account for biases due to systematic differences in observable characteristics – combining the propensity score with DiD increases the robustness by controlling for unobservable time-invariant confounders.

5.3.5 Where and when has it been used in our field previously?

Falk and Hagsten (2017) utilise a PSM-DID approach to explore impacts of the European Capital of Culture (ECoC) on the level of overnight stays between 1998 and 2014.
A comparison group of non-ECoC cities (selected through the use of propensity score matching) is used to inform the counterfactual scenario (i.e. what would have happened had the ECoC cities not hosted the ECoC).
Matching variables: population, presence of an airport, sea boarder, UNESCO world heritage site, mediterranean climate zone, university listed in the Times higher Education ranking, capital city.

The results indicate that hosting the ECoC leads to a short term increase in tourism demand, approximated by the number of overnight stays, however this increase is not sustained in the long term. The average treatment effect of hosting the ECoC event is 8% (equal to around an additional 40,000 overnight stays) and significant at the 99% confidence level for the years 1998 to 2014 – however the positive effect diminishes the year after the event – indicating a temporary boost in tourism.

The Authors also explore city-specific impacts, finding that of the 27 ECoC’s, long-term increases in tourism demand was only detected in four cities (Essen, Guimaraes, Salamanca and Tallinn). Industrial cities gain the least, or even suffer, from hosting the ECoC (negative effects are observed for Rotterdam, Liverpool, Genoa, Stavanger and Marseilles).

Limitations: Firstly, the number of overnight stays is a narrow definition of tourism demand (but data limitations meant that overnight stays presented the best common measure amongst the sample). Secondly, spillover into local neighbourhoods not investigated – or the extent to which demand is merely displaced from other areas of the country (limiting the extent to which we can truly call the increase in tourism ‘additional’).

5.3.6 What are the pros and cons?

PSM only addresses observed covariates; it cannot account for unobserved variables that might still bias the results. The effectiveness of PSM relies heavily on the chosen matching algorithm. Different algorithms can yield varying results, requiring careful consideration and sensitivity analyses.
The matching process can lead to a significant reduction in sample size, as control areas without suitable matches to the treatment are excluded. This can impact the statistical power of the analysis and the generalisability of the findings.

5.3.7 When is it appropriate to use it?

If you have a rich dataset with numerous variables that could influence how an area hosting an event could impact the outcomes of the event, PSM’s ability to match areas based on these similarities becomes advantageous.

5.3.8 How was propensity score matching used in the London 2012 Games Case study?

The absence of a spatially defined target area meant it was difficult to identify a point at which we would reasonably expect the economic impacts of the Games to be less likely to reach. To somewhat mitigate this, we created an exclusion zone. Different radii are experimented with (from within London comparisons to 40km). A larger exclusion zone carried the benefit of a decreased likelihood of benefits contaminating comparison areas. However, it also increased the risk that comparison areas did not resemble the treated areas –particularly important when considering the uniqueness of London. This may pose a threat to the sample size, however for this case study, at low levels of geography (i.e. LSOA), this was not anticipated to affect the analysis.

Given the uniqueness of London, we also experimented with restricting the potential comparison areas to the LSOAs in non-host boroughs. This ‘within-London’ comparison tested for differential impacts in terms of changes in outcomes between East London, and the rest of London.

Figure 6: Comparison areas for used for London 2012 Games Case Study

Propensity score matching algorithms were used to identify the LSOAs (outside the exclusion zone) which shared similar characteristics with those close to the event, prior to the Games taking place (i.e. between 2005 and 2011). The matching variables included:

Table 3. Matching variables used for London 2012 Games

Indicator	Years	Reason for inclusion
Share of tourism employment (% of total jobs)^{[footnote 10]}	2005 - 2011	Identifies the importance of the tourism sector within the local areas
Share of manufacturing employment (% of total jobs)	2005 - 2011	A simple measure of the industrial structure of each area
Total employment	2005 - 2011	A proxy measure of the economic output in the area
Share of tourism revenue (% of total revenue)	2005 - 2011	Identifies the financial significance of tourism in the local area
Share of manufacturing revenue (% of total revenue)	2005 - 2011	Identifies the financial significance of the manufacturing industry in the local area
Total revenue	2005 - 2011	A proxy measure of the economic output in the area
Total number of firms	2005 – 2011	Provides an indication of economic activity in the area and potential employment opportunities.
Gross (median) weekly earnings	2005 - 2011	A measure of the underlying productivity of the local economy

Areas that were successfully matched were considered the counterfactual group of areas in which the treatment areas were compared against.

This analysis uses a ‘nearest neighbour without replacement’ matching algorithm. This meant that each LSOA near the Olympic Park was paired with a similar LSOA further away (beyond 30km), ensuring a direct comparison without needing to adjust for other factors. This approach allowed us to use the existing longitudinal weights from the Understanding Society dataset. Other matching algorithms might be considered depending on the specific econometric analysis and outcomes being explored.

5.4 Synthetic Control Toolkit

5.4.1 What is the technique?

The synthetic control method offers a unique approach to constructing a counterfactual in scenarios with a single (or very few) treated units. The underlying principle is that a ‘blend’ of control areas enables more meaningful comparisons to be made than a single control area.

Instead of matching individual units like in propensity score matching, the synthetic control method creates a single “synthetic” unit that closely mirrors the pre-event trends of the treatment area.

For example, if you want to estimate the economic impact of a city hosting an event, instead of comparing the host city to other similar cities, the synthetic control method would combine data from multiple cities, weighting their contributions to create a hypothetical “synthetic” city that closely resembles the host city before the event. This synthetic area becomes the counterfactual, representing what would have likely happened in the host city without the event.

5.4.2 What does it set out to achieve?

The synthetic counterfactual is like creating a blend of outcomes from a group of similar cases, called the donor pool. Because the outcomes are influenced by things that cannot be observed, it is assumed that cases with similar outcomes before the event are also likely to be similar in ways that can’t be seen. When this blend of cases is used to stand in for the case that is being studied before the event happened, unforeseen circumstances are being balanced out.

5.4.3 What data do you need and where do you get it from?

Extensive longitudinal (or repeated cross-sectional) data is required for constructing a synthetic control. The rule of thumb is 10 periods of pre-treatment data to estimate a credible synthetic counterfactual.
This includes data on the outcomes of interest, as well as other ‘predictor’ variables that may determine the level of outcomes.

5.4.4 How do you implement the technique?

To estimate the synthetic counterfactual:

Select a pool of regions that could potentially serve as comparison points for the treatment area. These regions should share some similarities with the treatment area but should not be directly affected by the event.
Gather data on relevant indicators for both the treatment area and the potential control regions. This should cover approximately 10 periods before the event.
Using the statistical model assign weights to each control region. This involves employing convex optimization techniques to ensure that the weights minimize the discrepancies between the pre-event trends of the control and treatment areas. Specifically, weights are based on how closely the pre-event characteristics of control areas match those of the treatment area, forming a synthetic control that represents a hypothetical region mirroring the treatment area’s pre-event trajectory. You can utilize the Synth package to conduct this analysis in R.
There are two different approaches (a & b)that can be taken once the synthetic counterfactual is identified; (a) Differences in outcomes can be directly estimated between treatment unit and the synthetic counterfactual. (b) The synthetic counterfactual can be combined with DiD to increase the robustness of the analysis (SDiD) Arkhangelsky et al. (2021) describe how the SDiD estimator combines attractive feature of both the synthetic control and DiD: using the synthetic control to re-weight and match pre-exposure trends to reduce the reliance on the parallel trends assumptions and using DiD to remove unit level and time invariant effects. The SDiD can be implemented by using the synthetic control to estimate the weights for each outcome variable, where the weights are used in a two-way fixed effects (event study) regression.

5.4.5 Where and when has it been used in our field previously?

Zhang, Zhong and Yi (2016) utilise the synthetic control (SC) to explore the impacts of several environmental policies on the air quality around Beijing, introduced specifically for the 2008 Beijing Games. The environmental policies included measures such as driving restrictions, relocation and closure of heavy pollutant enterprises and coal desulphurisation. The Authors use air pollution days as the measure of air quality. An air pollution day is defined as a having an air quality index greater than 100, and is the point at which a persons health can be affected when taking part in outdoor activities. The environmental policies were introduced only in Beijing, allowing the Authors to construct a ‘synthetic Beijing’ constructed of other cities within China (which would provide a better comparison than any single comparison city).

To construct a synthetic Beijing, weights were assigned to each potential comparison city estimated using a series of predictor variables: industrial output, investment in environmental governance, number of industrial furnaces, coal usage, Greenland per capita, a dummy variable for whether the city is in the north of China, and past levels of pollution days.

Zhang, Zhong and Yi found that in the years following the 2008 Beijing Games, air quality improved, where annual pollutions days reduced the number of pollution days by 12 and 36 in 2009 and 2010 respectively. After 2010, the policy effect gradually decreased, before air quality reduced to pre-Games levels in 2012 – indicating there were no long-term improvements. The Authors rationalise these findings, explaining than many of them were only intended to be short-term policies with many policies reversed after the games.

5.4.6 What are the pros and cons?

The synthetic control is suited to undertake causal analysis in situations where there are small sample sizes. Given the small sample sizes, the synthetic control offers a transparent approach to constructing a counterfactual (as the contribution of each unit can be seen from their respective weight). When combined with DiD, the synthetic control can reduce reliance on the parallel trends assumption whilst simultaneously accounting for unobservable unit and time effects.
However, the synthetic control requires an extensive amount of time series data to generate a robust synthetic counterfactual.

5.4.7 How was synthetic control used in the London 2012 Games Case study?

In the London 2012 Games case study a synthetic control was used as an alternative econometric approach to estimate the impact of the Games. The synthetic control is best suited when there is a single treated unit. As such, the level of the analysis was changed from LSOA, to Local Authority (LA). This analysis compared the outcomes of the LA in which the Games were held (the London Borough of Newham) to a ‘synthetic Newham’, informed by LAs whose centroids were more than 30km away from where the Games were held.

Question 7

Appendix 1: Theories of change

Accepted Answer

Figure 7: Cultural and Social Impact

Figure 8: Economic and Employment Impact

Figure 9: Reputation

Figure 10: Health and Wellbeing

Figure 11: Environmental Responsibility and Accessibility

Question 8

Appendix 2: Secondary Data

Accepted Answer

This section explores how the quantitative outcomes identified through the ToC could be measured for the purpose of a counterfactual impact evaluation of major events hosted within the UK.

Data sources are typically sources from either national surveys, or administrative data sources.

National Surveys

The use of appropriate secondary data sources would be required to underpin a quantitative counterfactual evaluation of major events held in the UK. Monitoring data collected by event organisers would be important for a value for money assessment (cost-benefit analysis or 3 E’s assessment) – however will not provide the required information in both treatment and counterfactual areas to enable comparisons and identify impacts attributable to the major event.

The following sub-sections set out a non-exhaustive appraisal of data sources that could be used to explore the impact and legacy of major events.

Continuous Household Survey

The Continuous Household Survey is a repeated random cross-sectional survey carried out in Northern Ireland, conducted by the Northern Ireland Statistics and Research Agency (NISRA). The survey has been running since 1983 to present, and data from 1983/84 – 2017/18 is available through the UK Data Service.^{[footnote 11]} Documentation suggests that the dataset only provides information at the national level, with a sample size of c. 5,500 households.

Specific to this evaluation framework, the Continuous Household Survey collects data on:

Demographic/ economic characteristics
Physical and mental health
Sports participation – including days per week and membership to sports clubs
Participation in arts/ cultural/ heritage activities
Live events attendance (in both sport and the arts)
Volunteering
Attitudes towards environmental issues

Scottish Household Survey

The Scottish Household Survey is a repeated cross-sectional survey carried out in Scotland, conducted by Ipsos UK. The UKDS holds time series data between 1999 and 2019.^{[footnote 12]} The dataset is representative at the local authority level, and has an underlying sample size of c.10,000 households. The Scottish Household Survey collects data on the following:

Demographic/ economic characteristics
Physical and mental health
Attitudes towards climate change
Satisfaction with local services/ attitudes towards local community
Sport participation (including frequency)
Participation in the arts and cultural/ heritage activities
Volunteering
Trust in the Government
Recipient of harassment/ abuse/ discrimination

National Survey for Wales

The National Survey for Wales is a multi-stage stratified random repeated cross-sectional survey carried out in Wales, conducted by the ONS. The UKDS contains time series data between 2012/13 and 2022/23.^{[footnote 13]} The extent of the time series data may limit the use of this as a viable data source to conduct the case studies that accompany this evaluation framework as a sufficient number of pre- and post- outcome periods would be unlikely to be included in the time series. However, this remains a viable data source for future evaluation of major events that explores the legacy and long-term impacts. The National Survey for Wales collects data on the following:

Demographic/ economic characteristics
Physical and mental health (including subjective wellbeing)
Attitudes towards climate change
Satisfaction with local services/ attitudes towards local community
Sport participation (including frequency and intensity)
Participation in the arts and cultural/ heritage activities
Volunteering

Active Lives Survey and Active People Survey

The Active Lives (Adult Survey) and its predecessor Active People Survey is a multi-stage stratified random repeated cross-sectional survey. Combining both surveys, creates time series data dating back to 2005 on sport participation across England. Individual level data is available for both data sets, and are representative at the local authority level – with a sample size of c.170,000 adults. Data is accessible through the UKDS.^{[footnote 14]}

Information collected that is of particular relevance to this study includes:

Physical activity levels
Participation (including frequency and intensity)
Volunteering (including volunteering specific to sport)
Latent demand for sport(s)

DCMS Taking Part Survey (now the Participation Survey)

The Taking Part Survey is a multi-stage stratified random repeated cross-sectional survey, collecting data on engagement with leisure, sport and culture in England from 2005 to 2020. The Taking Part Survey is available through the UKDS.^{[footnote 15]} The sample size is c. 7,500 adults and also includes c.1,500 children – and is representative at the Government Office Regions Level. The Taking Part Survey includes data on:

Engagement with the Arts/ culture/ heritage
Satisfaction/ enjoyment from sport and culture
Volunteering
Subjective wellbeing

Scottish Health Survey

The Scottish Health Survey was first established in 1995 and provides time series data up to and including 2021, accessible through the UKDS.^{[footnote 16]} The Survey is representative at the Health Board level and has a sample size of c. 9,000 households. Whilst the survey typically varies year on year, there are a set of core questions repeated across all waves. The relevant items are presented below:

Loneliness
Physical and mental health
Physical activity (including frequency and duration)

DCMS Community Life Survey

The Community Life Survey is an annual household multi-stage stratified random sample repeated cross-sectional survey in England. The Community Life Survey has a sample size of c.10,000 and is representative at the Government Office Regional level. The Community Life survey consists of time series data ranging from 2012/13 – 2022/23. It is likely that this data source will not provide sufficient pre- and post- event data to explore the case studies that accompany this evaluation. However, the Community Life Survey would likely be a useful data source for the evaluation of future events.

The Community Life Survey contains relevant information on:

Economic/ demographic characteristics
Attitudes towards neighbours/ the local community
Volunteering
Donating to charity
Subjective wellbeing

Understanding Society and British Household Panel Survey

A UK longitudinal household survey, designed to understand the short- and long- term effects of social and economic change in the UK. Individual responses for Understanding Society is available through the ONS Secure Research Service (SRS)^{[footnote 17]}, where data can be aggregated at the OA level. The Understanding Society panel started with a sample size of 40,000 households (including 8000 of the original British Household Panel Survey households). The relevant data the Understanding Society collects includes:

Demographic/ economic characteristics
Sport participation (including frequency)
Perceptions of local neighbourhood
Volunteering
Subjective wellbeing

Annual Population Survey

The Annual Population Survey (APS) is a quarterly survey of the workforce in the UK. The survey samples c.320,000 individuals per year (with everyone sampled over four consecutive quarters). Data is available at the individual level through the ONS SRS, and data on the respondent’s postcode is provided. The APS contains data back to 2004. The APS collects labour market data (dating back to 1997) that may be relevant to evaluate economic and employment impacts associated with major events:

Education levels
Economic activity
Income
Benefits (e.g. housing, tax credits, Jobseekers Allowance, Pension benefit, etc.)
Physical health
Subjective wellbeing

GB Tourism Survey

The GB Tourism Survey (and its predecessor UK Tourism Survey) provides data on the volume and value of domestic overnight tourism trips taken by British residents. The GBTS provides data from 2006 – 2019, and reports statistics at the regional level. The GBTS can be accessed through an online data platform^{[footnote 18]}, and provided data on the following:

Purpose of trip
Region visited
Type of place visited
Accommodation used
Region of residence
Month and quarter of trip
Number of nights stayed
Expenditure

National Travel Survey

The National Travel Survey (NTS) is a series of household surveys that capture data on personal travel and changes in travel behaviour over time. The NTS contains data going back to 2002, and provides data at Output Area level – however DfT advise that this data is only robust above the regional level. The NTS can be accessed through the ONS SRS, and contains detailed data on:

Levels of active travel
Satisfaction with public transport
Usual place of work (including travel time)
Condition of roads and pavements
Journey time to schools, hospitals, post offices etc.

Annual Business Survey

The Annual Business Survey (ABS) is an annual survey of businesses, covering the production, construction, distribution and services industries. Individual firm-level data is available, as well a corresponding postcode for each firm. The ABS acts as a mandatory census of all large firms with over 250 employees – meaning longitudinal measures are available for large firms. However, SMEs and micro businesses contain a limited longitudinal element – with resampling occurring every few years. The ABS contain data on the following:

Firm productivity
Firm industry classification
Total turnover
Approximate GVA
International trade (imports and exports)

Annual Survey of Hours and Earnings

The Annual Survey of Hours and Earnings (ASHE) is an annual survey that is drawn from a simple random sample of 1% of all employees from HMRC’s PAYE records, yielding a sample size of c.156,000. The AHSE is available at the individual level ONS SRS, which includes data on the respondent’s postcode. The sample includes a longitudinal component, allowing individuals to be tracked over time – where approximately 70 to 80% of respondents are retained in the sample year on year. The key piece of information that can be taken from AHSE is the wage rate of the individuals in the survey – which can be used to see how wages in areas around events change overtime compared to comparison areas. The ASHE also includes Standard Industrial Codes which will allow the wage impacts on specific industries (e.g. tourism) to be explored overtime.

Business Register of Employment Survey

The Business Register of Employment Survey (BRES) published detailed employee and employment estimates at detailed geographical and industrial levels, including those within the tourism sector which is of particular relevance to this Evaluation Toolkit. The dataset contains firm-level responses (of c. 85,000) along with corresponding postcodes. The data can be accessed through the ONS SRS.

Administrative Datasets

Indices of deprivation

The English Indices of Deprivation (IMD) measure the relative deprivation in Lower Super Output Areas in England. The IMDs cannot be used to track the specific level of deprivation of an LSOA over time, rather than can be used to understand the relative level of deprivation over time – i.e. using the ranking (or decile) opposed to the score. This is particularly important given that the approach to estimating the IMD changes year on year – although the fundamental methodology and principle of the IMD remains the same meaning rankings are comparable. The IMD is updated sporadically: 2019, 2015, 2010, 2007, 2004, 2000. As such, the IMD may not necessarily be useful as an outcome indicator, rather it would be more useful in a matching model for ensuring conditional parallel trends.

DWP Benefits Database

The DWP Benefit Database (publicly accessible through Stat-Xplore^{[footnote 19]}) contains administrative data on benefits administered by the DWP in England. This includes all jobseekers related benefits, which can act as an indicator of slack in the local economy. The data is available at Output Area (OA), however it should be noted that counts less than 5 are suppressed to prevent disclosure. Further scoping work will be required to fully understand the lowest geographic level that can be used without suppressed data affecting the analysis. The database is updated quarterly and at the time of writing contains data from 1993 – May 2023.

VOA Ratings Listing

The Valuation Office Agency (VOA) publishes data on the commercial floor space and the rateable value per m2 of the floor space, aggregated at the local authority level, from 2000/01 – 2019/20. This provides a measure of commercial activity, and a proxy for commercial rents (acting as an indicator of the underlying productivity of an area).

DLUHC Permissions, Starts and Completions

The Department for Levelling Up, Housing and Communities (DLUHC) publishes data on the number of permanent dwellings starts and completions, including net additional dwellings and affordable housing sales. The data is published quarterly, dating back to 1980/81. The data is published at local authority level and can be openly accessed online.^{[footnote 20]}

Land Registry House Price

Land Registry provides data on all housing transactions registered with Land Registry dating back to 1995. The dataset includes price paid at the household level, and includes some minor characteristics such as property type (e.g. detached, semi-detached, etc.). The data is freely available online.^{[footnote 21]}

National Pupil Database

The National Pupil Database (NPD) contains a range of information about students in England, combining several different datasets including Key Stage Attainment Data, the Schools Census Data and examination results. The NPD contains data going back to 2002 and is available at the local authority level. The dataset contains information on educational attainment, and absence and exclusion data over time.

HMRC Regional Trade Statistics

HMRC publish data quarterly data on the UK’s international trade in goods and services. The data dates back to 2008 and is at the national level and can be broken down by Standard Industry Classification code. The data can be openly accessed online.^{[footnote 22]}

ONS GVA Estimates

The ONS publish annual small area (LSOA) Gross Value Added (GVA). The GVA estimates date back to 1998 and are available through the Nomis data platform.^{[footnote 23]} The Nomis data platform also gives the option to obtain the data at higher levels of geography including local authority or regional.

Business Structure Database

The Business Structure Database (BSD) provides an annual snapshot of the Inter-Departmental Business Register, covering all firms registered for PAYE and VAT (representing approximately 98% of economic activity in the UK). The BSD provides annual data, dating back to 1997, and can be accessed at the firm level (alongside the corresponding Output Area code) through the ONS SRS. The BSD provides the following metrics that may be relevant to an evaluation of major events:

Sector level changes in employment and revenue
Standard Industry Classification codes
Employment (site level)
Turnover (enterprise level)

There are several known limitations of the BSD, namely:

Data lags: The BSD is constructed using several data sources (PAYE and VAT returns, the Annual Business Survey or Business Register of Employment Survey returns). This means that the BSD is updated at different points of time depending on when the information arrives from other sources. This means that some of the records can be up to two years out of date. However we do not expect this to pose a significant challenge for an evaluation of legacy and (long-term) impacts of major events as sufficient time would have passed for relevant data to have been gathered.

Turnover data: turnover data is only captured at the enterprise level. This means that where firms have multiple sites, it must be assumed that there are equal levels of productivity across all sites

Productivity: It is only possible to derive proxy measures of productivity (i.e. turnover per worker)

Question 9

Appendix 3: Additional Technical Background

Accepted Answer

Difference-in-Difference

In its simplest form, the DiD estimator examines the difference in the outcome over time in the comparison group, and subtracts this from the difference in outcome over time in the treatment group (i.e. an area local to a major event):

Figure 12: Difference-in-Difference Calculation

Where Y represents the mean outcome of interest for either the treatment of comparison group, pre- or post-event; and the estimated impact of the major event on the outcome of interest would be τ. The differencing of outcome measures over time and between groups removes both time invariant and unit specific effects, partially limiting the extent to which unobservable characteristics introduce bias into the estimates.

DiD requires the use of longitudinal (or repeated cross sectional) data - Section 3 and Appendix 2 set out potential data sources which can be used to underpin a DiD research design.

The key identifying assumption under which DiD produces robust estimates of the treatment effects is parallel trends. This assumption states that, in the absence of the intervention, differences in the outcome between treatment and comparison groups would have remained constant during the post-intervention period. An advantage of the event study model is that is allows the plausibility of the parallel trends assumption to be tested. The coefficients of the pre-intervention periods (τ_{t) should be individually and jointly indifferent to zero, which would suggest that there were no significant differences between in the trend of the outcome of interest prior to the major event taking place.}

Fixed Effects Difference-in-Difference

The DiD estimator is typically incorporated into a regression framework and estimated using (two-way) fixed effect techniques:

Figure 13: DiD Fixed Effect calculation

Where Y_it is the outcome of interest for unit i in time period t; a_i are time-invariant (or unit specific) effects; b_t are unit-invariant (or time specific) effects; D_it is a binary indicator of unit i’s assignment into the treatment group; and ε_it is the error term. The estimated impact on the outcome of interest of hosting the major event, compared to the comparison areas, would be τ.

When there is expected to be heterogeneity of the treatment effects over time (i.e. the ‘effect’ decays over time, as would be expected in the case of the impact of major legacy events), the DiD estimator can be incorporated into an ‘event study model’. This allows the estimation of the dynamic effects of the event – i.e. can identify a different treatment effect for every year post-event:

Figure 14: DiD estimator incorporated into an ‘event study model’

Where Y_it is the outcome of interest for unit i in time period t; a_i are time-invariant effects; b_t are unit-invariant effects; D_it a binary indicator showing the assignment of unit i into the treatment group at time g; year_k is a binary variable for each year in the analysis period, ε_it is the error term, which captures unexplained variability in the data that is not accounted for by the independent variables.

The coefficients of interest capturing the dynamic impacts of the event (which occurred at time g) are τ_t>g. If the omitted year in the regression is the last pre-intervention year, these coefficients represent the difference in outcomes between year r and the last pre-intervention year, hence the ‘trend’ of the post-intervention effect.

Staggered Difference-in-Difference

In the case where the event moves around a host city over time (creating several ‘treated units’ at different points in time) a staggered DiD set up could be deployed. This set up is robust to potential biases arising from comparisons between later and earlier treatment groups, and from heterogeneous treatment effects across groups.

Callaway and Sant’Anna (2021) propose a non-parametric staggered DiD estimator, designed to compute different treatment effects for each group of units that started treatment (or rather held the event) at different start dates. The group-specific treatment effects can be averaged to estimate an overall treatment effect:

Figure 15: Non-parametric Staggered DiD Estimator

Where the weights p are propensity scores, G is a binary variable that is equal to one for a geographic units first treated in year g, and C is a binary variable equal to one for geographic unit in the comparison group which have never implemented the programme. Equation (Figure 15) gives the treatment effect at time t for the group of geographical units treated at time g, and it is computed by comparing changes in outcomes for group g between periods g-1 to that of a control group of never treated units (C).^{[footnote 24]}

Propensity Score Matching

For the purposes of this analysis, treatment and control area would be broken into geographical units (informed by available data and expected impacts). Each geographical unit from the treatment group is then ‘matched’ with members of the comparison group that share a similar estimated probability of being closely aligned to the event host city (i.e. the propensity score). The propensity score can be estimated using a probit model^{[footnote 25]}:

Figure 16: Probit model for Propsensity Score Matching

This estimates the probability, P, of a geographical unit being within close proximity of the event, D_i=1, given a set of baseline characteristics X_it, recorded from geographical unit i at time t. ϕ is the cumulative distribution function of the standard Normal distribution.

The baseline characteristics (or ‘matching variables’) that are used to estimate the probability that the treatment/ comparison areas are likely to be one of the target areas are set out in Table 4 below.

Table 4: Matching variables used to estimate the propensity score

Indicator	Reason for inclusion	Source
Tourism employment	The cultural or sporting event may generate short term impacts on demand from tourists.	Business Register of Employment Survey, ONS (note that figures before 2015 do not include PAYE only firms and were not included).
Total employment	Short term impacts on demand from tourists and longer-term impacts through increased participation in the workforce.	Business Register of Employment Survey, ONS (note that figures before 2015 do not include PAYE only firms and were not included).
Rateable Value per m²	A proxy for commercial rents, giving an indicator of the underlying productivity of the area.	Valuation Office Agency 2017 ratings list.
Total area of commercial space, m²	A simple measure of the overall productive output of the area (also acts as a proxy for size of zone)	Valuation Office Agency 2017 ratings list.
Share of manufacturing employment (% of total jobs)	A simple measure of the industrial structure of the area.	Business Register of Employment Survey
Jobseekers Allowance Claimants	A measure of slack in the local economy and ability to absorb additional economic growth.	DWP Benefits Database
Long-term Jobseekers Allowance Claimants	Measure of longer term social issues associated with worklessness	DWP Benefits Database
Gross Weekly Earnings (median, local authority level)	Another measure of the underlying productivity of the local economy.	Annual Survey of Hours and Earnings

A comparison group can then be established by matching each treated geographical unit with one (or more) untreated geographical units that have similar values in terms of their propensity score. There are several matching algorithms that can be used to construct a comparison group. A brief overview is presented below:

Nearest neighbour: each treated unit is matched to the nearest (k) unit(s) in the comparison group that is closest in terms of their propensity score. Units within the comparison group can either be used once under ‘no replacement’, or multiple times under ‘replacement’.

Caliper: Using the ‘nearest neighbour’ algorithm, the caliper ensures a minimum quality match between treatment and comparison units in terms of the proximity of propensity scores to one another. Those observations that do not yield a match within a set distance (i.e. the caliper) are dropped from the treatment group. Note this may limit the external validity of the results.

Radius: All comparison units within a set distance of the treated units propensity score are included within the analysis. Where there are no comparison units within the radius, the corresponding treated unit is dropped. Note this may limit the external validity of the results.

Kernel: Every treated subject is matched with a weighted average of control subjects

The outcomes of the treatment and matched comparison units can then be compared over time. Units that are not matched are excluded from the analysis.

PSM can be combined with DiD to increase the robustness of the analysis by controlling for observable and unobservable confounding effects. PSM alone can only account for biases due to systematic differences in observable characteristics – combining the propensity score with DiD increases the robustness by controlling for unobservable time-invariant confounders.

To combine the two approaches, PSM is used to firstly form a robust comparison group (i.e only treatment or comparison units that return a ‘match’ are included in the analysis, all other units are dropped), before using DiD to estimate the treatment effects. Combining the two approaches reduces differences between treated and control units by selecting units that share similar characteristics, and therefore would be expected to exhibit similar outcomes in the absence of the intervention. This will improve the robustness of the analysis.

Synthetic Control

As set out in Abadie and Gardeazabal (2003)^{[footnote 26]} the synthetic control estimates the effect of the intervention of interest (i.e. the major event) by differencing the observed outcome of interest from a weighted combination of the outcome of interest of the donor pool. In the case of one treated unit (e.g. a country), the synthetic estimator would be:

Figure 17: Synthetic Control Estimator

Where τ_1t is the estimated treatment effect for the treated unit (j=1) at time t; Y_1t is the observed outcome of interest for the treated unit; w_j are the weights assigned to the donor pool; and Y_jt are the observed outcomes of interest for the donor pool.

Weights are chosen so that the resulting synthetic control best resembles the pre-intervention values for the treated unit(s) of predictors of the outcome variable. Abadie and Gardeazabal (2003) propose setting synthetic control weights such that the distance between a vector of predictor variables for the treated unit, X₁, and a vector of weighted predictors variables for donor variables, X₀W, is minimised. This is achieved by minimising equation (Figure 18):

Figure 18:

subject to the restriction that weights, W=(w_2’,…,w_J+1), are non-negative and sum to one.

v_h are positive constants that reflect the relative importance of the synthetic control reproducing the values of the predictors for the treated unit (X₁). Abadie and Gardeazabal (2003) recommend selecting the value of v_v such that the mean squared prediction error (MSPE) of the synthetic control with respect to the value of the counterfactual outcome for a set of pre-intervention periods Τ₀:

Figure 19:

Recent additions to the causal inference literature have sought to combine the synthetic counterfactual with the DiD estimator (SDiD). Arkhangelsky et al. (2021)^{[footnote 27]} describe how the SDiD estimator combines attractive feature of both the synthetic control and DiD: using the synthetic control to re-weight and match pre-exposure trends to reduce the reliance on the parallel trends assumptions, and using DiD to remove unit level and time invariant effects.

The SDiD can be implemented by using the synthetic control to estimate the weights for each outcomes variable, where the weights are used in a two-way fixed effects (event study) regression.

Two sets of weights are estimated: unit weights and time weights. The unit weights are constructed such that the average outcome for the treated units shares an approximately similar trend to the weighted average of the control units. Time weights are designed to ensure a constant difference between the average post-treatment outcomes for each of the control units and the weighted average of the pre-treatment outcomes for the same control units.

In this set up, more weight is put on units that are on average similar in terms of their past to the treated units. More weight is also applied to time periods that are on average similar to the treated periods. Arkhangelsky et al. (2021) state that the use of time and unit weights can remove bias and improve precision of the estimate (by placing less weight on time periods that are very different from the post-treatment periods).

Question 10

References

Accepted Answer

Gratton, C., and Preuss, H., 2008. The Conceptualisation and Measurement of Mega Sport Event Legacies. [online] Available at: https://www.researchgate.net/publication/240535313_The_Conceptualisation_of_Measurement_of_Mega_Sport_Event_Legacies ↩
HM Treasury, 2020. The Magenta Book: Guidance for Evaluation. [pdf] Available at: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/879438/HMT_Magenta_Book.pdf ↩
UK Sport, EventIMPACTS, n.d. Available at: https://www.eventimpacts.com/ ↩
Association of Summer Olympic International Federations (ASOIF), n.d. ASOIF Common Indicators for Measuring the Impact of Events. [pdf] Available at: https://www.asoif.com/sites/default/files/download/asoif_common_indicators_for_measuring_the_impact_of_events.pdf ↩
OECD (2023), “How to measure the impact of culture, sports and business events: A guide, Part I”, OECD Local Economic and Employment Development (LEED) Papers, No. 2023/10, OECD Publishing, Paris, https://doi.org/10.1787/c7249496-en. ↩
It would be plausible to expect however that the increased participation and sense of pride are larger closer to the event, with the effect decreasing as the distance from the event decreases – this can be accounted for through distance-decay models and is explored in Section 2.8.1 ↩
Garcia, B., n.d. The benefits of going through the competitive process are shared through interviews with UK Cities of Culture runner-up applicants. UKCC Interviews – Final. [pdf] Available at: https://assets.publishing.service.gov.uk/media/67e3e2cb7fd10a62fac3ea7f/ACCESSIBLE_Copy_for_publication_Dr._Garcia_-UKCC-Interviews-FINAL.pdf -Interviews-FINAL.pdf ↩
Additional permutations would also be included, e.g. Coventry and Bradford would serve as controls for Kingston upon Hull and so on. ↩
Note, the ONS is currently introducing the Integrated Data Service (IDS) as a successor to the SRS. It is not know the extent to which the below issues would be applicable within the IDS. ↩
Using SIC codes defined in ONS (2010) Measuring Tourism Locally ↩
UK Data Service, n.d. Continuous Household Survey. [online] Available at: https://beta.ukdataservice.ac.uk/datacatalogue/series/series?id=200008#!/access-data ↩
Scottish Government, Ipsos MORI. (2021). Scottish Household Survey, 2019. [data collection]. UK Data Service. SN: 8775, DOI: http://doi.org/10.5255/UKDA-SN-8775-1 ↩
Office for National Statistics, Welsh Government. (2024). National Survey for Wales, 2022-2023. [data collection]. 2nd Edition. UK Data Service. SN: 9144, DOI: http://doi.org/10.5255/UKDA-SN-9144-2 ↩
Sport England, Ipsos. (2025). Active Lives Adults Survey, 2020-2021. [data collection]. 2nd Edition. UK Data Service. SN: 8993, DOI: http://doi.org/10.5255/UKDA-SN-8993-2 ↩
Department for Digital, Culture, Media and Sport. (2024). Taking Part: the National Survey of Culture, Leisure and Sport, 2019-2020; Adult and Child Data. [data collection]. 2nd Edition. UK Data Service. SN: 8745, DOI: http://doi.org/10.5255/UKDA-SN-8745-2 ↩
ScotCen Social Research. (2023). Scottish Health Survey, 2021: Special Licence Access. [data collection]. UK Data Service. SN: 9083, DOI: http://doi.org/10.5255/UKDA-SN-9083-1 ↩
Note, at the time of writing the ONS is currently transitioning from the SRS to the Integrated Data Service (IDS). We anticipate that the two will serve a similar function and will be built on the same data handling principles – however we urge practitioners to independently research this transition to understand the likely impact on their own research. ↩
The GBTS can be accessed here: https://gbtsenglandlightviewer.kantar.com/ViewTable.aspx ↩
Department for Work & Pensions, n.d. Stat-Xplore - User Guide. [online] Available at: https://stat-xplore.dwp.gov.uk/webapi/online-help/User-Guide.html ↩
UK Government, n.d. Live Tables on House Building. [online] Available at: https://www.gov.uk/government/statistical-data-sets/live-tables-on-house-building ↩
HM Land Registry, n.d. Price Paid Data. [online] Available at: https://landregistry.data.gov.uk/app/ppd/ ↩
HM Revenue & Customs, n.d. Regional Trade Data. [online] Available at: https://www.uktradeinfo.com/trade-data/regional/ ↩
Nomis, n.d. Review of the Labour Market Statistics. [online] Available at: https://www.nomisweb.co.uk/articles/1342.aspx ↩
This can easily be adapted to use only not-yet-treated units for cases where the pool of never-treated units are small or non-existence (such as in a pipeline design). ↩
Note, a Logit model could also be used to estimate the propensity score, without affecting the size and significance of the coefficients in the propensity score. Logit and Probit models typically yield very similar results in the absence of extreme outliers. ↩
Abadie, A. and Gardaezabal, J., 2003. The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1), pp.113-132. ↩
Arkhangelsky, D., Athey, S., Hirshberg, D., Imbens, G. and Wager, S. (2021) ‘Synthetic Difference-in-Differences’, American Economic Review, 11(12),pp.4088-4118. ↩

Cookies on GOV.UK