Research and analysis

Capturing engagement numbers - strand 1 report: annex 3: Strava case study

Published 13 March 2026

This report was authored by Jack Medlock, Hannah M. P. Stock, Andrew Knight, Donna Phillips, Adam L. Ozer, and Joseph Stordy at Verian, Dr Michael Sinclair, Dr Craig Macdonald, and Prof Iadh Ounis at The University of Glasgow, and Faculty.

This research was supported by the R&D Science and Analysis Programme at the Department for Culture, Media & Sport (DCMS). It was developed and produced according to the research team’s hypotheses and methods between October 2023 and June 2025. Any primary research, subsequent findings or recommendations do not represent UK Government views or policy.

1. Executive Summary

This case study describes a method of working with Strava Metro to estimate participation in parkruns, and how that method could be generalised to similar events. Strava Metro (hereafter referred to simply as Strava) is an example of an activity data source. Strava data best represents those who engage in what are termed ‘human powered’ activities and who take the time to self-report their data into the Strava app. For this case study, parkrun data was used as a ground truth[footnote 1]:

  • against which to test the Strava data

  • to provide the scaling factor to produce estimates (which could apply to a range of locations)

The case study will provide an overview of Strava, the rationale for why it was prioritised, the methodology, the results and any limitations and challenges encountered. At a high-level, initial analysis shows:

  • The data is most suitable for sporting and recreational activities.

  • Using parkrun data from two different Edinburgh-based parkruns, there is a moderate positive correlation between the estimates and the recorded number of participants, indicating the method is well-suited to this type of event.

  • Unexpected predictions by the model are often adequately explained by missing parkrun data or altered event times.

  • However, it appears calibration is necessary on a per event level as even similar events (such as two parkruns at different locations) can have different scaling.

  • Running this exercise at a larger number of events – comparing recorded attendance with estimates via the Strava method, and refining the method based on the results – will provide more robust estimates across a range of events.

  • The Strava data can be used for other events/spaces provided it is applied to those taking place in parks/green spaces, which is where it best estimates the use of human powered transport.

The high-level performance of the model was evidenced by its ability to accurately estimate usage of the app to record activity on Saturdays, which is when recreational and sporting activity is at its busiest in parks and green spaces. The model tested predicted attendance at Edinburgh parkrun in 2023 using corresponding Strava data from the same year with a moderate positive correlation (r=0.491)[footnote 2].

Given the result, it is likely Strava data could be used in other events and locations to help calibrate estimates by providing additional data into the models if ground truths exist to test against.

2. Introduction

2.1 Background

The Department for Culture, Media and Sport (DCMS) is exploring new methods to measure engagement with cultural and sporting events and activities to understand and facilitate engagement across its sectors.

Currently the data has been mostly collected using traditional methods. At an event-specific level this has been done, for instance, by manually counting attendees. At a broader level, engagement is measured via surveys, which have several limitations, including their cost, small sample sizes and recall bias. That being said, well-designed and comparatively large surveys, such as DCMS’s Participation Survey and Sport England’s Active Lives Survey, give robust measures of engagement at a nationally representative level for a range of cultural and sporting activities.

The main challenge for estimating engagement with sports is with un-ticketed events, particularly those with no point of entry, which might include playing for a local sports team or taking part in locally organised sports and recreation activities using small scale venues / facilities. This challenge also extends to taking part in a drop-in activity in an open area as part of a larger cultural programme of events.

DCMS commissioned a research team led by Verian and including world-leading data scientists at Faculty AI and academic experts from the School of Computer Science and Urban Big Data Centre at the University of Glasgow to undertake an R&D Science and Analysis project on ‘capturing engagement numbers’. This project provides an opportunity to look at new methods and data sources (for example, using mobile app data or social media posts), which could go a long way to strengthening the insights gained from existing methods and filling some of the current gaps in the ability to estimate attendance at events, spaces and locations across the UK.

Faculty AI and Verian sought and gained access to the Strava data through a dashboard provided by Strava following a referral by Sport England during the ‘Breadth Phase’ of the project. Free access was granted following a letter of recommendation from DCMS setting out the aims of the project. The Strava data accessed for initial testing dates from 2019 to the present and covers Edinburgh and the surrounding countryside. In the process of applying for access, we were required to give an initial area of study. Given the ticketed and un-ticketed events we were considering in and around the city, Edinburgh was considered a suitable choice. Our current understanding is that access to the Strava data is limited to the chosen geographic area of study. This means data cannot be extracted through the dashboard on a country-wide basis, or by activity type, for example.

2.2 Research Assessment Criteria

The research criteria below are taken from the initial research questions from the Breadth Phase of Strand 1 and adopted as the most appropriate ways to assess data sources for prioritisation. Detailed below are the conclusions against the research assessment criteria:

Accuracy: Prior to testing, accuracy was established by reviewing uses in published peer-reviewed literature and through discussion with Sport England, who utilise Strava data. The method was able to predict attendance at Edinburgh parkrun in 2023 using corresponding Strava data from the same year with a moderate positive correlation (r=0.491).

Bias: The data is self-reported via the Strava application, so there would be a bias towards those who engage in sporting and recreational activities and digitally record them, as well as towards demographics more likely to take part in activities recorded on Strava (particularly cycling semi-competitively). An additional consideration is rural populations, where poorer service means GPS is less accurate, resulting in poorer quality data (and therefore estimates) in the modelling outputs.

Ethics: Results are aggregated to the nearest 5 so individuals cannot be personally identified. Further modelling work will be required in subsequent phases to identify the specific threshold at which there is confidence individuals cannot be re-identified even if estimates are combined with other data sources.

Deliverability: Useful for green and open spaces but not applicable to urban or built-up areas.

Cost: There was no cost for access, given an agreement negotiated between the consortium, DCMS and Strava. However, future use may come with a cost (depending on future agreements between DCMS and Strava). A low-cost machine learning model was used, meaning any compute costs would be negligible.

Demographics: By geographically linking the areas from which people travel to events with Census demographic data, it is possible to determine the demographics of the groups more likely to be engaging with these events.

Other sources of activity data were assessed by this project, including Garmin, which helps track a user’s health and wellness. Like Strava, it relies on self-reporting by users, for instance entering daily health and fitness data into an app. However, Garmin was deprioritised because its Activity API did not provide aggregate data, which is required both for robust modelling approaches and to protect privacy. Strava could provide aggregated access through its dashboard.

3. Strava – Physical Activity Data

3.1 Overview of Strava data

Physical activity data covers all data captured about sporting and recreational activities, whether it is using self-reported app data or wearable technology, which can also overlap. Strava is largely an example of the former. In addition to providing raw counts of individuals travelling by a particular mode of transport, some data providers also deliver locational insights, such as the start and end destination of the individual. As such, this data can be used for two purposes:

  • as a parameter to estimate attendance and participation at an event

  • to determine where the audience is coming from, which, in turn, can enable estimates of audience demographics by geographically linking Census demographic data with the areas from which people travel to events.

Strava data is aggregated data recorded by users of the Strava app, with activity separated into cycling (ride/e-bike ride) and pedestrian (walk/run/hike). Events which involve multiple participants engaging in a physical activity, such as parkrun, often generate extensive data within Strava as participants seek to record their performance. Therefore, by analysing Strava data correlated to a known event, an estimate of the event’s attendance can be generated. These estimates can be compared to the known attendance to determine the model’s accuracy. As detailed within this case study, this approach was applied to the Edinburgh parkrun using the corresponding Strava data from 2023 and was able to predict the event’s attendance with a moderate positive correlation (r=0.491).

A further benefit of Strava is that it can give insights into the number of individual users recording trips and their demographics, such as age group and sex. Using Strava it may be possible to analyse those more likely to be participating in running, cycling and walking, especially in parks and green spaces, where the app is most in use.

Potential use cases of the Strava data are:

  • Measuring participation at public sporting events

    • Competitive running and cycling events on public streets, paths and in parks

    • Community runs and walks, including charity walks

    • Parkruns and other local or community organised active events

3.2 Strava data: Content and Format

The data is presented in a dashboard which shows a map, filter functions and the data on the right-hand side, as seen below in figure 1.

Figure 1: Strava data dashboard

4. Methodology for Strava data

4.1 Central Question

The primary objective of this project is to estimate attendance at cultural or sporting events and locations using only secondary data sources (i.e. with no direct monitoring of attendance) describing usage patterns in the area at the time of the event.

Models are built using monitored events with a reliable record of attendance to obtain a clear picture of how initial estimates produced compare to a reliable baseline. This process allows for the development of a scaling factor for various events that will then allow for estimating attendance at unmonitored events.

The approach developed and tested to date in Strand 1 means the scaling factor would need to be re-calibrated to predict attendance at a non-parkrun event in this location. This is because the scaling factor was derived only from parkrun data; applying it to another event would assume that Strava usage prevalence at the new event is the same as at the parkrun. Future development work in Strand 2 will therefore be geared towards developing a new scaling factor based on comparable one-off events for which a true attendance is available.

This will mean developing a model iteration with a scaling factor trained across many different running events, both parkrun and non-parkrun, and across a wider range of locations. This modelling approach should generalise better and predict attendance at a variety of races, starting with the Great North Run.

This case study sets out the first stage in this process by examining attendance at parkruns in Edinburgh in 2023, using local Strava usage data as our secondary data source. Since Strava is used primarily for recording recreational exercise, this is a sensible choice of target event. Parkruns also have weekly finisher counts that event organisers publish online (and so are publicly available), which can be used to define a scaling factor for parkruns and similar events, as well as to validate the methodology.

Whilst the Strava data only covers a limited use case, methods developed for the Strava data can be used for other similar data sources, i.e. other sources which record usage or attendance in a set of locations over periods of time. Thus, when developing a method for the Strava data, analysts should keep in mind this possible extension to other data sources.

4.2 Challenge

There are two primary issues with the Strava data (and likely with any similar data source):

  • Not everyone who attends an event will use Strava to record their presence in the event area.

  • Not everyone using Strava in the event area is attending the event.

Any method developed must account for both issues. Note that for the second, it may be true that for a closed event, all persons in the event area are attending the event, but since most closed events are ticketed in some way, events of this type are unlikely to be considered for this project.

In addition to the primary issues, there are issues of secondary concern which apply specifically to this Strava dataset but may apply to other datasets as well:

  • Results are aggregated to the nearest 5 to maintain data privacy. Though this effect is largely inconsequential when usage is high, for smaller usage it can have a pronounced effect on estimates.

  • Not all paths through the park have recorded usage, and it is unclear whether this is due to Strava not recording them as paths or whether they are just unused by users of Strava.

  • There is double counting: if a user passes in one direction on a path and returns the other way, they are counted twice. There may also be higher-order multiple counting in certain situations.

These features and limitations of the dataset inform our choice of methodology.
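The impact of the nearest-5 aggregation can be illustrated with a short sketch. The counts below are illustrative, not values from the dataset:

```python
def round_to_5(count):
    """Aggregate a usage count to the nearest 5, as in the dashboard."""
    return 5 * round(count / 5)

# For integer counts the absolute rounding error is at most 2, so the
# relative error shrinks as usage grows: large for a count of 3,
# negligible for a count of 212.
for true_count in (3, 48, 212):
    reported = round_to_5(true_count)
    rel_err = abs(reported - true_count) / true_count
    print(true_count, reported, f"{rel_err:.1%}")
```

This is why estimates on low-traffic paths are noticeably less reliable than on busy ones.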

4.3 Methodology description

The methodology is as follows, see diagram at figure 2:

Figure 2: Strava methodology

  • Estimate the background Strava usage at the time of the event: Train a model M to estimate expected Strava usage M(E) at time of event E, see below for training details. This estimates the background usage at the time of the event, i.e. the predicted usage in the absence of an event.

  • Capture Strava usage during the event: Calculate the actual Strava usage S(E) at the time of the event from the Strava data.

  • Estimate event attendance of Strava users: The estimated event attendance A(E) is the difference A(E) = k(S(E) - M(E)), where k is a scaling factor dependent on the event.

  • Estimate the scaling factor: Estimate the scaling factor k using either ground truth data for certain days of the event if possible, or by using scaling factors from other similar events where attendance is monitored.

The methodology accounts for the two primary issues raised above. Subtracting the predicted usage in the absence of an event from the actual usage at the time of the event removes the issue of including Strava users at the location who are not attending the event. The scaling factor, specific to the event or event type, accounts for attendees who don’t use Strava, and resolves the double-counting issue.
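The estimation step can be sketched as a simple function. The names and numbers below are illustrative, not taken from the report’s pipeline:

```python
def estimate_attendance(actual_usage, predicted_background, scale_factor):
    """Estimate event attendance A(E) = k * (S(E) - M(E)).

    actual_usage         -- S(E): Strava usage observed at the event time
    predicted_background -- M(E): model-predicted usage with no event
    scale_factor         -- k: event-specific calibration factor
    """
    excess = actual_usage - predicted_background
    # Negative excess means usage fell below the no-event baseline,
    # so floor the attendance estimate at zero.
    return max(0.0, scale_factor * excess)

# Illustrative numbers only: 300 observed users, 40 expected background
# users, and the Edinburgh parkrun scale factor k = 1.101.
print(estimate_attendance(300, 40, 1.101))  # ≈ 286 attendees
```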

4.4 Training details

The model is trained to predict usage on a given day at a given time using recorded usage from the prior week. The training data used is a subset of paths in the case study area, in this case Edinburgh. The model learns how usage patterns in the prior week are likely to influence usage patterns on the event day, had the event not been held. Consequently, the model is trained to predict the baseline usage in the absence of the event.

XGBoost, a high-performance boosted decision tree model (a type of machine learning), was used for this case study. However, this can be substituted for any model that can be trained with regression as a loss function. XGBoost was selected because it is known to be one of the fastest yet highest performance models available for tasks such as this.

The dataset contained a series of items, each of which is a path in Edinburgh with its hourly usage over the course of 1 week. The dataset was split into a training and test set by geographic location of the path. This is done to avoid contamination of the training and test sets caused by nearby or connecting paths having dependent usage statistics.

Edinburgh is divided into 4 quadrants (North-East, North-West, South-West, South-East) and, since the Edinburgh parkrun is in the North-East of Edinburgh, we assign the North-East quadrant as the test set and the other 3 as the training set. Since the parkrun is held on Saturday mornings at 9am, the Saturday usage is estimated from the prior Sunday-Friday usage. It makes sense to isolate the Saturday usage in case there are aftereffects from the event being held, such as the area being less used later in the day due to event clean-up, or a larger proportion of those who would normally run the route having already run it in the morning.
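The quadrant-based split can be sketched as follows. The centre coordinate and path records are illustrative assumptions, not values from the report’s implementation:

```python
# Sketch of the geographic train/test split: paths are assigned to
# compass quadrants around a reference point, and the North-East
# quadrant is held out as the test set.
CENTRE = (55.9533, -3.1883)  # assumed Edinburgh centre (lat, lon)

def quadrant(lat, lon, centre=CENTRE):
    """Compass quadrant of a path relative to the centre point."""
    ns = "N" if lat >= centre[0] else "S"
    ew = "E" if lon >= centre[1] else "W"
    return ns + ew

def split_paths(paths):
    """Hold out North-East paths as the test set; train on the rest."""
    train, test = [], []
    for path_id, lat, lon in paths:
        (test if quadrant(lat, lon) == "NE" else train).append(path_id)
    return train, test

paths = [("a", 55.98, -3.10), ("b", 55.93, -3.25), ("c", 55.97, -3.30)]
train_ids, test_ids = split_paths(paths)
print(train_ids, test_ids)  # ['b', 'c'] ['a']
```

Splitting by geography rather than at random avoids leakage between nearby, connected paths whose usage is correlated.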

5. Results

The main conclusions at this stage of development are as follows:

  • The data is most suitable for sporting and recreational activities.

  • A moderate positive correlation with the recorded number of finishers has been achieved using parkrun data from two different Edinburgh-based parkruns. This indicates that the method is suitable for this type of event.

  • Many unexpected predictions by the model are adequately explained by missing parkrun data or altered event times.

  • However, calibration is necessary on a per-event level as even similar events (such as two parkruns at different locations) can have different scaling. This indicates that it is necessary to record attendance manually at a small sample of events so the methodology can be applied more broadly to all events within a location.

5.1 XGBoost training results

At this stage, for paths and weeks without parkruns or other events, the model does a good job of estimating Saturday usage. This claim is validated by analysis of three randomly sampled paths from the test set, shown with their weekly usage in blue and the estimated Saturday usage in dashed orange.

Figure 3: Daily Strava usage (blue) and model estimated Saturday usage (orange) at a medium traffic area (local Edinburgh park path)

Figure 4: Daily Strava usage (blue) and model estimated Saturday usage (orange) at a low traffic area (small side street)

Figure 5: Daily Strava usage (blue) and model estimated Saturday usage (orange) at a higher traffic area (Water of Leith riverside path)

The model estimates Saturday usage, and performance is better and more reliable when Strava usage is higher. This is likely due to the aggregating of data: Strava usage numbers are aggregated to the nearest 5 persons which has a much larger effect on the data when usage is low than when it is high.

There is a greater disparity between the prediction and actual usage for the location of the parkrun. This disparity is used to estimate the true parkrun attendance.

Figure 6: Daily Strava usage (blue) and model estimated Saturday usage (orange) at the parkrun location. Unlike in figures 3-5, the model drastically underestimates the usage, since the data from the rest of the week does not predict that a parkrun is happening

5.2 Estimating parkrun attendance

The real test of the model is how it performs as part of the pipeline to estimate parkrun attendance. Comparing the estimated 0900 Strava usage to the actual 0900 Strava usage, we obtain the following results across the year 2023:

Figure 7: True Strava usage (blue) against model-estimated Strava usage (yellow) for Saturdays at 0900 at the parkrun location in the year 2023. Note that the model-estimated Strava usage is much lower, since the model does not have access to the information that a parkrun is happening on that date. This is by design.

Figure 8: Model-estimated parkrun attendance (green) against parkrun recorded finishers (red) for the year 2023. The model-estimated parkrun attendance is formed by subtracting the yellow line from the blue line in figure 7. We can see a good correlation between the two lines.

Overall, we have a moderate positive correlation (r=0.491) between the model-estimated attendance and the ground truth number of finishers over the year 2023. At this stage, we do not expect the two lines to be at the same height as we have not accounted for scaling (see Generalisability).

It is also worth noting that there are several weeks (weeks 3, 39, 41, 47, 51) where no parkrun occurred. This is corroborated by the substantially lower usage numbers.

Contributing negatively to the correlation score are the two anomalous weeks 17 and 32, which respectively featured dramatically higher and lower estimated attendance than number of finishers. In the latter case, this may be due to the event starting half an hour later than usual that week. Analysing the data for week 17, there were other days in the prior week with anomalous usage, indicating the possible presence of other events. This could have caused the dramatically higher estimated usage that week, highlighting one of the failure modes of this methodology. Further work could attempt to address these sorts of failures.

We estimate the scale factor by finding the value of k which minimises the mean absolute error between the scaled estimate k(S(E) - M(E)) (see methodology above) and the number of finishers. In this case, this is achieved with a scale factor of k=1.101. After scaling, the mean absolute error is reduced from 63.41 to 44.06.
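The scale-factor fit can be sketched as a simple grid search over candidate values of k. The data below is illustrative, constructed so the true factor is 1.1:

```python
def best_scale_factor(estimates, finishers, k_grid):
    """Return the k in k_grid minimising the mean absolute error
    between scaled estimates and recorded finishers."""
    def mae(k):
        return sum(abs(k * a - f) for a, f in zip(estimates, finishers)) / len(estimates)
    return min(k_grid, key=mae)

# Illustrative data constructed so finishers = 1.1 * raw estimate.
estimates = [200, 260, 240, 310]
finishers = [220, 286, 264, 341]
grid = [i / 1000 for i in range(800, 1301)]  # candidate k in [0.8, 1.3]
k = best_scale_factor(estimates, finishers, grid)
print(k)  # 1.1
```

A grid search is sufficient here because the MAE is convex in k; a closed-form or continuous optimiser could substitute for finer precision.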

Figure 9: Similar to figure 8, but with scaling applied to the green line to obtain the scaled estimated parkrun attendance (blue). This increases the accuracy of the predictions, though ground truth data is required to perform such scaling.

5.3 Generalisability

Given the results above, the natural question is whether they generalise. Our findings are as follows:

  • The method generalises well, showing equally strong results on a second Edinburgh parkrun (see below).

  • However, the scale factor does not generalise well across the two parkruns, with substantially different scale factors being recorded for each.

Figure 10: Portobello parkrun, eastern Edinburgh - recorded finishers compared to model-estimated attendance

The graph above shows recorded finishers for the Portobello parkrun in eastern Edinburgh compared with the model-estimated parkrun attendance, computed in a similar way to the above. There is a stronger correlation, with a higher Pearson r-coefficient of 0.598. However, the estimated attendance is higher than the parkrun recorded finishers, giving a scale factor of k=0.8156 (as opposed to k=1.101 above for the Edinburgh parkrun).

This shows that the scale factor does not generalise smoothly even across similar events such as parkruns and hence must be computed from ground truth data for each event. Possible explanations for this discrepancy are:

  • Strava records data differently depending on the paths. There may be hidden irregularities in the way the data is recorded or aggregated that result in double counting in certain situations.

  • Runners in some parkruns are more likely to use Strava than in other parkruns, possibly due to demographic differences in participation.

Work was attempted to analyse the Holyrood parkrun; however, due to the limitations of the OSM (OpenStreetMap) reference system, it was not possible to retrieve consistent usage information for this parkrun, possibly due to the way the course is split over two neighbouring parallel paths. This highlights an additional limitation of the data source for this purpose.

6. Considerations and Limitations

OSM reference ID: The ID is inconsistently recorded, potentially limiting wider implementation of the methodology. Mitigation: unclear. Further development work and model testing are required to explore workarounds, as well as follow-up activity with Strava and potentially Sport England to understand further.

Operational issue: It is unclear how Strava deals with double counting, meaning this methodology could over- or under-report attendance. Mitigation: unclear. Further development work and model testing are required to explore workarounds, as well as follow-up activity with Strava to understand further.

Accuracy: Some ground truth data is needed to calibrate results, which negates use of the method for one-off events and limits its wider applicability. Mitigation: further testing is required to explore the development of scaling factors.

Limited applicability: Strava is only useful for running, cycling and similar exercise. Mitigation: the nature of the data source means it is only intended for sporting/recreational activities, so care should be taken to ensure methods developed using Strava are applied to appropriate use cases.

7. Conclusion and Future Steps

Strava data is best positioned for use when estimating attendance at sporting and recreational activities in parks and open spaces, as it captures those who engage in human powered activities and who take the time to self-report their data into the Strava app. The method applied to the Edinburgh parkrun using the corresponding Strava data from 2023 predicted the event’s attendance with a moderate positive correlation (r=0.491). This was possible because the parkrun data served as a ground truth against which the estimate could be tested. It shows that the Strava data could be used for other events and locations if ground truths exist for each event to test against. Without ground truth data for the specific locations in question, it will be difficult to estimate attendance with Strava data alone at this stage of our development. This results from several factors:

  • Not all those participating in recreational and sporting activities will use Strava

  • Not everyone using Strava is necessarily attending a specific event

  • Strava records data differently depending on the paths in use and not all paths through the park have recorded usage; and it is unclear whether this is due to Strava not recording them as paths or whether they are just unused by users of Strava

  • There is double counting: if a user passes in one direction on a path and returns the other way, they are counted twice. There may also be higher-order multiple counting in certain situations. Therefore, the numbers counted may not be an accurate reflection of the actual number of people.

Overall, Strava is a valuable source which is worthy of continued experimentation while considering its limitations. The best way to maximise its potential is to apply it to other events/spaces/locations across the country. This would mean engaging with Strava representatives to expand the current remit beyond Edinburgh.

  1. Ground truth data is the baseline data against which the Strava data is tested. The closer the results produced by the model using the Strava data were to the parkrun data, the better the model’s performance. Without ground truth it is hard to know whether a model has produced accurate results in testing.

  2. A moderate correlation is one which typically falls between 0.3 and 0.7, suggesting a moderate relationship between the parkrun data and the Strava data from the same year. The modelling in this case study falls within this range.