Capturing engagement numbers - methodology toolkit
Published 13 March 2026
This report was authored by Jack Medlock, Hannah M. P. Stock, Andrew Knight, Donna Phillips, Adam L. Ozer, and Joseph Stordy at Verian, Dr Michael Sinclair, Dr Craig Macdonald, and Prof Iadh Ounis at The University of Glasgow, and Faculty.
This research was supported by the R&D Science and Analysis Programme at the Department for Culture, Media & Sport (DCMS). It was developed and produced according to the research team’s hypotheses and methods between October 2023 and June 2025. Any primary research, subsequent findings or recommendations do not represent UK Government views or policy.
1. Executive Summary
While ticketing data gives a good understanding of engagement with ticketed cultural and sporting events, measuring engagement at unticketed events is difficult and often relies on surveys. This is particularly true of unticketed events where points of entry are hard to measure, as this limits the potential to manually count attendees in a given time.
Although survey data is of good quality, can provide demographic insight and provides a replicable approach, surveys come with their own challenges and limitations. They can suffer from recall bias and can be limited in their ability to comprehensively measure local participation at specific events or spaces. At locations where traditional methods such as ticket sales or crowd counts are not possible, new data-driven methods may be more accurate and cost effective than ‘analogue’ approaches.
Further challenges are presented when trying to assess engagement, where information such as demographics and duration of attendance would be useful. This project has developed and tested the viability of new methods of measuring participation to overcome these challenges and understand whether they can complement or improve on existing methods.
As part of this programme, this methodology toolkit has been developed to support Government analysts through the key steps involved in leveraging modern, data-intensive approaches to predict attendance at sporting and cultural events. It also documents the key decision-points analysts will be confronted by when deploying these methodologies and provides guidance to help inform those decisions. Key considerations outlined in this toolkit include:
- How to identify data sources suitable for use in predicting attendance
- How to evaluate the viability of a range of data sources for different use cases and event types
- How to safely and securely access data sources
- How to select the optimal data source and approach, given event type and available data
- How to evaluate the performance of the data set in the methodology developed
Taken together, these sections seek to equip analysts with the insights and approaches necessary to predict attendance at sporting and cultural events, including highlighting key decision-points at each stage of the process.
A key outcome of this research was to help analysts to evaluate the suitability of data sources for predicting attendance at events with varying characteristics. The diagram below is a decision-tree that seeks to summarise a representative set of decisions and pathways a team of analysts might be confronted with in seeking to select a data set to predict attendance at an event.
Figure 1: Data source selection decision tree
Importantly, the decision tree does not exhaustively cover all possible considerations for all events, nor are the pathways mutually exclusive. Instead, it represents a range of possible conditions and scenarios analysts may be confronted with when predicting attendance at a hypothetical event and, for those cases, recommends the data types that might be useful.
Building on the decision tree, the table below summarises the strengths and weaknesses of the four key data categories explored in this research against two main groups of metrics:
Analytical scope: Metrics measuring the extent to which the data type is suitable for predicting attendance at events of different event sizes, durations, types and locations.
| | Social Media Data | Mobile App Data | Activity Data | Aerial Photography Data |
|---|---|---|---|---|
| Event Size | Suitable for larger events only – minimum c.1000 attendees | Suitable for events of all sizes | Suitable for events of all sizes | Suitable for events of all sizes |
| Event Duration | Suitable for events of all durations | Suitable only for long running events over multiple days | Suitable only for short events running over several hours | Suitable only for short events running over several hours |
| Event Type | Suitable for sporting and cultural events | Suitable for sporting and cultural events | Suitable for sporting events only | Suitable for sporting and cultural events |
| Event Location | Suitable for rural and urban events at indoor and outdoor locations | Suitable for rural and urban events at indoor and outdoor locations (where a good mobile signal is available) | Suitable for rural and urban events at indoor and outdoor locations | Background features of some rural, urban, indoor and outdoor events limits suitability |
Operational constraints: Metrics measuring the budgetary, ethical and accessibility considerations associated with using that data source to predict attendance.
| | Social Media Data | Mobile App Data | Activity Data | Aerial Photography Data |
|---|---|---|---|---|
| Budget | Data subscription in low thousands per location | Data subscription in low thousands per location | Data subscription freely available for research purposes | Data accessible freely online or in agreement with event |
| Ethics | Manageable risk of personally identifiable information | Manageable risk of personally identifiable information | Anonymised data reduces risk of personally identifiable information | Manageable risk of personally identifiable information |
| Access | Access requires interaction with platform APIs or working through a managed service like Pulsar | Access requires participants to have network-connected mobile devices | Access requires working through the Strava Metro platform. Data download times are slow | High-vantage aerial footage of crowds that lasts full event duration is hard to access |
2. Introduction
Social researchers at Verian, in partnership with academic experts at the University of Glasgow and data scientists from Faculty AI, have led a research and development (R&D) study into new methods to measure participation and attendance at sporting or cultural events.
Survey-based approaches have traditionally been used to capture engagement numbers, owing to their broad reach, potential for statistical power and ability to track change over time. They are, however, potentially resource intensive and in some scenarios less able to apply the same level of measurement achieved on a larger scale to smaller local, unticketed events. This research focused on the extent to which that insight can be provided by alternative means, specifically combining non-survey data with statistical modelling techniques, including using Artificial Intelligence. The objective was to assess the extent to which the methods developed can provide a useful measure of engagement at unticketed events or locations and activities outside the home (for example, participating in sporting activities in a park).
A wide range of data sources were considered in this research. However, the following toolkit is deliberately developed to be agnostic towards any specific data sources, sporting or cultural event, or methodology. The considerations and recommendations outlined here should be broadly applicable to as many combinations of data, methodologies and event-types as might be in scope for this research. In this way the toolkit is intended to be ‘future-proof’ and useful to analysts even as the specific data sources and events change.
However, this toolkit was developed in the context of a research project that developed bespoke modelling approaches to predict attendance at five specific target events: Bradford City of Culture, the British Museum, the Giant’s Causeway, Women’s Euros Final screening and the Great North Run. Accordingly, throughout the report there are documented specific tips-and-tricks that reflect on how the research team worked with the datasets to predict attendance at these specific locations.
Overall, the objective has been to create a generalisable body of guidance that can be referenced by analysts to help identify, access, model and evaluate data as part of a modelling approach to predict attendance at a wide range of sporting and cultural events.
3. Identifying data sources
The research underpinning this toolkit identified a longlist (provided separately in Strand 1 as the data catalogue) of key data sources potentially useful for predicting attendance at unticketed events. The data sources were categorised to enable broader discussion and development of methodology according to the way in which the data was sourced and could be utilised.
These categories include:
- Mobile app data: third-party providers, eSIMs, Wi-Fi connections.
- Social media data: from third parties or via social media companies' APIs.
- Activity and transport data: activity tracking apps, parking data, traffic monitoring, passenger numbers on trains.
- Deployable sensing data: fixed cameras, drones, wearables, radio frequency identification tags.
Sections 3A to 3C below provide a summary of each data category, including an indication of when it would be appropriate to use it, and examples of specific data sources and vendors of interest.
3.A. Identified data categories
Mobile phone app location data
Summary: Mobile phone app data can be utilised by working with providers who aggregate individual-level GPS data obtained from the usage of mobile applications on GPS-enabled devices. The data are routinely gathered by third party companies through Software Development Kits (SDKs), spanning a broad spectrum of mobile phone applications, including those for navigation, health, shopping, and weather, all under the umbrella of informed consent.
When to use it: Best suited to predicting attendance at long-running events that recur in a given location. For example, it performed well at predicting annual figures for attendance at the Giant’s Causeway and British Museum. The aggregated nature of mobile app data makes it particularly useful over longer timeframes, including seasonal trends. It also provides insight into average dwell times, allowing analysis on the use of the selected space.
It is less useful for shorter time periods (short-lived events lasting less than one day), and less accurate at events or locations where attendees will not have access to mobile cellular connectivity, which will further limit data availability.
Data sources and vendors: Several data brokers provide these types of data insight, including Huq (tested in this project), ActiveXChange, Echo Analytics, UniCast and CKDelta.
Social media data
Summary: The tendency of social media users to share their real-world experiences of sporting and cultural events online means social media data can be utilised as a means of capturing engagement with real-world events. This data takes three main forms: text associated with a post; images or videos; and associated metadata, such as timestamp, author, engagement statistics and sometimes geo-location. These data can be accessed either by keyword searches or via channels.
When to use it: Social media data can perform well at predicting attendance at both one-off and long running events, where those events are the subject of many online posts from visitors and attendees. When deployed in this research, the social media methodology produced its most accurate results in predicting annual figures for the British Museum and participants in the 2024 Great North Run.
Social media data can be analysed to detect implicit occurrences of users engaging with events and locations, and indicating they attended in person. Some social media platforms (such as Instagram) are potentially useful in that they allow for the ingestion of real time posts about in-motion events.
Using social media data requires a minimum digital event footprint, i.e., a minimum number of people posting about the event online. Use cases centred on particularly small events, associated with fewer online posts, have poorer predictive outcomes with this data. User-level privacy settings and the default configuration of data provider platforms also mean that geolocation data is present on only 1-2% of X (formerly known as Twitter) posts.
Data sources and vendors: In recent years, social media platforms have been tightening access controls to their APIs, either by creating stricter access policies in the case of Meta or creating highly expensive pricing structures for access in the case of X (formerly known as Twitter). This restricts the social media data available for analysis compared to previous years.
This has a knock-on effect on the deployable types of analytics. One route to accessing data in alignment with social media platform rules is to use third party aggregation providers. Pulsar, for example, provide a managed service that facilitates large-scale querying of social media data in compliance with the policies of individual platforms. The Pulsar platform provided the social media data tested in this project.
Other social media aggregators provide similar platforms, including Brandwatch, for example.
Activity and transport data
Summary: Flows of people travelling as part of sporting activities or using private (e.g., cars, bikes, walking) and public (e.g., buses, trains, public cycle hire) transport are recorded by a range of data providers, which can be identified and accessed by analysts.
When to use it: The Strava platform gives insights into the number of unique users recording trips and their demographics, meaning that it is possible to track participation in running, cycling, and other forms of active human-powered travel. Activity data performs best at predicting participation in one-off sporting events. Using this data predicted participation in the 2024 Great North Run with a 26% mean percentage error and was more accurate for smaller-scale local Parkruns.
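Error figures like the 26% mean percentage error quoted above can be computed with a small helper. The sketch below is illustrative only: the prediction and official figures are invented, and it uses the absolute form of the metric (MAPE), which is one common reading of "mean percentage error".

```python
def mean_percentage_error(predictions, actuals):
    """Mean of |predicted - actual| / actual, expressed as a percentage."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be the same length")
    errors = [abs(p - a) / a for p, a in zip(predictions, actuals)]
    return 100 * sum(errors) / len(errors)

# Hypothetical example: predicted vs. official participant counts
# for two events of very different sizes.
predicted = [74_000, 520]
official = [60_000, 500]
print(round(mean_percentage_error(predicted, official), 1))  # → 13.7
```

Because each event's error is normalised by its own actual figure, the metric weights a large race and a small Parkrun equally, which is usually the desired behaviour when comparing accuracy across events of different scales.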
Data sources and vendors: Strava Metro was used for this project. It is a data platform maintained by Strava that aggregates activity data for public agencies, in part to support the provision of transport infrastructure for cyclists and pedestrians. Transport providers can be reticent to share their data, especially when compared with counterparts in other industries such as the activity data space. However, several providers make transport and activity data available to customers, including TomTom, CityMapper and Strava Metro (tested in this project).
Deployable sensing data
Summary: A range of data can be collected by sensors deployed directly to an event or space on a bespoke basis to monitor attendance or audience size. With the correct quality of data and coverage throughout the event, these sources can be used by analysts to predict attendance.
When to use it: Video footage from deployed cameras performs best at predicting attendance at one-off events, where large crowds are captured in images throughout the course of the event. In this research, it produced its most accurate results in predicting attendance at the ‘Les Giraffes procession’ event during the Bradford City of Culture where high quality drone footage was made available to the researchers. As these data sources need to be deployed in a bespoke manner for each event or space, collaboration from the event or location management will be necessary. Where available, sensing data can be useful for any large set-piece event, especially where the crowd is relatively static e.g. a concert.
Data sources and vendors: This type of sensing can be achieved through visual data, using cameras, CCTV or drones, or through Wi-Fi and radio signals.
3.B. Continuous identification of data sources
To ensure a sustainable and effective approach to using data-intensive methods for attendance prediction, it is essential to establish a process for continuously identifying, assessing, and integrating data sources, particularly as new sources continue to emerge. This section outlines key considerations for data science teams within organisations to set up systems for identifying new datasets as they become available.
3.B.i. Build and maintain a record of key data sources:
Maintaining a centralised and up-to-date record of key data sources can help to ensure that data is accessible and can be effectively utilised. Organisations should create a living document or database that catalogues all relevant data sources, detailing key metadata such as the data category (e.g., transport, social media, geo-location), vendor contact details, methods of access (e.g., API, bulk download), update frequency, associated costs, and any licensing or usage restrictions. It can also be helpful to track whether the data is static, such as census information, or dynamic, like live social media feeds.
Such records can also benefit from information about interoperability, noting which data sources can be integrated or used in combination with others to enrich analysis.
3.B.ii. Conduct horizon scanning for emerging data sources:
Regularly scanning for emerging data sources is good practice for keeping predictive capabilities up to date. Organisations should monitor developments in relevant industries such as transport, social media, geolocation, and deployable sensing technologies to identify new opportunities. This can best be achieved by subscribing to updates from key vendors. Maintaining active relationships with data vendors is particularly beneficial, as vendors can provide early insights into upcoming features or new datasets. Collaboration with academic and research institutions can also help uncover novel data sources or methodologies that are being explored in experimental or pilot studies.
3.B.iii. Develop strong relationships with data vendors:
Fostering strong relationships with data providers is key to ensuring ongoing access to valuable datasets and maintaining sustainable data-driven workflows. By establishing collaborative partnerships, organisations can negotiate agreements for continued data use, receive technical support, and adapt to changes in data availability over time. There are four key aspects to be aware of:
-
Engagement: Actively engage with data providers to clarify key details, including data availability, usage terms, and potential support options. Regular communication helps maintain trust, ensures transparency, and allows teams to discuss data quality, provide feedback on use cases, and explore opportunities for tailored features.
-
Contracts & Agreements: Establish formal agreements that define the terms of data access, including usage permissions, costs, data retention policies, and compliance requirements (e.g., GDPR obligations). Preferred partnerships can help secure priority access to data and more favourable commercial terms.
-
Technical Support: Setting up a dedicated support channel with data providers can help address technical issues and ensure data quality. Maintaining an open feedback loop also allows both parties to improve data reliability and usability over time.
-
Collaboration & Value Exchange: Sharing aggregated and anonymised insights from analyses with data providers can build goodwill, demonstrate the value of their data, and encourage further collaboration.
3.C. Defining event boundaries
Accurately defining event and site boundaries is essential when analysing participation, particularly when collecting geospatial data. The most reliable source for these boundaries is official data, provided by location owners or event organisers. For example, some events provide official boundary data in formats such as GPX files, which contain a trajectory of geographic points mapping a race route.
For race events, useful route data can often be obtained from official race websites (e.g., Edinburgh Parkrun), extracted from Google Maps (as a KML file and then converted), or sourced from platforms that collate race routes, such as Plotaroute. However, when official boundaries are unavailable—particularly when collecting data across a wide range of sites or events—secondary sources like OpenStreetMap offer a reproducible alternative to manually drawing boundaries. OpenStreetMap boundaries can be retrieved programmatically by querying its API, ensuring a systematic approach across multiple locations. In the process of obtaining mobile phone app data for training models in this research project, boundaries for around 300 sites were obtained from OpenStreetMap.
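Programmatic retrieval from OpenStreetMap is typically done through its Overpass API. The sketch below only constructs the Overpass QL query text; the site name is illustrative, and the actual HTTP call (shown commented out) would require the third-party `requests` library and network access:

```python
def overpass_boundary_query(site_name: str) -> str:
    """Build an Overpass QL query for ways and relations matching a named site."""
    return f"""
    [out:json][timeout:60];
    (
      way["name"="{site_name}"];
      relation["name"="{site_name}"];
    );
    out geom;
    """

query = overpass_boundary_query("Giant's Causeway")

# Hypothetical usage (requires network access and the requests library):
# import requests
# response = requests.post("https://overpass-api.de/api/interpreter",
#                          data={"data": query})
# geometry = response.json()
print("out geom" in query)
```

Matching on the `name` tag alone can return multiple or ambiguous results, so in practice queries are usually constrained further, for example by a bounding box around the expected location.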
When working with GPX or similar data formats, routes are typically represented as lines rather than areas. To make them usable for spatial analysis—such as filtering mobile phone or Strava data—these lines must be buffered to create an appropriate area. This process requires converting coordinates in degrees to a projected coordinate system that represents distances in metres (e.g., WGS 84 / World Mercator, EPSG:3395).
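In practice this buffering is usually done with GIS libraries such as Shapely and pyproj. To keep the sketch below dependency-free, it approximates the same idea in pure Python: project degrees to metres with a local equirectangular approximation, then test whether an observation lies within a buffer distance of any route segment. The route coordinates and buffer width are invented for illustration:

```python
import math

EARTH_RADIUS_M = 6_371_000

def to_local_metres(lat, lon, ref_lat, ref_lon):
    """Equirectangular projection of (lat, lon) to metres around a reference point."""
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_M
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    return x, y

def point_to_segment_m(p, a, b):
    """Distance in metres from point p to segment a-b (all in local metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the segment to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def within_buffer(point, route, buffer_m):
    """True if a (lat, lon) point lies within buffer_m metres of the route line."""
    ref_lat, ref_lon = route[0]
    p = to_local_metres(*point, ref_lat, ref_lon)
    segs = [to_local_metres(lat, lon, ref_lat, ref_lon) for lat, lon in route]
    return any(point_to_segment_m(p, a, b) <= buffer_m
               for a, b in zip(segs, segs[1:]))

# Hypothetical route fragment (two points roughly 1.1 km apart)
# and a nearby observation about 50 m off the line.
route = [(54.9783, -1.6178), (54.9883, -1.6178)]
print(within_buffer((54.9833, -1.6170), route, buffer_m=100))  # → True
```

The equirectangular approximation is adequate for buffers of tens or hundreds of metres around a local route; for larger areas or production use, a proper projected CRS via a GIS library would be the safer choice.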
4. Evaluating data sources
Accurately predicting attendance relies heavily on the quality and suitability of the data sources used. Analysts must therefore carefully evaluate which sources best meet their specific needs. The process of evaluating data sources is critical to ensuring that the insights derived are relevant, accurate, and actionable while remaining ethical and compliant with legal standards. This section (section 4) provides guidance to help analysts assess data sources and make informed decisions about their use. Three key areas are central to evaluating data sources.
The first is relevance: understanding which of the available data contains the information required to understand and measure participation at the event in question. Relevant factors include the nature of the event (sporting, cultural, location), attendee behaviour (movement, connectivity) and the required insights (demographics, time frame).
The second is comparative evaluation, which considers the granularity, frequency and accessibility of data. Here analysts must distinguish between multiple relevant data sources by considering their practical effectiveness and suitability for the use case, including dataset specificity, capture and update time frames, and the practical and financial barriers to access.
The final key area is ethical and legal suitability, which addresses whether the data source should be used, rather than whether it is practical or relevant. This covers proportionality (the use case context), legal compliance (regulatory alignment) and identifiable data (minimisation of personally identifiable information, PII).
4.A. Relevance
Analysts must assess whether a given data source is suited to the specific characteristics of the event, the behaviour of expected attendees, and the type of insights required. Analysts need to consider factors such as the nature of the event (sporting or cultural), the likely behaviour of attendees (e.g., their use of social media or GPS-enabled devices), and the type of insights required (e.g., demographic breakdowns, real-time data). Evaluating relevance also includes assessing the availability of historical or real-time data and the breadth of its coverage, particularly regarding demographic information it can communicate about attendees.
4.A.i. The Nature of the Event
The nature of event (sporting or cultural) significantly influences which data sources will be most relevant. Certain data types are inherently better suited to specific kinds of events. For example, activity data are particularly valuable for sporting events where participants are actively involved, such as marathons or cycling races. However, this data has limited utility for cultural events like concerts or festivals, where attendees are typically spectators rather than participants in sports. Social media data, on the other hand, offers more flexibility and can provide insights across a range of event types. The specific platform used matters when considering audience demographics. Facebook, for instance, tends to capture an older audience, while Instagram is favoured by younger users. Analysts need to align the choice of social media data with the demographics of the event’s expected attendees. The coverage of data sources such as mobile location data and deployable sensing data from drones tends to be comprehensive across demographic groups.
4.A.ii. Attendee Behaviour
The behaviour of attendees plays a crucial role in determining the relevance of certain data sources. The utility of mobile data or GPS-based location data depends on whether attendees are likely to have their phones turned on and connected to a network. This can be a challenge in highly rural areas or indoor venues with poor signal, where mobile data offers less comprehensive coverage.
The setting of the event (indoors, outdoors, or in a remote location) impacts the effectiveness of certain data sources. Drones are rarely deployable to indoor events. Similarly, CCTV data is more effective for urban settings but less so for rural or open-air locations.
4.A.iii. The Type of Insights Required
The insights analysts hope to derive from the data will also shape which sources are most relevant. Where the goal is to understand demographic breakdowns, social media data, aerial drone footage, or mobile app data alone may not suffice. These sources can provide valuable information about activity and movement patterns but often lack detailed demographic information. In such cases, combining these dynamic data sources with static data, such as census data or local demographic surveys, provides a fuller picture.
The time frame for data collection also matters. If insights are needed over an extended period, GPS or location data, which can provide historical movement patterns, may be most suitable. Social media data often offers a snapshot of activity, making it better suited for real-time analysis or short-term trends. Real-time insights from platforms like Instagram, or from drone or camera live feeds, can be valuable, whereas X (Twitter) and location data are typically only available retrospectively. This distinction is particularly important for events that require monitoring as they happen, such as ensuring crowd safety or managing transport logistics.
4.B. Comparative evaluation: Granularity, frequency and accessibility
Selecting between a series of relevant data sources requires evaluating their suitability in three main ways: granularity, frequency, and accessibility. This framework allows for a structured approach to understanding the granularity of the data (how detailed it is) and the frequency with which it is collected (daily, weekly, monthly etc).
High-granularity, high-frequency data is necessary for estimating attendance at shorter or one-off events, whereas data collected consistently over an extended time can help predict attendance at long-running events or spaces. It also highlights the importance of accessibility, encompassing both the practical ease of obtaining the data (e.g., straightforward API access) and financial considerations, such as licensing costs.
Figure 2: Granularity/Frequency data chart
4.B.i. Granularity
Granularity refers to the level of specificity in a dataset. Datasets with finer granularity, such as those providing data at the level of individual postcodes or attendee movements, can generally be prioritised for their ability to support precise attendance predictions. Higher granularity can also refer to temporal resolution (e.g., minute-by-minute vs. daily trends), data precision (e.g., exact GPS coordinates vs. broader zones), data richness (e.g., a social media post with text, image, and location vs. just a check-in), and sensor detail (e.g., high-resolution aerial imagery vs. lower-quality footage). In the case of activity data, it could mean individual running paths vs. aggregated heatmaps; for social media, tagged locations vs. inferred presence; and for aerial data, pixel-level clarity vs. broader object detection.
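The temporal-resolution point can be illustrated by down-sampling minute-level observations to daily totals, trading detail for stability. The sketch below uses invented visitor counts and only the standard library:

```python
from collections import defaultdict
from datetime import datetime

# Invented minute-level visitor counts: (timestamp, count).
observations = [
    ("2024-09-08 10:01", 40),
    ("2024-09-08 10:02", 55),
    ("2024-09-09 09:30", 20),
]

# Aggregate to daily totals (coarser temporal granularity).
daily_totals = defaultdict(int)
for timestamp, count in observations:
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date()
    daily_totals[day] += count

print(dict(daily_totals))
```

Aggregation in this direction is always possible; the reverse is not, which is why the granularity of the raw feed sets a hard ceiling on the temporal detail any downstream analysis can offer.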
Mobile app location data utilised in this project, for example, was highly granular and provided detailed movement insights. However, it is important to note that this data is anonymised and does not track individual participants. In contrast, social media and activity or transport data were found to be less granular, offering only aggregated or abstracted figures. While video footage from CCTV or drones can be useful for real-time visual counts, it typically lacks accompanying metadata such as demographic details or broader context, limiting its ability to inform attendance trends beyond what is immediately observable.
4.B.ii. Frequency
Frequency in this context refers to how often data is captured and updated. High-frequency datasets, such as those collected daily or even in real-time, are highly valuable for estimating attendance at short-term or one-off events. Social media data, for example, is often updated continuously and – depending on the platform – can provide insights during and immediately after an event. Deployed video footage, in cases where it is accessible with live feeds, offers high-frequency data suitable for capturing real-time trends, though this is rare.
In contrast, datasets like mobile app location data, which may update less frequently or provide aggregated patterns over longer periods, can be well suited for ongoing or longer-term attendance predictions, such as estimating visits to a museum over a year. The required frequency of data collection will depend heavily on the nature and duration of the event being analysed.
4.B.iii. Accessibility
Accessibility is an assessment of the practical and financial steps associated with acquiring and using a dataset. This includes considerations such as technical challenges, licensing agreements, and associated costs. Datasets requiring complex API access or advanced technical skills for retrieval may pose barriers to usability for certain project teams.
Cost is another important factor. Data sources with high subscription fees, exclusively long-term contracts or restrictive licensing arrangements may be unsuitable for shorter-term predictions but could remain viable for longer-term departmental use. Some datasets will also be more computational resource intensive to analyse and therefore more expensive. Acceptable cost thresholds will vary project-by-project and department-by-department.
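One lightweight way to structure a comparison across granularity, frequency and accessibility is a weighted scoring sheet. The weights and scores below are purely illustrative (a hypothetical project that prioritises granularity), not an assessment from this research:

```python
# Illustrative comparative scoring of candidate data sources (1 = poor, 5 = strong).
weights = {"granularity": 0.5, "frequency": 0.3, "accessibility": 0.2}

candidates = {
    "mobile app data": {"granularity": 5, "frequency": 3, "accessibility": 3},
    "social media data": {"granularity": 3, "frequency": 5, "accessibility": 4},
    "activity data": {"granularity": 3, "frequency": 3, "accessibility": 5},
}

def weighted_score(scores):
    """Weighted sum of a candidate's scores across the three criteria."""
    return sum(weights[criterion] * value for criterion, value in scores.items())

ranked = sorted(candidates,
                key=lambda name: weighted_score(candidates[name]),
                reverse=True)
print(ranked[0])  # → mobile app data
```

The value of such a sheet is less the final number than the forced conversation about weights: a project needing real-time insight would shift weight from granularity to frequency and could arrive at a different shortlist from the same scores.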
4.C. Ethics and legal
It is critical to apply a set of absolute measures that address broader ethical, legal, and proportionality considerations. Analysts must ensure that all data usage complies with applicable privacy laws, such as the General Data Protection Regulation (GDPR), particularly when working with personal or location-based data. In this research, personal data was not used. Instead, all data was anonymised or aggregated to minimise privacy risks. Additionally, in some cases, it is necessary to verify that individuals have provided informed consent for their data to be used in predictive analysis by the data brokers in question.
4.C.i. Proportionality and Use Case Context
The proportionality case for data acquisition and use in any modelling approach depends heavily on the application. Analysts must consider the importance of the use case and use the least invasive data that will deliver the requirement. It is essential that all actions align with the 7 key UK GDPR principles, ensuring that data is processed lawfully, fairly, and transparently, with purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, and accountability. For critical applications, such as ensuring public safety at large-scale events or responding to potential crowd control challenges, more data-intensive collection methods may be justifiable. Conversely, for less critical applications, such as estimating attendance for cultural or recreational purposes, the collection of highly granular or sensitive data may not be appropriate.
Teams should carefully weigh the intrusiveness of the data against the importance of the use case, ensuring that the methods employed align with the potential risks and benefits. A good rule of thumb is that if you do not need to retain personal data to fulfil a particular use case, it is better not to. This will help to ensure compliance with key UK GDPR principles, particularly data minimisation. Where it is necessary to hold personal data, appropriate review, retention and deletion (RRD) policies should be put in place.
4.C.ii. Legal and Compliance
All data collection and usage must comply with applicable legal regulations, particularly GDPR, when personal data is involved. This is particularly relevant for datasets such as GPS data or social media data, which may include identifiable personal information. To ensure compliance, teams should consult with internal legal and compliance functions before integrating such datasets. Key actions include evaluating the necessity of personal data, identifying and mitigating privacy risks, and implementing data protection measures, such as removing low counts and applying anonymisation techniques.
These steps should be recorded, reported to and agreed with the relevant Data Protection Officer in a data protection impact assessment (DPIA). This assessment should be completed before data is acquired and updated throughout the project lifecycle as needed to keep internal compliance teams up to date. This process can be lengthy owing to the detail required, so should be completed well in advance of acquiring or ingesting relevant data.
4.C.iii. Identifiable Data
Where personal data is used, minimising identifiable information becomes key to reduce privacy risks. Teams should anonymise or aggregate data wherever possible. For example, GPS or activity data can be processed to remove individual-level detail and focus on aggregate patterns. For datasets like social media activity or video footage, identifiable information such as names or faces should be removed or obscured using regex techniques. For example, Facebook or X (Twitter) posts can be stripped of usernames and other identifying details and teams using video footage should seek to blur or remove faces or other identifiable features in CCTV or drone footage.
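As an illustration of the regex-based stripping described above, the sketch below removes @-mentions, links and email addresses from post text. This is a minimal example, not a complete anonymisation routine: real posts may contain further identifiers (names in free text, phone numbers) that need additional rules or manual review.

```python
import re

def strip_identifiers(post: str) -> str:
    """Remove common identifying details from a social media post.

    Illustrative patterns only; order matters (URLs and emails are
    replaced before @-mentions so their '@' parts are not clobbered).
    """
    post = re.sub(r"https?://\S+", "[link]", post)                   # URLs, which may embed profile names
    post = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[email]", post)   # email addresses
    post = re.sub(r"@\w+", "[user]", post)                           # @-mentions / usernames
    return post

print(strip_identifiers("Great run @jane_doe! Photos at https://example.com/jane"))
# Great run [user]! Photos at [link]
```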
Under no circumstances should facial recognition technology be used in this context. This toolkit, as well as the research underpinning it, was developed without the use of such methods to safeguard privacy and maintain public trust.
5. Accessing data sources
Accessing, preparing, and managing data effectively are key steps in leveraging data sources for attendance estimation. This section provides guidance on how teams can navigate the technical, organisational, and logistical aspects of data acquisition. It covers two broad methods of obtaining data: (A) access through application programming interfaces (APIs) and (B) direct download access. It concludes with reflections on how to prepare data for handling.
5.A. Application programming interface (API) access
Many data providers make their datasets available through APIs, allowing users to access data in real-time or at regular intervals. Before integrating an API into your workflows, it is essential to thoroughly review the data it provides and ensure that its use complies with licensing agreements and terms of service. Other considerations include whether the data meets the specific requirements of your use case, such as the granularity, frequency, or timeliness of the stream.
To access data securely via an API, users must typically configure API keys or other authentication mechanisms, requested from the platform provider, usually for a licence fee. These keys act as unique identifiers that authenticate your access and ensure that data requests are securely transmitted. Good practices for managing API keys include storing them in a secure environment and avoiding public sharing. Detailed guidance on setting up API keys and authentication can be found in the linked Technical Documentation for this project, which also includes examples of how the researchers behind this toolkit implemented secure API access in code, with associated documentation for reference.
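As a sketch of the key-storage practice described above, the snippet below reads a key from an environment variable rather than hard-coding it in source. The variable name `DATA_API_KEY` and the Bearer authentication scheme are placeholders; substitute whatever your provider's documentation specifies.

```python
import os
import urllib.request

def build_authenticated_request(url: str) -> urllib.request.Request:
    """Attach an API key from the environment, never from source code.

    'DATA_API_KEY' and the Bearer scheme are illustrative assumptions;
    a secrets manager is preferable in shared or production settings.
    """
    api_key = os.environ.get("DATA_API_KEY")
    if not api_key:
        raise RuntimeError("Set DATA_API_KEY in the environment (or a secrets manager)")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
```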
To further support teams, outlined below are some of the key steps taken to access the datasets via APIs in this research. Where relevant, lessons learned from working with these specific datasets are set out.
Pulsar tips and tricks
The Pulsar social media listening platform can be used to collect social media activity. Setting up queries to filter relevant posts is straightforward on the platform, supported by an intuitive user interface and workflow.
Data is either collected via keywords (for example whether the social media post contains a specific term) or via a channel (for example, all messages within a specific channel or page). The type of data and its volume available for collection via Pulsar differs depending on the social media platforms’ Terms & Conditions. The key steps in accessing Pulsar data are as follows:
- Define the query: a) For most data sources, Pulsar requires a Boolean query. b) Facebook: convert your Boolean query into keywords separated with a comma for the AND operator or separated with a newline for the OR operator. c) Instagram: input only hashtags, e.g. ‘greatnorthrun’.
- Select your query dates, noting that only live data can be collected for Instagram, and only live data and that from the last 30 days can be collected for Facebook.
- Launch the query / pull the results into Pulsar.
- Download the results using the Pulsar API, noting that for live queries you will need to allow the search to run for the desired time beforehand.
When formulating a Boolean search query, it is essential to refine your search to avoid excessive and noisy returns, which may be both superfluous for modelling and expensive. This is particularly important given the cap on monthly downloads that data providers often apply. Tips for generating a Boolean query for your event or attraction:
- Include all possible names for the event, e.g. ‘British Museum’, ‘The British Museum’, ‘BM London’, ‘britishmuseum’, ‘#BritishMuseum’.
- Include possible key words, e.g. ‘Women’s Euros Screening’, ‘Fanzone’, ‘Goal scorer’s name’, ‘Team names’, ‘Location information’, etc.
- If there is a space in the phrase, put it in quotes, e.g. “event name”.
- Include attractions, exhibits or performers at or within the event / location, e.g. a popular painting in an art gallery, an attraction or monument in a national park, or a music artist at a festival.
- Consider combining fewer specific details with the location, e.g. ‘eventdetail AND location’.
- Use parentheses to group search terms, e.g. ‘(eventname OR eventdetail) AND location’.
- Use the NOT operator to exclude irrelevant posts, e.g. ‘eventname NOT “irrelevant term” NOT news’.
- Add ‘AND NOT RT’ to exclude retweets.
- If you are struggling, a conversational AI such as ChatGPT, Google Gemini or Microsoft Copilot can provide a good starting point, e.g. by asking “Please generate a Boolean query of search terms to identify social media posts for X event.” These platforms also offer a good way to sense-check your query.
An example query for the Women’s Euro 2022 Final Screening at Manchester Fan Zone:
((manchester OR “piccadilly gardens” OR “picadilly gardens” OR “picadily gardens” OR trafford) AND (euros OR final OR lionesses OR england OR germany OR wembley OR kelly OR toone OR williamson OR uefa OR eng OR ger OR weuro2022 OR lucybronze OR ellatoone99 OR leahcwilliamson OR itscominghome OR “its coming home” OR “it’s coming home”) AND (screening OR “big screen”) OR fanzone OR “fan-zone” OR fanpark OR “fan zone” OR “fan park”) AND NOT RT
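The grouping and exclusion rules above can be sketched as a small helper that assembles a query from lists of terms. This is illustrative only; the term lists and operators should follow the tips above and the target platform's query syntax.

```python
def build_boolean_query(location_terms, event_terms, exclude_terms=("RT",)):
    """Assemble a grouped Boolean query: multi-word phrases are quoted,
    alternatives are OR-ed within parentheses, and excluded terms
    (retweets by default) are appended with AND NOT."""
    def group(terms):
        return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"
    query = f"{group(location_terms)} AND {group(event_terms)}"
    for term in exclude_terms:
        query += f" AND NOT {term}"
    return query

print(build_boolean_query(["manchester", "piccadilly gardens"], ["euros", "fanzone"]))
# (manchester OR "piccadilly gardens") AND (euros OR fanzone) AND NOT RT
```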
Strava metro online dashboard tips and tricks
The Strava Metro dashboard can be used to access aggregated cycling and running activity data from Strava apps. The dashboard’s API can be used to request data for specific regions and then downloaded directly from the dashboard.
The key steps in accessing Strava data are as follows:
- Select Area of Interest on Dashboard: Use the dashboard to select a shape that covers your area of interest. Selecting the right area is time-consuming but crucial: ensure your boundary fully captures the race route. Full guidance for defining boundaries is included at the end of Section 3 (‘Identifying data sources’) of this toolkit. Add a buffer zone around the route to include all runners, especially at start and finish zones, where race data may drift.
- Download the Data: Once the area and timeframe are selected, submit a request to download the data. At this stage it is important to check that the data is organised to count unique ‘runners’ rather than total ‘route’ traversals, which can be selected directly via the dashboard. The data download process is slow and can take 1–2 days, depending on the area size and number of requests. For this reason, it is best to submit requests overnight or before weekends. Large areas or long timeframes will significantly increase wait times, so test with a smaller sample first.
- Understand the Downloaded Data: The download includes a CSV file containing count data for each path and a .SHP (shapefile) containing the geographical coordinates of each path.
- Filter Paths to Your Specific Area: The selected area on the dashboard may cover a larger region than needed. Use the .SHP file to filter and extract only the paths relevant to your area of interest.
- Consider Boundary Issues: Strava Metro breaks areas down by region (or borough in London). If analysing routes that cross boundaries (e.g. the London Marathon route), you may need to combine data from multiple regions.
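The path-filtering step can be illustrated with a simplified sketch. A real workflow would filter the .SHP file with GIS tooling (for example geopandas); here, as assumptions for illustration, paths are represented as plain coordinate lists and the area of interest as a bounding box.

```python
def filter_paths_to_area(paths, min_lon, min_lat, max_lon, max_lat):
    """Keep only paths whose every coordinate falls inside a bounding box.

    Each path is a list of (lon, lat) tuples, a simplified stand-in for
    shapefile geometries; proper GIS filtering would use the .SHP file.
    """
    def inside(lon, lat):
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    return [path for path in paths if all(inside(lon, lat) for lon, lat in path)]

paths = [[(0.5, 0.5), (0.6, 0.6)],   # fully inside the unit box -> kept
         [(2.0, 2.0)]]               # outside -> dropped
print(filter_paths_to_area(paths, 0.0, 0.0, 1.0, 1.0))
# [[(0.5, 0.5), (0.6, 0.6)]]
```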
5.B. Direct download access
Where not available via an API, data may be accessed through direct downloads or by collaborating closely with data providers to obtain static datasets or sensitive media (as in the case of aerial photography). In the research conducted for this project we took this approach with two data sources - location data shared by Huq and photographic or video data. Here there are also learnings that can be extrapolated to other data sources.
Huq tips and tricks
The Huq dashboard can be used to obtain anonymised app location data for sites of interest. The dashboard is set up for requested sites and contains various features such as footfall counts, dwell time, catchment of visitors (where users come from), and spatial patterns of use. Most of this data is downloadable via the dashboard in an aggregated format. Huq are also able to provide bespoke data for more specialised applications which are not products on the dashboard – subject to direct engagement and consultation with their team.
Some important steps and considerations when obtaining this data:
- The way in which event or site boundaries are defined geographically is an important step in the data collection process. It is possible to provide official boundaries or draw boundaries using a tool provided by Huq, but this choice will affect the data extraction process.
- Ensure the data requested meets the granularity and frequency requirements of the event or site of interest. Note that shorter events may lead to challenges, given the methodology developed in this project requires data for a minimum 24-hour period.
- Not all information is downloadable from the dashboard. It may be necessary to obtain certain information, such as spatial patterns of use or weighted estimates of visitation, directly from Huq.
- The dashboard is generally suitable for the study of a small number of events or sites. When collecting data for a large sample of sites, as in the training data for this project, it is advisable to obtain aggregate data directly from the data provider.
There are also some key steps analysts should consider to limit bias in mobile app datasets:
- Filtering for GPS Accuracy: Data can be filtered based on an accuracy threshold to exclude low-precision data points from analysis.
- Adjusting for Sample Bias: Differences in the coverage of the mobile phone population across geographic areas should be compared to known population distributions, allowing for the incorporation of these differences into scaling visits to an event.
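A minimal sketch of both steps, assuming each GPS point carries a device identifier and a reported accuracy in metres, and that a panel coverage rate (the share of the local population present in the mobile panel) has already been estimated by comparison with known population distributions:

```python
def clean_and_scale(points, max_error_m, coverage_rate):
    """Drop low-precision GPS points, then scale the remaining unique-device
    count by the panel's estimated population coverage.

    `coverage_rate` is a hypothetical input derived elsewhere by comparing
    panel counts with known population distributions.
    """
    kept = [p for p in points if p["accuracy_m"] <= max_error_m]
    unique_devices = len({p["device_id"] for p in kept})
    return unique_devices / coverage_rate  # estimated real-world visitors

points = [
    {"device_id": "a", "accuracy_m": 12.0},
    {"device_id": "a", "accuracy_m": 8.0},    # same device, counted once
    {"device_id": "b", "accuracy_m": 250.0},  # too imprecise, excluded
    {"device_id": "c", "accuracy_m": 30.0},
]
print(clean_and_scale(points, max_error_m=50.0, coverage_rate=0.02))
# -> roughly 100 estimated visitors (2 devices / 2% coverage)
```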
Common datasets that may be accessed directly also include video footage or photographic data from CCTV systems or drones. Such datasets may include sensitive information, necessitating stringent data handling and storage practices to access. Direct access may also encompass static datasets such as footfall estimates derived from camera feeds or other systems, which are typically provided as pre-processed files.
Collaborating with the data provider is crucial when accessing data directly. This collaboration often involves clearly defining the data requirements, establishing a secure method for transferring the data, and ensuring compliance with legal and regulatory frameworks, particularly when the data contains personally identifiable or sensitive information. In most cases email will not be secure enough for these purposes and will be unable to transfer files of the requisite size; alternative data-sharing solutions are listed below. To illustrate the process and support teams, some tips and tricks are included below for one data type.
Tips and tricks for accessing and handling video footage from CCTV systems and drones
Video footage from CCTV cameras or drones deployed at a public event can be used to analyse crowd patterns and attendance:
- Establish Permissions: Obtain explicit legal consent from event organisers before accessing CCTV footage, or from the providers of the drone footage. Ensure compliance with data protection laws and event-specific regulations regarding video analysis, including completing a Data Protection Impact Assessment (DPIA), a requirement under GDPR.
- Secure Data Transfer: Use a permission-based file-sharing system to transfer video securely; suitable solutions include Microsoft OneDrive, Google Drive, Dropbox Business, WeTransfer and Kiteworks. Ensure encryption and access control measures are in place to prevent unauthorised access.
- Data Anonymisation: If required, apply blurring or masking techniques to anonymise individuals in the footage.
- Storage and Retention Policies: Store footage in a secure, access-controlled environment. Define a clear retention policy and delete footage after analysis unless it is explicitly needed for further study. Ensure compliance with GDPR or other relevant data protection laws.
- Video Resolution and Format Considerations: Ideal resolution: 1080p or 720p for the best balance between quality and processing speed. 4K footage: can be slow to process but can be downscaled using FFMPEG if necessary. Lower-quality footage: may still be usable if the crowd is not too dense or the camera is not positioned too high. Frame rate: not a critical factor for crowd estimation. Preferred format: MP4 is ideal, but other formats can be converted using FFMPEG if needed.
- Preprocessing for Crowd Size Estimation: Extract individual frames and images before estimating crowd size. Group frames by shots (continuous segments of filming). Select a representative sample of roughly 4 frames per second for analysis.
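Frames themselves can be extracted with FFMPEG (for example, its `fps` video filter writes a fixed number of frames per second). The stride arithmetic behind the roughly-4-frames-per-second sampling described above can be sketched as:

```python
def sample_frame_indices(total_frames: int, video_fps: float, sample_rate: float = 4.0):
    """Select roughly `sample_rate` frames per second of footage for crowd
    counting, by stepping through the frame index at a fixed stride."""
    stride = max(1, round(video_fps / sample_rate))
    return list(range(0, total_frames, stride))

# A 2-second clip at 24 fps sampled at ~4 frames/second -> every 6th frame.
print(sample_frame_indices(total_frames=48, video_fps=24))
# [0, 6, 12, 18, 24, 30, 36, 42]
```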
5.C. Preparing for data handling
Data often requires preparation to make it usable for analysis. This involves transforming raw datasets into appropriate formats and structures, ensuring consistency and compatibility with analytical tools and methods. Guidance in this area supports teams to manage diverse data types while maintaining data quality, across three main steps: (1) Confirming data format, (2) Removing noise, (3) Building data pipelines.
5.C.i Confirming data format
The first step in preparing data is to confirm its format. Common data formats include JSON, CSV, geospatial data files, or image and video formats for media. Researchers must verify that the format is compatible with the tools and workflows they intend to use. For example:
- JSON files might be used for data from APIs, which can be parsed and loaded into a database or programming environment like Python.
- CSV files are often straightforward to work with, particularly for structured tabular data.
- Geospatial data might require specialised tools, such as GIS software.
Data accessed through enterprise platforms often arrives in a structured, standardised format, minimising the need for extensive data engineering. However, for datasets acquired from diverse or less curated sources, additional preprocessing steps are often required.
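For example, a JSON payload returned by an API can be parsed with the standard library and flattened to CSV for tabular tools. The payload below is invented for illustration; real responses will differ in structure and field names.

```python
import csv
import io
import json

# A miniature, hypothetical API payload.
payload = '{"records": [{"site": "museum", "visits": 1200}, {"site": "park", "visits": 450}]}'

records = json.loads(payload)["records"]   # JSON -> list of dicts

# Flatten to CSV for spreadsheet or dataframe tools.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["site", "visits"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```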
5.C.ii. Removing noise
In many cases, raw datasets include a significant amount of noise—unnecessary, irrelevant, or misleading information that can interfere with analysis. Social media data is notoriously noisy and presents several challenges to analysts seeking to use it to predict event attendance:
- Implicit Messaging: Social media often does not reflect every real-world event. Many users read content passively without engaging, meaning some activity related to an event may remain invisible. Similarly, not all participants in an event post content, creating gaps in representation.
- Variable Locationality: Not every user commenting on or sharing about an event is physically present at the event. Many users may comment remotely, which can distort location-based analysis. This is particularly challenging for live events with a large live television or internet audience.
- Misuse and Spam: Social media platforms may be targeted by bots, spammers, or even state actors spreading misinformation or disinformation. While this is generally less of an issue for events, it can still create noise that skews the dataset.
To mitigate these challenges, noise should, where possible, be carefully removed. A well-formulated keyword search strategy can filter out irrelevant content and reduce noise. This process is outlined in Section 5.A of this toolkit, on application programming interface access.
Mobile phone app data presents specific challenges that must be carefully managed to ensure more accurate analysis:
- GPS Accuracy of Data Points: The accuracy of location data varies depending on signal strength, satellite visibility, and device settings. Some data points may have high precision, while others are significantly less reliable.
- Sample Bias: The mobile phone population does not perfectly align with the broader population. Certain demographic groups may be underrepresented, and mobile phone penetration varies geographically.
- Time Series Analysis: When analysing data within small spatial and temporal windows, the dataset becomes sparser, increasing the risk of bias in trends.
- Device Measurement Bias: Differences in data collection methods across mobile apps, including variations in user consent and data-sharing practices, lead to inconsistencies in tracking movement patterns.
By implementing some mitigation strategies, analysts can improve the reliability of mobile phone app data for mobility and event analysis:
- Using Wider Aggregation Windows: When analysing time series data, it is advisable to use larger time windows or aggregate data across larger geographic areas to improve robustness.
- Standardising Measurement Approaches: Aggregating user data daily ensures that each user is counted only once per day per site, minimising inconsistencies caused by variations in data collection processes.
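The daily-deduplication approach can be sketched as follows, assuming each observation records a user identifier, a site and a date:

```python
from collections import defaultdict

def daily_unique_users(events):
    """Count each user at most once per day per site, smoothing over
    differences in how often individual apps report locations.

    `events` is a list of (user_id, site, date) observations.
    """
    seen = defaultdict(set)
    for user_id, site, date in events:
        seen[(site, date)].add(user_id)
    return {key: len(users) for key, users in seen.items()}

events = [
    ("u1", "museum", "2025-06-01"),
    ("u1", "museum", "2025-06-01"),  # same user, same day: counted once
    ("u2", "museum", "2025-06-01"),
    ("u1", "museum", "2025-06-02"),
]
print(daily_unique_users(events))
# {('museum', '2025-06-01'): 2, ('museum', '2025-06-02'): 1}
```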
5.C.iii. Building data pipelines
To scale data usage effectively, building robust pipelines is important. Pipelines can help automate the processes of fetching, processing, and storing data, allowing teams to manage large datasets more easily.
- Automated Pipelines: Pipelines can be configured to automate data acquisition, such as scheduling regular downloads from APIs or other sources. Tools like Apache Airflow or cloud-based workflow schedulers can be used to streamline these tasks.
- Scalable Systems: If high data volumes are anticipated, scalable systems like distributed cloud storage or big data processing frameworks (e.g. AWS S3, Google BigQuery) should be used to manage and store data efficiently. These systems provide the flexibility to scale as the volume of data grows.
- Machine Learning Pipelines: For projects leveraging AI techniques, it is important to build ML-specific pipelines. Frameworks like Hugging Face or TensorFlow Extended (TFX) can be used to preprocess data, train models, and integrate them into workflows. These pipelines interact with data by automating tasks such as feature engineering, model deployment, and monitoring, enabling teams to focus on generating insights and refining their models.
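The fetch-process-store flow can be sketched as a simple composition of stages. This shows only the data flow itself; orchestrators such as Apache Airflow would handle scheduling, retries and monitoring around steps like these. All stage names below are hypothetical stand-ins.

```python
def run_pipeline(fetch, transforms, store):
    """Chain the stages of a minimal batch pipeline: fetch raw records,
    apply each cleaning/processing step in order, then persist."""
    data = fetch()
    for step in transforms:
        data = step(data)
    return store(data)

# Hypothetical stages standing in for an API pull, a noise filter,
# and a database or cloud-storage write.
raw = lambda: [{"site": "museum", "visits": 1200}, {"site": "museum", "visits": -1}]
drop_invalid = lambda rows: [r for r in rows if r["visits"] >= 0]
to_store = lambda rows: rows  # replace with a real persistence step

print(run_pipeline(raw, [drop_invalid], to_store))
# [{'site': 'museum', 'visits': 1200}]
```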
6. Data strategy summary
The research underpinning this toolkit delivered four broad methodologies for predicting attendance, each leveraging various data sources to predict attendance at five target events. Those methodologies are outlined separately in the following reports, each focusing on one of the target test events from this research project: British Museum, Bradford City of Culture, Great North Run, Giant’s Causeway, and Women’s Euros Final Screening.
Contained in the table below is a summary overview of the strengths and weaknesses of the four key data types explored in our research. It expands on the summary information provided in the executive summary of this report. They are assessed against a series of metrics that relate to: (i) The breadth of event sizes, durations, types and locations they can be used to analyse, (ii) The budgetary, ethical and accessibility considerations associated with that data source. For ease of use a RAG (red, amber, green) rating is also applied, where:
- Red = Significant limitations with data type.
- Amber = Some limitations with data type.
- Green = No limitations or minimal limitations with data type.
Social Media
Event Size and Coverage - AMBER: Optimum for medium to large events, as a minimum event attendance is necessary to create a digital footprint that will register on social media platforms. An event must produce a minimum of several hundred posts, accounting for approximately 5% of attendees, to be usefully predicted with social media data.
Event Duration Coverage - GREEN: Social media methodologies are particularly useful for one-off events, where other methods (such as location data) can struggle. Methodologies leveraging this data type can also be used to make longer term attendance predictions at events that recur over a longer period of time.
Event Type Coverage - GREEN: Suitable for sporting or cultural events.
Event Location - GREEN: Useful for predicting attendance at a wide range of locations.
Budget - GREEN: Costs vary by plan, depending on the Pulsar platform tier and the number of downloads anticipated. The subscription plan used in this project allowed for hundreds of thousands of queries at a cost in the low thousands. Most methodologies seeking to exploit this data are likely to require server costs, as LLMs will be used to categorise social media posts.
Ethics - AMBER: Social media data in its native format is liable to include personally identifiable information in user profiles and images. This information should be anonymised to manage risk.
Data Access - AMBER: Accessing social media data will normally require interacting with platforms’ APIs. In most cases, this is best approached through a managed tool whose provider handles compliance and governance requirements as part of the service. The Pulsar API was used for this reason in this research.
Mobile App Data
Event Size and Coverage - GREEN: Location-based mobile data can be useful for predicting attendance at events of all sizes, where attendees have a network-connected mobile device.
Event Duration Coverage - AMBER: This data type is most useful for predicting attendance at events or locations that run over longer periods of time. For example, sites explored for this research that are open all year round, like the British Museum or Giant’s Causeway, are good candidates. This is primarily because the shortest period into which location data is broken down is 24 hours.
Event Type Coverage - GREEN: Suitable for sporting or cultural events but requires attendees to have access to mobile phones during the event.
Event Location - GREEN: Useful for predicting attendance in any area with mobile phone signal and where people will have phones turned on. This may be impeded by indoor or poorly connected rural locations.
Budget - GREEN: Some cost in the low thousands per location, depending on time-period over which data will be collected.
Ethics - AMBER: Raw data may be considered a form of personal data, while aggregated data poses fewer ethical risks. Companies like Huq collect location data with user agreement, though its classification as active consent could be debated, given that mobile users often accept app terms and conditions without fully reading them.
Data Access - AMBER: Data availability is wholly contingent on whether event attendees have network-connected mobile devices.
Activity Data
Event Size and Coverage - GREEN: Optimum for sporting events of any size, where those individuals who are in scope for attendance counts have recorded their participation in the event on an activity-tracking app like Strava.
Event Duration Coverage - AMBER: Best suited for predicting attendance at one-off sporting events or repeat sporting events (such as park runs or marathons) over defined time periods e.g. the duration of the target race event.
Event Type Coverage - AMBER: Best suited for sporting events, such as running, hiking, golf or cycling. Data recorded for walks and cycle trips flagged as commutes can potentially be used for non-sporting events.
Event Location - GREEN: Useful for predicting attendance at sporting events in indoor or outdoor settings, so long as participants track attendance on Strava.
Budget - GREEN: Subscriptions to Strava Metro are free for research organisations, urban planners and government authorities. HMG already has a subscription available for all UK sites. Additional data sources, such as weather, were accessed for less than £100 total.
Ethics - GREEN: Strava data is heavily anonymised and aggregated, meaning it is extremely hard to identify any individuals. The user base of Strava skews towards younger and more active demographics; modelling approaches using this data will be biased towards these groups and can be expected to perform best in predicting attendance at events favoured by them.
Data Access - AMBER: Where participants at a sporting event have recorded their activity on Strava, the Strava Metro dashboard can be accessed easily to download aggregated data pertaining to that event. Downloading data from the Strava Metro platform, however, can be time-consuming.
Aerial Photography
Event Size and Coverage - GREEN: Most useful for large crowds, where attendance is otherwise hard to identify by human counting.
Event Duration Coverage - AMBER: Best suited for one-off events, where a camera accurately captures large sections of the crowd. In theory, aerial imagery can be used at events of any duration, but this requires footage that evenly captures the event over time, which is rare with the exception of fixed CCTV footage (not accessed for this project).
Event Type Coverage - GREEN: Suitable for any sporting or cultural event, where high vantage aerial crowd footage has been collected.
Event Location - AMBER: Typically most useful for outdoor events, though some unique landscapes may lead to errors in counting, as in the case of the Giant’s Causeway, where object detection models struggled to differentiate people from the rocky features.
Budget - GREEN: The technology needed to count attendance from video footage is not expensive to implement, but footage may potentially need to be purchased in many cases. Long processing times should also be expected, especially for longer videos, though compute costs are also generally quite low.
Ethics - AMBER: In some cases, video footage will require cleaning to reduce the risk that individuals are identifiable in content. In most cases, individual people are unlikely to be personally identifiable from aerial footage, and without deliberate attempts to identify individual members of the crowd there is very little risk of this taking place.
Data Access - RED: Requires high-vantage aerial photography taken throughout the event, e.g. from a drone or CCTV (though CCTV tends to have lower output quality).
7. Evaluating performance
This section outlines some of the key considerations for analysts when evaluating the performance of the methodologies developed. Below seven evaluation criteria are outlined:
Accuracy
Measure of how close the predictions are to the actual known baseline attendance numbers, reflecting the correctness of the model’s predictions. Key considerations include:
- It is important to assess accuracy against known attendance baselines, to robustly test performance.
- Key measures of accuracy include mean absolute error (the average difference between actual and predicted attendance, measured in the same units as the data) and mean percentage error (the same measure, expressed as a percentage of the actual attendance).
- Where a model demonstrates high accuracy on historical data but performs poorly on new datasets, it may be overfitting to historical data, reducing its predictive power. Here analysts should consider techniques like cross-validation or simplifying the model.
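Both measures described above can be computed in a few lines. Note that the percentage version is more commonly known as the mean absolute percentage error (MAPE).

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between predicted and actual attendance."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_percentage_error(actual, predicted):
    """The same gap expressed as a percentage of actual attendance."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

actual = [1000, 5000, 20000]
predicted = [900, 5500, 19000]
print(mean_absolute_error(actual, predicted))             # ~533.3 attendees
print(mean_absolute_percentage_error(actual, predicted))  # ~8.3 percent
```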
Bias
Evaluation of whether the model’s predictions systematically over-represent or under-represent certain groups or outcomes, indicating potential unfairness or imbalance in results. Key considerations include:
- It is important to be wary of systematic over-representation or under-representation of certain groups, as this could skew decisions or resource allocation.
- Analysts should regularly test model outputs for bias, comparing predictions across different demographic or regional groups. Where bias is detected, mitigation strategies such as rebalancing training data or introducing fairness constraints should be implemented.
Ethics
Assessment of the extent to which research ethics are adhered to by using this model, including respect for privacy, informed consent, and avoidance of harm:
- Analysts should always evaluate whether their models respect privacy, especially when working with sensitive data such as GPS or social media activity.
- Where ethical concerns arise (e.g. data privacy, public perception, etc.), teams should consult internal compliance or ethics committees to identify alternative approaches or mitigation strategies.
- Data anonymisation and privacy: anonymisation techniques may be required to ensure ethical data use.
Deliverability
Consideration of how feasible it is to deploy and maintain the model in real-world scenarios, taking into account factors like data access, technical complexity, scalability and resource requirements. Models with high technical complexity may be difficult for teams to scale or maintain. It is essential to assess whether the skills and resources available align with the model’s requirements.
Cost
Examination of the financial investment required to build, deploy, and maintain the model, including hardware, software, and human resources. Key considerations include:
- Costs do not just pertain to initial set-up and deployment, but also ongoing licences and maintenance.
- Open-source tools can often reduce development costs significantly, but analysts should ensure these align with their data governance policies and that the tools’ licences permit the intended use.
Demographics
Evaluation of the model’s ability to incorporate data on different demographic groups, ensuring that predictions reflect a diverse audience and demographic information is captured for analysis where appropriate. Key considerations include:
- Checking whether demographic groups are excluded or poorly represented in the data, as this will reduce the model’s effectiveness and its ability to make accurate predictions for events where these groups are well represented.
- Where demographic information is unavailable, analysts may need to consider alternative methods, such as demographic imputation or additional data collection.
Generalisability
Measure of the ability of a model developed to predict attendance at one specific event to predict attendance at other events. Key considerations include:
- Models that perform well for one event but poorly for others indicate limited generalisability. Here it is important to refine the model or retrain it on a broader dataset.
- Where generalisability is limited, models should be explicitly marked as context-specific, with appropriate guidance on their scope and limitations, including refraining from applying them more broadly.
Accessibility
Consideration of the challenges associated with securing access to data in a format that is usable and compliant with relevant platform-level and wider governance considerations. Key considerations include:
- Permissions & Governance: It is necessary to secure access from data owners and ensure compliance with legal and platform-specific policies.
- Format & Usability: Data must be accessible in, or transformable to, a format suitable for processing.
- Security & Data Transfer: Where sensitive or personally identifiable data is being used, encrypted, controlled-access methods may be necessary for secure transfer to prevent unauthorised use.