DCMS R&D Science and Analysis Programme - Capturing engagement numbers for unticketed sporting and cultural events: strand 1: annex 2
Published 13 March 2026
This report was authored by Jack Medlock, Hannah M. P. Stock, Andrew Knight, Donna Phillips, Adam L. Ozer, and Joseph Stordy at Verian, Dr Michael Sinclair, Dr Craig Macdonald, and Prof Iadh Ounis at The University of Glasgow, and Faculty.
This research was supported by the R&D Science and Analysis Programme at the Department for Culture, Media & Sport (DCMS). It was developed and produced according to the research team’s hypotheses and methods between October 2023 and June 2025. Any primary research, subsequent findings or recommendations do not represent UK Government views or policy.
1. Executive Summary
1.1 About the study
The Department for Culture, Media & Sport (DCMS) commissioned a research team led by Verian, a social research agency, to undertake an R&D Science and Analysis project on ‘capturing engagement numbers’. The team also includes data scientists at Faculty AI, a technology business, and academic experts from the Schools of Computer Science and Urban Studies at the University of Glasgow. The purpose of the project is to research, develop and validate new methods to measure engagement with cultural and sporting events and activities. This is particularly relevant to events and locations where traditional methods such as ticket sales or crowd counts are not possible and where new, more advanced methods may be more accurate or cost effective. This case study represents an example of that work, utilising a source of mobile app data to estimate attendance at The British Museum.
An initial scoping exercise highlighted different types of data that could be used with appropriate modelling techniques to provide estimates. These types included, but were not limited to, mobile app data, transport data, social media data, and aerial photography. Prioritisation criteria were developed in the scoping phase against which individual data sources could be vetted, with those that were assessed to be of the most use progressing to a stage of deeper analysis and early testing.
One main barrier to accessing data sources that are otherwise deemed worthy of more in-depth analysis is the considerable work required to meet data protection requirements, which cover the source itself, how it will be accessed, and how it will be used. The other main barrier is cost, with some data sources requiring prohibitive upfront investment.
1.2 Huq Mobile Data case study
Within the category of Mobile App data, Huq is a data source that the research team were able to access earlier than others because it is already licensed by the University of Glasgow’s Urban Big Data Centre (UBDC). This pre-existing relationship enabled low-cost access to enough data to develop an early method for generating estimates of attendance.
A list of sporting and cultural sites for which data was required was sent to Huq. The British Museum was selected as the best candidate to develop an early-stage method because it has a fixed point of entry, high footfall, and existing visitation data against which to compare estimates. Modelling techniques were applied to the data to generate estimates of the home location of attendees, to enable reporting of catchment area, and of visitation to the museum. The accuracy of the visitation estimates was then validated against the museum’s baseline visitation data.
The main conclusions at this stage of development of the method are as follows:
- Utilising mobile app data to estimate visitation is a new and emerging field of analysis with few examples readily available. This approach has demonstrated its potential utility by estimating visitation numbers to the British Museum with a promising correlation to observed baseline data. It can also be used to show use of cultural sites / events by measuring footfall density across various time windows.
- The analysis shows that it is possible to utilise mobile phone app data to estimate geographic reach. Early findings indicate that, for example, the geographic reach of the British Museum is concentrated in regions surrounding the site, the reasons for which can be investigated further.
- There is also the potential to estimate geo-demographic trends by linking the Index of Multiple Deprivation (IMD) decile of the region to the estimated home location of the visitor.
- Correlation is a statistical measure that describes the relationship between two variables and indicates how changes in one variable are associated with changes in another. A correlation between 0.5 and 0.8 is considered strong and indicates a substantial relationship that is unlikely to be due to chance. The analysis produces estimated visitation numbers that correlate strongly with trends in the official visitation data, including through the peaks and troughs of various lockdown restrictions. The strong positive correlation of 0.77 for the monthly estimate (with some month-on-month variation) is a positive outcome for a first attempt at this type of analysis. This is an experimental, previously untried approach for this specific use-case in both government and academic research, and it already demonstrates some viability for these methods, even before further iteration and refinement.
1.3 Limitations
There are acknowledged limitations to this method, some of which are more likely to be overcome than others. The overarching limitation to mobile app data is limited population coverage, which could increase the risk of bias for discrete events with small time windows. The impact can be mitigated using techniques such as sample weighting or increasing the time period over which the estimates are generated (e.g. estimating monthly or quarterly visitation instead of daily or weekly). This is more suitable for longer-lasting events or venues. Within the category of population coverage, a limitation, pertinent to the example of The British Museum, is that Huq data does not include people aged under 16 years for reasons of data protection. It is also much less likely to include international tourists, because for a population to be represented in Huq data, they would have needed to have accumulated significant data within the UK over time.
Some of these limitations will be able to be overcome with access to more data, either from Huq itself, or by pairing this method with other methods using other data sources that fill in those information gaps.
A risk to this method is that data protection regulation covering both how the data can be collected and how it can be used could change, resulting in mobile app data becoming less accessible. It is difficult to predict with certainty the risk of this happening in the future, but many sizeable technology companies such as Apple have constantly evolving privacy policies and regulations that can change how this data is collected and distributed.
It is also acknowledged that there is the potential for future improvements in the current method as it may be possible to refine further with the use of additional modelling techniques, though these are likely to create marginal gains compared to the success of the initial estimate.
2. Introduction
2.1 Background
The Department for Culture, Media and Sport (DCMS) are exploring new methods to measure engagement with cultural and sporting events and activities, to understand and facilitate engagement across its sectors. To date, the data has mostly been collected using traditional methods. At an event-specific level this might be by manually counting attendees. At a broader level, engagement is measured via surveys, which can be expensive and can suffer from low sample sizes and recall bias. Well-designed and comparatively large surveys, such as DCMS’s Participation Survey and Sport England’s Active Lives Survey, give robust measures of engagement at a nationally representative level for a range of cultural and sporting activities, and have the added benefit of capturing useful demographic and behavioural information. However, the data provides limited understanding of attendance and participation at a local level.
The main challenge is with unticketed events, particularly those with no point of entry, which might include playing for a local sports team or taking part in a drop-in activity in an open area as part of a larger cultural programme of events.
DCMS commissioned a research team led by Verian and including world-leading data scientists at Faculty AI and academic experts from the Schools of Computer Science and Urban Studies at the University of Glasgow to undertake an R&D Science and Analysis project on ‘capturing engagement numbers’. This project provides an opportunity to look at new methods (for example, using mobile data, aerial photography, wearable technology), which could go a long way to strengthening the insights gained from existing methods and filling some of the information gaps in survey data.
2.2 Research Method Assessment Criteria
To guide the research the following research criteria were developed to underpin the assessment of each method.
- Can robust attendance estimates be calculated, and for what types of events or activities?
- Beyond the attendance estimates, what further detail can the methods describe about users, including demographics and participation?
- For each method, an assessment of the following:
  - Accuracy
  - Biases
  - Ethics
  - Deliverability
  - Cost
- What issues and challenges could arise during data collection for each method (e.g. sampling biases)?
For the purposes of this project, a method is typically considered to be a combination of a data source and a modelling technique used to generate an estimate. The success of the method will be measured against agreed criteria, with a subsequent series of adjustments made to improve the accuracy of the estimate. The process of improvement could include combining one method with another method (or multiple methods), with the output of one filling the gap left by, or reinforcing the validity of, the first. For example, a transport data source might provide good information about origin and destination, with the destination being the event of interest, but as it says comparatively little about the event itself, it might benefit from being coupled with mobile app data to determine engagement at the event.
2.3 Overview of Strand 1 approach
The approach is split into two strands:
- Strand 1: To develop a comparative framework for different methods of measuring engagement with cultural and sporting events and activities
- Strand 2: To develop case study examples to test the application of different measuring methods, using mixed methods where appropriate, to measuring engagement with cultural and sporting events and activities
Strand 1 was further divided into a scoping or ‘Breadth’ phase, followed by a ‘Depth’ phase of more in-depth analysis of data sources evaluated as worthy of further investigation and modelling.
Figure 1: Introduction Process Flow of Breadth to Depth phase
At the beginning of Strand 1, a comprehensive scoping exercise was undertaken to identify data sources that could be used. Beyond the existing experience and expertise of the assembled team, a literature review was undertaken reviewing relevant papers from the last 5 years, and stakeholder interviews were conducted with Arm’s-Length Bodies (ALBs) (a full list is provided in Appendix 2).
Following the initial scoping phase, the data sources were vetted against agreed criteria. Some were progressed to a secondary phase wherein the data were acquired, and preliminary experiments involving different modelling techniques were run. This enabled an early assessment on the quality of the method. The final stage of Strand 1 involves additional evaluation and a potential recommendation of a suitable candidate for further experimentation during Strand 2. This report discusses the earlier experiment stage during Strand 1 with Huq Mobile App Data.
3. Mobile data and Huq
3.1 Overview of Mobile Data
Mobile phone app data predominantly exploits individual-level GPS data obtained from the use of mobile applications on GPS-enabled devices. The data are routinely gathered by third party companies through Software Development Kits (SDKs), spanning a broad spectrum of mobile phone applications, including those for navigation, health, shopping, and weather, all under the umbrella of informed consent. The manner of location recording is contingent on the user’s consent per app, which might be continuous in the background or only when the app is active. The data generally offer the point locations of the device with high accuracy (up to metres), dependent on the most precise sensor available at the time of recording, be it GPS, Bluetooth, cellular tower, or Wi-Fi.
Research using mobile data to investigate engagement in cultural and sporting events is in its infancy. Currently, third-party companies collect and harness raw mobile location data for commercial purposes, translating the data into analytical products capable of generating insights into visitor behaviour as well as establishing patterns across domains. However, the utility of mobile location data may extend beyond commercial use. As a precursor to the objectives of this project, recent research has recognised that the versatile nature of mobile data could lend itself to applied research such as estimating engagement with natural and cultural sites (Merrill et al., 2020; Mears et al., 2021; Ghermandi et al., 2023). Applications so far in the literature leverage mobile phone data to estimate spatial and temporal patterns in the use of urban green space (Heikinheimo et al., 2020; Mears et al., 2021; Sinclair et al., 2023b) and, in combination with ground truth visitation data, to model recreation and estimate visitation to water-based natural sites (Merrill et al., 2020).
Raw mobile phone data offers a high level of versatility. However, ethical concerns relating to data protection, and the general reluctance of companies to provide such data, mean that access to completely raw data is virtually impossible. Consequently, any research that utilises mobile data works with much less granular forms, supplied by companies in a way that adheres to their privacy regulations. While this reduces the versatility of the mobile data, overall it remains no less desirable as part of a potential method of measurement, because aggregation over large time windows not only enhances the reliability and applicability of the insights but also minimises potential biases.
To identify possible sources of mobile phone app data, during the breadth phase the project team undertook an extensive desk research exercise which included thorough coverage of:
- Academic literature guided by the pre-existing understanding of this literature
- Industry reports, which also served to identify new data sources becoming available
- Government reports from prior work on similar research projects
- Reviews of new technologies relevant to this research, both now and in the future
This review examined relevant literature from the last 5 years and was conducted through online portals (such as Google Scholar) using keyword searches such as ‘mobile phone data’, ‘mobile phone app data’, ‘app data’, ‘location-based services’. Further to this research, the team engaged with subject matter experts in Academia, Government, and Industry, including established contacts and additional contacts identified through this phase of work. A full list is provided in Appendix 2.
3.2 Prioritising Huq data
To prioritise specific data sources for the depth phase in Strand 1, each data source was analysed against a series of prioritisation criteria based on original research questions. Through this process Huq data emerged as a leading data source against each agreed criterion:
| Prioritisation Criteria | Aim | Huq Assessment |
|---|---|---|
| Granularity and Frequency | To assess whether robust estimates are feasible to calculate from the data source | Unlike other mobile app data sources, Huq data benefits from demographic information and an estimated visitor origin that is created by Huq. This led to Huq data being prioritised over other sources that were otherwise similar, such as data provided by ActiveXChange. The raw data used by Huq is collected in near real-time and is available on a UK-wide basis. It is updated daily, meaning techniques should be scalable to any site in the UK. |
| Access | To assess if it is likely the data sources could be used for this work | Several mobile app data sources that were initially identified as promising sources subsequently proved prohibitively expensive and were consequently de-prioritised for further exploration and experimentation. Engagement and scoping conversations with Echo Analytics, for example, did not progress after a prohibitively expensive cost model came to light. The University of Glasgow have a pre-existing partnership with Huq and through this project have developed a strong relationship with their data science team which has given this project direct access to data without a prohibitive cost model attached. |
| Attendance and Participation | To understand what types of events and activities the data sources could provide estimates for and what additional information they contain | Unlike the other mobile app data sources assessed, Huq data could allow for an estimation of demographics or reach for cultural sites or events. |
A full table for each mobile app data source greenlit in the Breadth Phase is provided in appendix 1.
Huq is currently licensed by the Urban Big Data Centre (UBDC) at the University of Glasgow for non-commercial research purposes and has already been successfully used by the team to assess how socio-demographically representative data are of a population. The results of the study were compared to the gold standard household survey in Scotland and found to be of a significantly high level of representativeness. Moreover, the team have used Huq data to consult Glasgow City Council on transportation planning and the use of greenspace within the city. The former project involved building origin destination matrices using Huq data and in 2023 was nominated for the most innovative project of the year at the Scottish Transport Awards.
3.3 Huq Data: Content and Format
The Huq dataset is derived from a range of mobile phone applications, which collect real-time location data from users’ smartphones. The dataset covers geographic locations across the UK and spans five years (2019-2024). It can be used to generate insight into human mobility patterns and behaviour, such as consumer trends and the impact of events on movement, and to support decision-making processes including urban planning.
Data delivered to the project team contains one row, per site, per user, per day and includes:
- An anonymised unique user ID (which is irreversibly hashed)
- A weight assigned to each user that is based on user demographics, known as the adjustment factor
- The home enclosing region of the user/visitor, e.g. Greater Manchester, Highlands. Note that ACORN (A Classification Of Residential Neighbourhoods) segmentation, created by CACI, is provided at a more granular level (e.g. postcodes)
- The site the user visited, such as Edinburgh Castle, British Museum, or Buckingham Palace. This is recorded as a Polygon ID
- The date when the user visited the site
4. Methodology
4.1 Developing a method using Huq data
Having progressed Huq data to the next stage of experimentation, and acquired access, this section will describe the development of a method to model the data to provide an estimate of engagement. This will include the approach to:
- Selection of cultural, natural and sports sites
- Detection of home location
- Visitation estimates
- Evaluation
4.2 Description of selection of cultural, natural and sports sites
To be able to model estimates of attendance and participation using Huq data, we had to first identify potential cultural or sporting sites. One of the important criteria in choosing sites was whether there was comparable existing baseline attendance data to compare against.
A range of sites were selected for the depth phase of the project to explore the feasibility of mobile phone app data to estimate user engagement. These are 45 sites that range from cultural sites to sports and natural sites. Some have fixed boundaries and are single use, while others are multiuse and without fixed boundaries. Table 1 highlights the list of sites where data has already been obtained from the provider.
Table 1: List of sites being explored and status of data transfer
| Name | Data status | Site type |
|---|---|---|
| The British Museum | Received | Cultural |
| Arthur’s Seat | Received | Natural |
| Dynamic Earth | Received | Cultural |
| Scottish Parliament | Received | Cultural |
| Buckingham Palace | Received | Cultural |
| Buckingham Palace Garden | Received | Natural |
| Celtic Park | Received | Sport |
| Edinburgh Castle | Received | Cultural |
| Kelvingrove Gallery | Received | Cultural |
| Kelvingrove Park | Received | Natural |
| The Burrell Exhibition (inside Pollok Country Park) | Received | Cultural |
| Pollok Country Park | Received | Natural |
| Stonehenge | Received | Cultural |
| Tower of London | Received | Cultural |
Figure 2: Specific cultural sites or areas of interest are selected by ‘clipping’ the polygon of interest for analysis. The British Museum is the subject of this case study and is ‘clipped’ in the image below:
4.3 Home location detection approach
To estimate visitation, it is important to have information about the geographic home location of the mobile phone users in the sample. Users in the Huq data are represented by non-reversible hashed identifiers, but no individual characteristics or home area data are provided. However, a home area can be inferred using a home detection algorithm. Given that data points from individual users are linked over time using hashed (anonymous) identifiers, it is possible to estimate home area across multiple months or years. This is an important capability of mobile app data: by estimating home location, analysts can infer demographic information about attendees, which could be beneficial for event organisers or venue operators. Note that Huq does not publicly disclose the full technical details of this algorithm.
The home location detection approach adopted leverages Datazone boundaries for Scotland and LSOA boundaries for England and Wales. It also leverages 2020 residential land use data from Geomni’s UKBuildings layer. The Geomni’s UKBuildings layer is a multi-polygon spatial dataset that details the use, characteristics, and extent of commercial, public, and residential buildings across the UK.
For each unique user in the dataset, the home location for a given month is determined as the Datazone/LSOA which maximises the number of active evenings within residential and mixed residential space based on the Geomni UKBuildings data, where an evening is defined as the time between 20:00 and 06:00. There is a minimal risk that people resident in the same area as a site could register as visitors to the location when they are in fact residents. However, given the size of the LSOA and the careful drawing of venue boundaries, the risk is considered extremely low.
It can be represented mathematically as follows:
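Let:
- $U$ be the set of users
- $D$ be the set of data points for each user, with $D_u$ representing the data points for user $u$
- $T$ be the set of timestamps for each data point, with $T_d$ representing the timestamp for data point $d$
- $Z$ be the set of data zones/LSOAs
- $E_u$ be the set of evening data points for user $u$, where $E_u = \{d \in D_u : T_d \geq 20{:}00 \ \text{or} \ T_d \leq 06{:}00\}$ (the evening window spans midnight)
- $F_{uz}(n)$ be a binary function indicating whether user $u$ was found active in data zone $z$ during evening $n$ of the month, defined as:

$$F_{uz}(n) = \begin{cases} 1 & \text{if some } d \in E_u \text{ recorded during evening } n \text{ lies in zone } z \\ 0 & \text{otherwise} \end{cases}$$

- $H_u$ be the home location for user $u$, determined as the data zone or LSOA with the maximum number of evenings spent[footnote 1], calculated as:

$$H_u = \arg\max_{z \in Z} \sum_{n} F_{uz}(n)$$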
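
The home detection rule described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Huq's or the research team's implementation: the input tuple layout, the `is_residential` flag (standing in for the Geomni UKBuildings lookup), and the function names are all hypothetical. The evening window is treated as 20:00 up to, but not including, 06:00, and the input is assumed to cover a single month.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def evening_key(ts):
    """Return the calendar date on which the evening containing ts began,
    or None if ts falls outside the 20:00-06:00 window."""
    if ts.hour >= 20:
        return ts.date()
    if ts.hour < 6:
        # Early-morning points belong to the evening that started the day before.
        return (ts - timedelta(days=1)).date()
    return None


def detect_home(points):
    """Assign each user the zone with the most distinct active evenings.

    points: iterable of (user_id, timestamp, zone_id, is_residential) tuples
    for one month (hypothetical layout). Users with no qualifying evening
    points are omitted from the result. Ties go to the zone seen first.
    """
    evenings = defaultdict(lambda: defaultdict(set))  # user -> zone -> evenings
    for user, ts, zone, is_residential in points:
        key = evening_key(ts)
        if key is not None and is_residential:
            evenings[user][zone].add(key)
    return {
        user: max(zones, key=lambda z: len(zones[z]))
        for user, zones in evenings.items()
    }
```

Counting distinct evenings rather than raw data points prevents a single evening with many location pings from dominating the result.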
4.4 Visitation estimates from sample weighting
The mobile phone data can be viewed as a panel dataset which consists of mobile phone users and their use of varying mobile phone applications over time. Specifically for this case study, the data consists of mobile phone users known to have been in, or within proximity to, The British Museum on one or more specific days within the five years for which Huq data is available. This panel represents a sample of the actual population.
Weighting techniques are used to generate estimates of cultural engagement at a population level, by adjusting for the sample of users who have visited a site at a geographic level, based on the assigned home area. These estimates of visitation days are then scaled to a population level based on known area population drawn from the latest available census data.
The formula for estimating the number of visits ($V$) to site $s$ during time period $t$, adjusted for regional variations in mobile phone users, is:

$$V_{s,t} = \sum_{r \in R} U_r \times W_r \times C_r$$
Where:
- $R$ is the set of all regions in the UK.
- $U_r$ is the number of mobile phone visitors in region $r$ who visit site $s$ during period $t$.
- The weight ($W_r$) for each region ($r$) corrects for over- or under-representation of mobile phone users by comparing the ratio of mobile users to the known population across all regions in the country.
- $C_r$ is the correction ratio, scaling the weighted visitor days to an estimate of the population using the ratio to the known population for a region.
$W_r$ is calculated monthly, and councils are used as the regional geography through which to apply the weight ($W$) and correction ratio ($C$). The time period ($t$) used is a day. Daily visitation estimates were summed to a monthly level in line with the baseline data and to enable ease of comparison and visualisation.
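
The weighting step can be illustrated with a short Python sketch. It assumes the estimate is the sum over regions of visitors multiplied by the regional weight and correction ratio, as the definitions above suggest. The function names, region names and all numbers are hypothetical, and the weights and correction ratios are taken as given inputs because the report does not set out how they are derived in detail.

```python
def estimate_visits(visitors_by_region, weights, corrections):
    """Daily visit estimate: sum over regions of U_r * W_r * C_r.

    visitors_by_region: {region: observed mobile phone visitors (U_r)}
    weights:            {region: representation weight (W_r)}
    corrections:        {region: population correction ratio (C_r)}
    """
    return sum(
        u_r * weights[region] * corrections[region]
        for region, u_r in visitors_by_region.items()
    )


def monthly_estimate(daily_estimates):
    """Sum daily estimates to a monthly figure for comparison with baseline data."""
    return sum(daily_estimates)


# Illustrative values only; these are not taken from the Huq data.
day = estimate_visits(
    {"Camden": 12, "Leeds": 3},
    weights={"Camden": 0.8, "Leeds": 1.5},
    corrections={"Camden": 40.0, "Leeds": 55.0},
)
print(round(day, 1))
```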
4.5 Evaluation approaches
To evaluate the performance of the sample weighting-based model, we use Pearson’s correlation, Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) as evaluation metrics. The descriptions and mathematical representations are provided below.
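Pearson’s Correlation
The Pearson correlation coefficient ($r$) measures the linear relationship between two variables and is mathematically represented as follows:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Where:
- $x_i$ and $y_i$ are the individual data points of the variables $x$ and $y$
- $\bar{x}$ and $\bar{y}$ are the means of variables $x$ and $y$
- $n$ is the total number of observations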
Root Mean Squared Error (RMSE)
Root Mean Squared Error (RMSE) is a commonly used metric to measure the accuracy of a model’s predictions. It is calculated as the square root of the average squared differences between the predicted values and the actual values. Mathematically it is represented as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{actual}_i - \text{predicted}_i)^2}$$

Where:
- $\text{actual}_i$ is the actual value.
- $\text{predicted}_i$ is the predicted value.
- $n$ is the total number of observations.
Mean Absolute Percentage Error (MAPE)
MAPE is the average absolute difference between the predicted and actual values, relative to the actual values. It is expressed as a percentage, with lower values indicating better model performance and higher accuracy. It is particularly useful for understanding the prediction error in relative terms, making it easier to interpret in the context of the magnitude of the actual values. However, it can be sensitive to very small actual values, which can disproportionately inflate the error percentage. Mathematically it is represented as follows:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\text{actual}_i - \text{predicted}_i}{\text{actual}_i} \right|$$

Where:
- $\text{actual}_i$ is the actual value.
- $\text{predicted}_i$ is the predicted value.
- $n$ is the total number of observations.
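
The three evaluation metrics can be computed with a few lines of standard-library Python, sketched below. The function names are illustrative. One assumption is labelled in the code: months with an actual value of zero (for example, when a site is closed) are skipped in MAPE because the percentage error is undefined there. The report does not state how such months were handled.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error.

    Assumption: observations with actual == 0 are skipped, since the
    percentage error cannot be computed for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)
```

Reporting RMSE alongside MAPE is useful because RMSE is dominated by the busiest months, whereas MAPE treats each month's relative error equally.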
5. Results
The purpose of the case study is to provide answers to the research method assessment criteria included in the Introduction to this case study. To evaluate this method for the specific purpose of generating estimates of engagement with The British Museum, the case study aims to address the following questions:
- Can mobile phone app data be used to estimate visitation or footfall to cultural sites or events?
- Can mobile phone app data be used to estimate the reach and catchment of a site or event?
- How do visitation estimates from mobile phone app data compare to baseline official visitation data?
5.1 Estimating footfall
One of the most distinctive characteristics of mobile phone app data is its spatial resolution: the data consists of a set of high accuracy points generated using a wide range of mobile phone applications. This resolution extends to building level and potentially floor level, made possible through advanced processing of high-resolution mobile location data combined with spatial mapping techniques. Figure 3 highlights this for the British Museum, visualising the density of footfall using the point locations within the boundary of the site. Huq data was aggregated to a polygon level using a hexagonal grid, and the colour represents a relative measure of footfall density across all polygons. Uber H3 hexagons at resolution 12 have been used for this aggregation. Hexagons are used because they are better suited than other geometries to this kind of visualisation, as they reduce edge connections.
From this visualisation, clear hotspots of use emerge within the site in relation to some key areas. While Figure 3 shows aggregated footfall across an extended time period of five years, it would be possible to analyse the use of space across various time windows, for example weekend compared to weekday, or when a specific event or exhibit was being displayed. The wider the time window used in the aggregation, the larger and more diverse the sample of mobile phone users incorporated in the footfall pattern. When subsetting to small time windows, there is a risk of increasing biases given the reduced sample. It should be noted that this approach does not break down use by museum floor.
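
The aggregation behind a footfall density map of this kind can be sketched as below. The sketch assumes each point has already been assigned to a grid cell (for example, an H3 cell at resolution 12) and works only with the resulting cell identifiers. The cell IDs shown are hypothetical placeholders, and scaling by the busiest cell is an assumed normalisation, since the report does not specify how the relative measure was calculated.

```python
from collections import Counter


def relative_density(cell_ids):
    """Count points per grid cell and scale so the busiest cell equals 1.0.

    cell_ids: iterable of cell identifiers, one per observed point.
    Returns {cell_id: relative density between 0 and 1}.
    """
    counts = Counter(cell_ids)
    if not counts:
        return {}
    peak = max(counts.values())
    return {cell: n / peak for cell, n in counts.items()}


# Hypothetical cell identifiers standing in for real hexagon indexes.
density = relative_density(["cell_a", "cell_a", "cell_b"])
```

Restricting the input to a subset of dates before calling the function would give the weekend, weekday or exhibition-specific views described above, subject to the smaller sample.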
Figure 3: Footfall at the British Museum (2019-2023) from mobile app data. Note this is overall and not separated by museum floor level.
5.2 Estimating catchment
As described in Section 4.3, each visitor to the British Museum was allocated a home geography. This process was important for the methodology of estimating visitation but also to allow for the exploration of the catchment (or reach) of a site or event. Figure 4 shows the catchment as the percentage of visitation to the British Museum between 2019 and 2023, broken down by postcode district across the United Kingdom. Data aggregation means it is not possible (or ethically desirable) to widen the catchment area and explore movement of individuals from the British Museum to other local venues.
The most notable pattern from this visualisation is the density of visitation in the regions surrounding the site (Figure 4), indicating the potential impact of accessibility on visitation. A detailed view of the region surrounding the site is provided in Figure 5. Another notable result from the map is the extent of blank space. This does not mean that there were no visitors from these regions to the British Museum during the period, but that no mobile phone users from these regions were observed during the time period. This result highlights the sample coverage of the data, and given the small size of the site of interest, it is not surprising to see such coverage. It may be possible in future to explore spatial methods which allow values for missing regions to be modelled and imputed based on their characteristics and surrounding regions.
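
A catchment share of this kind can be computed as in the sketch below, which keeps the distinction made above between a district with no observed users and a district with zero visitors. The function name, district codes and visit counts are hypothetical; unobserved districts are returned as `None` so that they can be drawn as blank rather than as 0%.

```python
def catchment_shares(visits_by_district, all_districts):
    """Percentage of total visitation attributed to each home district.

    visits_by_district: {district: weighted visits from observed users}
    all_districts:      every district to be drawn on the map
    Districts with no observed users map to None (blank), not 0.
    """
    total = sum(visits_by_district.values())
    shares = {}
    for district in all_districts:
        visits = visits_by_district.get(district)
        shares[district] = None if not visits else 100.0 * visits / total
    return shares


# Illustrative values only.
shares = catchment_shares({"WC1": 30, "N1": 10}, ["WC1", "N1", "AB1"])
```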
Figure 4: Catchment of visitors to British Museum (2019-2023) from mobile app data
Figure 5: Catchment of Visitors to British Museum (2019-2023) from mobile app data, highlighting visitation from surrounding region. Note that the percentage values shown are using the complete UK data and not subset to London.
5.3 Visitation estimates and comparison to baseline data
Having extracted mobile phone visitors from the British Museum and allocated them to a home geography, this information was utilised to weight and scale the data to an estimate of visitation for the population. The results of this approach are shown in Table 2, where visitation is broken down for the British Museum by month across five years. There is a clear pattern of reduced visitation after March 2020, when the UK entered a nationwide lockdown and the site closed to the public. Estimated visitor numbers remain lower than the 2019 peak through all the remaining months presented.
Table 2: Monthly estimates of visitation to British Museum from sample weighting
| Month | 2019 Recorded | 2019 Estimated | 2020 Recorded | 2020 Estimated | 2021 Recorded | 2021 Estimated | 2022 Recorded | 2022 Estimated | 2023 Recorded | 2023 Estimated |
|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 447,863 | 463,276 | 463,881 | 483,977 | 0 | 20,190 | 190,792 | 123,645 | 395,472 | 246,739 |
| Feb | 470,127 | 610,208 | 471,593 | 519,213 | 0 | 20,470 | 246,311 | 165,879 | 303,120 | 80,416 |
| Mar | 494,348 | 683,438 | 179,887 | 212,630 | 0 | 30,514 | 280,991 | 151,617 | 466,847 | 132,554 |
| Apr | 518,883 | 815,027 | 0 | 8,202 | 0 | 15,493 | 335,401 | 196,864 | 376,473 | 194,584 |
| May | 546,823 | 592,879 | 0 | 5,230 | 44,893 | 54,808 | 346,168 | 270,111 | 531,650 | 210,691 |
| Jun | 577,624 | 731,188 | 0 | 5,306 | 111,514 | 78,234 | 379,912 | 265,321 | 560,358 | 453,626 |
| Jul | 687,837 | 628,596 | 0 | 11,086 | 145,595 | 107,024 | 471,554 | 304,420 | 715,654 | 350,846 |
| Aug | 638,890 | 641,342 | 7,194 | 6,334 | 205,847 | 153,727 | 432,545 | 203,039 | 633,089 | 295,516 |
| Sep | 447,383 | 435,431 | 43,627 | 47,572 | 168,444 | 112,413 | 335,308 | 108,360 | 472,610 | 157,963 |
| Oct | 522,556 | 529,337 | 72,418 | 62,408 | 250,198 | 177,637 | 404,408 | 201,111 | 524,294 | 184,207 |
| Nov | 442,347 | 599,474 | 11,903 | 26,531 | 220,861 | 145,904 | 327,109 | 173,539 | 406,665 | 107,028 |
| Dec | 445,302 | 569,713 | 24,963 | 49,801 | 179,768 | 119,919 | 346,754 | 131,478 | 434,628 | 121,812 |
| Total | 6,239,983 | 7,299,910 | 1,275,466 | 1,438,289 | 1,327,120 | 1,036,333 | 4,097,253 | 2,295,384 | 5,820,860 | 2,535,982 |
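The weighting and scaling step behind the estimates in Table 2 can be illustrated with a minimal sketch. It assumes a simple form of sample weighting in which the observed mobile visitors from each home region are scaled by the ratio of that region's resident population to the number of mobile users in the sample who live there. The region codes, the counts and the name `estimate_visits` are hypothetical, and the weighting used in the study may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class RegionSample:
    """Illustrative inputs for one home region (all values are made up)."""
    region: str             # home geography code, e.g. an LSOA or Datazone
    population: int         # resident population the sample should represent
    sample_users: int       # mobile users in the data whose home is this region
    observed_visitors: int  # of those users, how many were seen at the site


def estimate_visits(samples: list[RegionSample]) -> float:
    """Scale observed mobile visitors up to a population-level estimate.

    Assumed weighting: weight = population / sample_users for each region.
    Regions with no sample users cannot be weighted and contribute nothing.
    """
    total = 0.0
    for s in samples:
        if s.sample_users == 0:
            continue  # no coverage in the sample for this region
        weight = s.population / s.sample_users
        total += s.observed_visitors * weight
    return total


samples = [
    RegionSample("A", population=2_000, sample_users=40, observed_visitors=3),
    RegionSample("B", population=1_500, sample_users=30, observed_visitors=1),
    RegionSample("C", population=1_800, sample_users=0, observed_visitors=0),
]
print(estimate_visits(samples))
```

In this sketch a region with no sample users adds nothing to the total, which is one way that low coverage can pull an estimate below the true figure.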
The monthly visitation estimates from the sample weighting approach are compared to official visitation data for the British Museum in Figure 6. Results were aggregated at a monthly level and the trend was visualised over the five years from January 2019 to December 2023. Although the estimates were scaled and weighted from a relatively low number of mobile phone visits, they broadly follow the trend in the official data. The agreement was especially clear in the early period of the Covid-19 pandemic, following the first nationwide lockdown when the museum was closed to the public. It was also apparent in the peaks and troughs that followed the easing of lockdown restrictions.
Although the overall trend is similar, the mobile phone estimates diverge from the official data in some periods. Before 2020 the estimates tended to overestimate visitation; after the first lockdown period they tended to underestimate it when compared to the current published statistics. Several factors could cause this, of which two stand out. First, the data consist of mobile phone users aged 16+, so children, who represent a key visitor group in the official data, are missing. Second, mobile users who cannot be allocated a home geography (because they are not in the dataset long enough) are not included in the current methodology, meaning international tourists are likely to be excluded. Low mobile user numbers, changes to the way the company collected the data over time, and changes to the pool of apps it draws from could also have an impact. These include changes to UK legislation and tech company policy, which can alter how mobile phone app data is collected. In 2018, for example, Apple removed apps from the App Store which shared data without consent in response to privacy concerns, and both Apple and other handset providers are potentially updating their T&Cs to provide less data, particularly location data, to apps, with privacy in mind. Such changes in data collection are more likely than the discrepancy between the Huq sample and actual visitation to explain why the strength of the correlation changes over time, for example before and after the 2020 lockdown.
Figure 6: Comparing monthly estimated visitation from mobile phone app data to official visitation data from DCMS
The results can be measured using the evaluation approaches outlined in Section 2.3. Table 3 compares estimated visitation with the official visitation data using Pearson’s correlation coefficient, the Mean Absolute Percentage Error (MAPE), and the Root Mean Squared Error (RMSE). Pearson’s correlation is a statistical measure that quantifies the strength of a linear relationship between two variables, from -1 (very strong negative correlation) to 1 (very strong positive correlation). The probability value (p-value) indicates how likely it is that a correlation at least this strong would be observed by chance if there were no underlying relationship; a p-value below 0.05 is conventionally taken as evidence that a finding is not due to chance. The Pearson’s correlation coefficient for the monthly estimates is 0.77 (p-value <0.001). This strong correlation should be considered a good result for this stage of the project. The MAPE is 75.55% and the RMSE is approximately 160,000 visits per month. Both can be used to evaluate model performance and, as both are fairly high in this case, there is room for improvement in the performance of the model. This is almost certainly due to small values for some of the months. Model performance could be improved (and prediction error reduced) by increasing the sample size or, depending on the size of a site and its visitor numbers, by analysing the data by quarter, which would increase the sample size in each period.
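As an illustration of how the three measures are calculated, the sketch below applies the standard formulas to the January to June 2019 figures from Table 2. It is not the project's evaluation code. The function names are hypothetical, the p-value is not computed, and skipping months with a recorded value of zero in the MAPE is an assumption, since the percentage error is undefined for those months.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mape(recorded, estimated):
    """Mean Absolute Percentage Error, in percent.

    Assumption: months with a recorded value of zero are skipped,
    because the percentage error cannot be calculated for them.
    """
    pairs = [(r, e) for r, e in zip(recorded, estimated) if r != 0]
    return 100 * sum(abs(e - r) / r for r, e in pairs) / len(pairs)


def rmse(recorded, estimated):
    """Root Mean Squared Error, in the same units as the data (visits)."""
    squared = [(e - r) ** 2 for r, e in zip(recorded, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# January to June 2019, taken from Table 2.
recorded = [447_863, 470_127, 494_348, 518_883, 546_823, 577_624]
estimated = [463_276, 610_208, 683_438, 815_027, 592_879, 731_188]

print(f"Pearson r: {pearson_r(recorded, estimated):.2f}")
print(f"MAPE: {mape(recorded, estimated):.2f}%")
print(f"RMSE: {rmse(recorded, estimated):,.0f} visits")
```

Because the MAPE divides each error by the recorded value, months with very low official counts can dominate it, whereas the RMSE is driven by the largest absolute errors.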
Table 3: Measures of monthly visitation accuracy (2019-2023). **denotes 0.001 significance level. Estimated monthly visits are compared to official visitation data
| Evaluation Measure | Result |
|---|---|
| Pearson’s Correlation Coefficient | 0.77** |
| Mean Absolute Percentage Error (%) | 75.55 |
| Root Mean Squared Error (visits per month) | 159,809 |
6. Estimating Geo-Demographics
As well as being used to explore a site’s possible geographic reach and catchment, the data could also be used to estimate geo-demographic trends and characteristics for visitors. Figure 5 is an example of this potential, showing, for each year, the percentage of British Museum mobile visitor days broken down by the Index of Multiple Deprivation (IMD) decile of the visitor’s home region across the UK.
Based on the IMD, each region is given a rank, and these ranks are split into percentiles, deciles, and quintiles. Deciles are 10 equal groups, where 1 is the most deprived 10% of the country and 10 is the least deprived 10%. The numbers represent the percentage of mobile visitors falling into each of these 10 groups for each year. For example, a value of 21.1% in Decile 10 in 2020 would mean that 21.1% of the mobile visitors in that year came from areas among the 10% least deprived in the UK.
This analysis uses IMD data that was joined in the prior steps to the estimated home location of the mobile user. The dataset was collated by the Consumer Data Research Centre (CDRC), which combined the Scottish Index of Multiple Deprivation (SIMD) with the English and Welsh IMD into one dataset using Datazones and LSOAs. The data is from 2019 for England and Wales and from 2020 for Scotland.
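The decile breakdown described above can be sketched as follows. This is a minimal illustration rather than the project's code. It assumes each visitor day carries the deprivation rank of the visitor's home area (1 = most deprived), and the number of areas, the ranks and the function names are hypothetical. The source datasets may already supply a decile for each area, in which case only the second function would be needed.

```python
from collections import Counter


def imd_decile(rank: int, n_areas: int) -> int:
    """Map a deprivation rank (1 = most deprived) to a decile from 1 to 10.

    Integer arithmetic is used so that ranks on a decile boundary are
    placed consistently (equivalent to ceil(rank * 10 / n_areas)).
    """
    return (rank * 10 + n_areas - 1) // n_areas


def decile_shares(deciles):
    """Percentage of visitor days in each decile (1..10) for one year."""
    counts = Counter(deciles)
    total = sum(counts.values())
    return {d: 100 * counts.get(d, 0) / total for d in range(1, 11)}


# Hypothetical example: 20 areas in total, and the home-area rank
# attached to each of eight mobile visitor days in one year.
N_AREAS = 20
home_ranks = [1, 5, 12, 19, 20, 20, 18, 7]

visitor_deciles = [imd_decile(rank, N_AREAS) for rank in home_ranks]
shares = decile_shares(visitor_deciles)
for decile, share in shares.items():
    print(f"Decile {decile}: {share:.1f}%")
```

Repeating the calculation for each year gives one row of percentages per year, which is the form of breakdown described in this section.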
7. Considerations and limitations
Below is a table of the main limitations uncovered to date with Huq mobile app data, as well as potential mitigation and next steps.
| Limitation | Mitigation | Next Steps |
|---|---|---|
| Low population coverage of mobile phone app data means estimates should be treated with caution and may be unsuitable for some purposes. Subsetting to small time windows, for example, could increase the risk of bias by reducing sample size. | Weighting and scaling improve the representativeness of the data. Estimates generated on larger sample sizes would produce even stronger results. The visitation estimates generated by the methodology show comparable trends to official visitation data as well as a strong correlation (0.77, p-value <0.001). | Prior to commencing work in Strand 2, establish with venues and events their use cases and limitations that would be sustainable. |
| Low user numbers in some months and regions could create a privacy risk. | The mobile app data in this analysis consists of de-identified personal data, meaning no personal identifiers are received. | As part of this project, raw data is not provided so there is no such risk in this work, but it should be noted for future projects as a finding here. |
| Limited coverage of mobile phone app users: users aged under 16 are excluded on ethical grounds. | Children’s attendance could potentially be modelled by identifying the relationship between adults’ and children’s attendance in the baseline data (if we are able to obtain the numbers for children and adults separately) and considering seasonal changes. Attendance by international tourists could be explored in a similar way, by estimating tourist numbers from the gap between Huq and the baseline data (which does include tourists). | In Strand 2, carry out these pieces of explorative analysis and modelling for a few shortlisted sites. |
| Limited coverage of mobile phone app users: international tourists are unlikely to be included. | There is some limited data available (e.g. Visit Britain), but any estimates would need to be strongly caveated. | Look to alternative data sources as a possibility of closing this information gap. Alternatively explore whether estimates are possible albeit with the limited data available. |
| Home region is the only geographic/demographic variable that is included in the data. | Home region (Datazone, LSOA) has already been linked to the Indices of Multiple Deprivation (IMD) to provide some demographic information. This could also be linked to other geo-demographic data, such as from the census, to provide more insight into who is attending cultural and sporting sites. | In Strand 2, explore geographic linking to other demographic datasets and produce analysis of engagement by demographic variables. |
8. Future steps
- Iterate Current Approach: The current approach to estimate visitation will be extended to additional sites. The data on catchment will be combined with demographic data to explore the bias and sample demographics of mobile phone app data for a range of cultural sites. Some limitations have already been identified, and some will always exist. More analysis is required to understand whether these limitations make this data source unusable, or whether they can be accepted.
- Test Additional Modelling Techniques: Visitation estimates from a larger number of sites will be combined with site and context specific information, as well as other data sources, to explore regression and machine learning models which should further improve visitation estimates. We propose the utilisation of conventional machine learning models, such as Decision Tree and Random Forest, which have proven effective in regression tasks across numerous case studies (Balogun & Tella, 2022; Ahmad, Reynolds, & Rezgui, 2018). Additionally, these models offer high interpretability and are computationally less demanding, with shorter training times compared to more complex deep learning approaches like Long Short-Term Memory Recurrent Neural Networks (LSTM RNN). Although models such as LSTM RNN are popular choices for time series data, their performance is heavily contingent on the availability of large training datasets. This dependency renders them unsuitable for our purposes, given the limited size of our historical training data (ground truth data). Additionally, studies specifically related to visitation estimates have shown that conventional machine learning models can be highly effective, often outperforming neural networks (Yap, Gong, Naha, & Mahanti, 2020).
- Apply Test Outcomes in Strand 2: Hold initial discussions with test events/locations in Strand 2 to determine how usable the developed approaches will be. Identify sites for Strand 2 to start data acquisition.
9. Appendix 1: Greenlit Data Sources
Data Source: Huq
Description: Individual-level GPS data collected from the use of mobile phone applications on GPS-enabled devices. Raw data is used by the company to create dashboards on the use of different points of interest.
Granularity and Frequency: The raw data is near real-time. The company offers aggregate products based on this. Coverage varies by geographic area but is around 1-5% of the population.
Access: Cost is generally for access to a dashboard, based on the number of points of interest. However, for this project the company may provide data directly or through BigQuery access. Cost unknown; sample data requested. The University of Glasgow has a partnership with the company and access to raw data for non-commercial purposes, which may be available to this project through a licence extension.
Attendance and Participation: Could be used to estimate attendance for fixed sites. Could be used to estimate demographics or reach.
Scalability: This data is available UK-wide and is updated daily, meaning techniques should be able to scale to any site in the UK. The techniques developed for this data source could be applied to other mobile app data sources in future should they become available.
Prioritisation for Depth Phase: Green - This data benefits from demographic and visitor origin information. Further, members of the consortium have a working relationship with the company. Data should be prioritised to be used in combination with regression and machine learning. It can also be used to estimate event demographics and reach in combination with sample weighting.
Latest Progress: Huq have provided processed data for 9 cultural/natural/sports sites throughout the UK from 2019 until 2024. The team have started to develop the first visitation estimates for these sites using the proposed sample weighting approach. We are set to receive data from Huq for a wider range of cultural and sports sites across the UK to enable regression and machine learning methodologies to progress. The team are in the process of collating baseline visitation data for sites before requesting mobile app data to support this.
Data Source: ActiveXChange
Description: Individual-level GPS movement data targeting health and wellbeing. The data captured is provided primarily as an ‘activity index’ that reflects the level of use in the specified timespan and geographic space.
Granularity and Frequency: Aggregated at an hourly level.
Access: Data provided directly by the company. Cost is based on number of points of interest requested but currently unknown, sample data requested.
Attendance and Participation: Could be used to estimate attendance for fixed sites.
Scalability: This data is available UK-wide and is updated daily, meaning techniques should be feasible to scale to any site in the UK. However, we have not worked with this data yet.
Prioritisation for Depth Phase: Green - Data should be prioritised in combination with regression and machine learning but does not contain reach or demographic information.
Latest Progress: The data did not contain any information about reach or user demographics which limited its value over alternatives such as Huq, which we have progressed with.
Data Source: Unacast
Description: This data shows the number of device visits by day of the week and time of the day for residents, workers and other type of visitor profile.
Granularity and Frequency: Aggregates daily and hourly patterns at a monthly interval using quadgrids or census areas.
Access: Data available from Carto, cost unknown, sample requested.
Attendance and Participation: Could be used to estimate attendance for fixed sites.
Scalability: This data is available UK-wide and is updated daily, meaning techniques should be feasible to scale to any site in the UK. However, we have not worked with this data yet.
Prioritisation for Depth Phase: Green - Data should be prioritised in combination with regression and machine learning but does not contain reach or demographic information.
Latest Progress: This data source was available through the Carto geospatial dashboard. A decision was taken not to advance due to the prohibitive cost model of Carto. Attempts to access the data directly from Unacast were unsuccessful.
Data Source: CKDelta
Description: Using mobile app data: (1) Footfall by dwell times (2) Catchment Areas (3) Footfall by age bands
Granularity and Frequency: Aggregated (300m * 300m grid or 1km * 1km grid) every 30 minutes.
Access: Data available from Carto, linking three data products together. Cost unknown, sample requested.
Attendance and Participation: Could be used to estimate attendance for fixed sites. Could be used to estimate demographics or reach.
Scalability: This data is available UK-wide and is updated daily, meaning techniques should be feasible to scale to any site in the UK. However, we have not worked with this data yet.
Prioritisation for Depth Phase: Green - This data may benefit from some demographic and visitor origin information. Data should be prioritised in combination with regression and machine learning. It may also be used to estimate event demographics and reach.
Latest Progress: This data source was available through the Carto geospatial dashboard. A decision was taken not to advance due to the prohibitive cost model of Carto. Attempts to access the data directly from CKDelta were unsuccessful.
Data Source: Echo Analytics
Description: Develops footfall estimates to points of interest from mobile app location data.
Granularity and Frequency: Weekly, monthly, or quarterly trends in footfall counts. The company does not provide absolute numbers.
Access: Provided directly by the company. Cost unknown, sample data requested.
Attendance and Participation: Could be used to estimate trends in attendance for fixed sites.
Scalability: This data is available UK-wide and is updated daily, meaning techniques should be feasible to scale to any site in the UK. However, we have not worked with this data yet.
Prioritisation for Depth Phase: Green - Data should be prioritised in combination with regression and machine learning but does not contain reach or demographic information.
Latest Progress: A decision was taken not to advance due to the prohibitive cost model.
10. Appendix 2: Consulted Stakeholders
As part of the Breadth Phase the following stakeholders have been consulted in this study to date:
- Sport England
- Visit Britain
- Bradford 2025
- University of Bradford
- Bradford 2025 Events Team
- Spirit of 2012
- University of Warwick
- Geospatial Commission
The names of the individuals spoken to have been excluded from this report.
11. Appendix 3: References
Ahmad, M. W., Reynolds, J., & Rezgui, Y. (2018). Predictive modelling for solar thermal energy systems: A comparison of support vector regression, random forest, extra trees and regression trees. Journal of Cleaner Production.
Balogun, A.-L., & Tella, A. (2022). Modelling and investigating the impacts of climatic variables on ozone concentration in Malaysia using correlation analysis with random forest, decision tree regression, linear regression, and support vector regression. Chemosphere.
Sinclair, M et al. (2023). Assessing the Socio-Demographic Representativeness of Mobile Phone Application Data. Applied Geography Volume 158.
Verduzco-Torres, J.R. and Raturi, V. (2023). Can Smartphone Location Data at the Point Level be Used to Estimate Traffic Volumes?: A Methodological Evaluation. 12th International Conference on Geographic Information Science.
Yap, N., Gong, M., Naha, R. K., & Mahanti, A. (2020). Machine Learning-based Modelling for Museum Visitations Prediction. International Symposium on Networks, Computers and Communications (ISNCC) (pp. 1-7). IEEE.
- If two Datazones have the same number of active evenings or if the user only has 1 active evening in the Datazone determined to be their home, do not assign a home location.
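The rule in this note can be expressed as a short sketch. It assumes that the candidate home is the Datazone (or LSOA) in which a user has the most active evenings, which the note implies but does not state in full. The function name `assign_home` and the zone codes are hypothetical.

```python
from collections import Counter
from typing import Optional


def assign_home(evening_zones: list[str]) -> Optional[str]:
    """Return a user's home zone, or None if no home can be assigned.

    evening_zones holds one zone code per active evening for one user.
    Assumption: the candidate home is the zone with the most active evenings.
    """
    top_two = Counter(evening_zones).most_common(2)
    if not top_two:
        return None  # the user has no active evenings at all
    top_zone, top_count = top_two[0]
    if top_count <= 1:
        return None  # only 1 active evening in the candidate home zone
    if len(top_two) > 1 and top_two[1][1] == top_count:
        return None  # two zones tie on the number of active evenings
    return top_zone


print(assign_home(["Z1", "Z1", "Z2"]))        # clear winner
print(assign_home(["Z1", "Z1", "Z2", "Z2"]))  # tie, so no home assigned
print(assign_home(["Z1"]))                    # single evening, no home assigned
```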