Capturing engagement numbers - strand 1 - annex 4: Pulsar case study
Published 13 March 2026
This report was authored by Jack Medlock, Hannah M. P. Stock, Andrew Knight, Donna Phillips, Adam L. Ozer, and Joseph Stordy at Verian, Dr Michael Sinclair, Dr Craig Macdonald, and Prof Iadh Ounis at The University of Glasgow, and Faculty.
This research was supported by the R&D Science and Analysis Programme at the Department for Culture, Media & Sport (DCMS). It was developed and produced according to the research team’s hypotheses and methods between October 2023 and June 2025. Any primary research, subsequent findings or recommendations do not represent UK Government views or policy.
Executive Summary
The Department for Culture, Media & Sport (DCMS) commissioned a research team led by Verian, a social research agency, to undertake an R&D Science and Analysis project on ‘capturing engagement numbers’. The team also includes data scientists at Faculty AI, a technology business, and academic experts from the Schools of Computer Science and Urban Studies at the University of Glasgow. The purpose of the project is to research, develop and validate the success of new methods to measure engagement with cultural and sporting events and activities. This case study represents an example of that work using social media data to estimate attendance at three different event spaces.
An initial scoping exercise highlighted different types of data that could be used with appropriate modelling techniques to provide estimates. These types included, but were not limited to, mobile app data, transport data, social media data, and aerial photography. Prioritisation criteria were developed in the scoping phase against which individual data sources could be vetted, with those assessed to be of the most use progressing to a stage of deeper analysis and early testing. One of the data sources that progressed to that stage, providing social media data, was Pulsar.
Social media refers to user-generated content placed on social media platforms, such as Facebook, X (formerly known as Twitter) or Instagram. Users can post about news or events that they have participated in, and the digital traces of this can allow social media to be used as a “sensor”. However, participation in events can be virtual, for example, watching a sporting event on TV, rather than attending in person.
Social media platforms offer paid access to the posts of their users (depending on privacy settings) through programmatic Application Programming Interfaces (APIs). However, the APIs vary in complexity and functionality. Instead, social media monitoring companies, such as Pulsar, offer ways to access different social media platforms (X, Instagram, Facebook, Reddit) through a unified dashboard user interface, before later downloading the posts for further processing.
In this study, we examined X/Twitter social data acquired from Pulsar for three events: Farnborough International Airshow (22nd July – 26th July 2024), Scotland vs New Zealand Autumn Nations rugby match (13th November 2022) and Lewes Bonfire Night (5th November 2024). These events were analysed by classifier models - an approach that predicts what class (e.g. attendee or not) an instance (i.e. social media post) belongs to - to predict if the posts were actually about the event and if the post came from in-person attendees. A labelled dataset was developed for each event (around 200 posts per event) to determine the accuracy of the classification approaches; the best approach was then used to estimate the quantity of attendees present on social media and to perform further analysis.
Three aims were used throughout the experimental process to instruct and assess the success of utilising Pulsar data to estimate event attendance.
- Aim 1 was concerned with using machine learning to predict event attendance, following the work of de Lira et al. (2019). The key finding was that the LLM zero-shot classification approach, which used a state-of-the-art instruction-tuned model, was the best classifier of attendance.
- Aim 2 was concerned with applying the best models to identify predicted attendees. Approximately 0.5%-2.5% of attendees were detected across the three events. Learning-to-quantify was further applied on the data (but not using LLM zero-shot), which produced higher and more accurate estimates of attendees.
- For Aim 3, further analysis was performed on the predicted attendees of the events. Many social media users declare a home location, which enables estimation of where the attendees come from. Moreover, it was possible to qualitatively examine the anonymised social media posts to increase understanding – this is a particular advantage of social media over other types of data (e.g. mobile app data) that have been explored in other case studies of this project.
A main barrier to accessing data sources otherwise deemed worthy of more in-depth analysis was the considerable work required to meet data protection requirements, covering the source itself, how it would be accessed, and how it would be used. The other main barrier was cost, with some data sources requiring prohibitive upfront investment. There were also several limitations in the methodology reported in this case study associated with attendance counts. For example, the labelling of social media posts as to whether the author attended an event had to be done by human assessors and, as a consequence, some attendees may not have been identified due to error in human interpretation. Moreover, it is possible that the queries developed to identify attendees of an event were not wide enough to recall all relevant posts by event attendees. The approach in this case study aimed to mitigate these limitations (through methods such as double human review of the labelled posts and previewing of query results); however, future work on estimating attendance at events through social media data should seek to address these barriers and limitations further.
1. Introduction
1.1 Background
The Department for Culture, Media and Sport (DCMS) is exploring new methods to measure engagement with cultural and sporting events and activities to understand and facilitate engagement across its sectors. Currently, the data is mostly collected using traditional methods. At an event-specific level this might be by manually counting attendees. At a broader level, engagement is measured via surveys, which can be expensive and can suffer from low sample sizes and recall bias. On one hand, well-designed and comparatively large surveys, such as DCMS’s Participation Survey and Sport England’s Active Lives Survey, do give robust measures of engagement at a nationally representative level for a range of cultural and sporting activities. This is coupled with the benefit of having captured useful demographic and behavioural information. However, the data provides limited understanding of attendance and participation at local level. The main challenge is with un-ticketed events, particularly those with no point of entry, which might include playing for a local sports team or taking part in a drop-in activity in an open area as part of a larger cultural programme of events.
DCMS commissioned a research team led by Verian, including world-leading data scientists at Faculty AI and academic experts from the Schools of Computer Science & Urban Studies at the University of Glasgow, to undertake an R&D Science and Analysis project on ‘capturing engagement numbers’. This project provides an opportunity to look at new methods (for example, using mobile data, aerial photography, activity data), which could go a long way to strengthening the insights gained from existing methods and filling some of the gaps in information from survey data.
1.2 Research Method Assessment Criteria
To guide the research, the following criteria were developed to underpin the assessment of each method.
- Can robust automatic attendance estimates be calculated, and for what types of events or activities?
- Beyond the attendance estimates, what further detail can the methods describe about users, including demographics and participation?
- For each method, an assessment of the following:
- Accuracy
- Biases
- Ethics
- Deliverability
- Cost
- What issues and challenges could arise during data collection for each method (e.g. sampling biases)?
For the purposes of this project, a method is typically considered a combination of data source and modelling technique to generate an estimate. The success of the method will be measured against agreed criteria, with a subsequent series of adjustments made to improve the accuracy of the estimate. The process of improvement could include the combining of one method with another method (or multiple methods), with the output of one filling the gap left by, or reinforcing the validity of, the first.
1.3 Overview of Strand 1
The approach is split into two strands:
- Strand 1: To develop a comparative framework for different methods of measuring engagement with cultural and sporting events and activities.
- Strand 2: To develop case study examples to test the application of different measurement methods, using mixed methods where appropriate, to measure engagement with cultural and sporting events and activities.
Strand 1 was further divided into a scoping or ‘Breadth’ phase, followed by a ‘Depth’ phase of more in-depth analysis of data sources evaluated as worthy of further investigation and modelling.
Figure 1 shows how Strand 1 breaks down into the Breadth and Depth phases.
At the beginning of Strand 1, a comprehensive scoping exercise was undertaken to identify data sources that could be used. Beyond the existing experience and expertise of the assembled team, a literature review was undertaken reviewing relevant papers from the last 5 years, and stakeholder interviews were conducted with Arms-Length Bodies (ALBs) such as Sport England.
Following the initial scoping phase, the data sources were vetted against the agreed criteria. Some were progressed to a secondary phase wherein the data were acquired, and preliminary experiments involving different modelling techniques were run. This enabled an early assessment on the quality of the method. The final stage of Strand 1 involves additional evaluation and a potential recommendation of a suitable candidate for further experimentation during Strand 2. This report discusses the earlier experiment stage during Strand 1 with Pulsar Social Media data.
1.4 DPIA
Before accessing social media data via Pulsar, both the data controller (DCMS) and the processors (Verian, Faculty AI, University of Glasgow) were required to complete a thorough data protection and cyber security assurance process to receive approval to begin work. This was because of the sensitive nature of working with social media data, with the very small risk of processing incidental pieces of personal information that were declared on a social media post.
This process includes completing a full data protection impact assessment (DPIA), where controllers and processors must be clear about the scope and nature of processing the data concerned (including data type, size and volume), as well as an assessment of risks and appropriate mitigations in place for working with this data. Alongside this form, the consortium supplied DCMS with appropriate assurance on the specification of IT systems and security to give the Cyber-Security team confidence to sign-off processing social media data.
Following the thorough and clear DPIA process and assurance on security infrastructure, the final step in the approvals process was the publication of a privacy notice. This allows the public to be aware of the processing work happening and informs them of their rights in relation to this processing. The ICO was also informed of this work.
1.5 Experimental Aims
The over-arching aim of this project is to develop an alternative method or ‘tool’ for estimating attendance and/or engagement at event spaces. However, to investigate whether social media data could be used for the development of such a tool, three further source-specific aims were developed to instruct the experimental phase of this case study:
- Aim 1: Machine learning training – Train a classifier model to label social media posts based on whether the author attended the event.
- Aim 2: Machine learning attendance quantification – Apply recent advances in machine learning to estimate the number of people represented in the social media datasets who attended each event.
- Aim 3: Analysis of estimated event engagement output – Apply the best machine learning attendance model to analyse the predicted attendees for each event.
2. Pulsar – social media data
2.1 Overview of social media data
Social media data offers the possibility of “social sensing” what is happening in the real world, based on analysing the users’ behaviours and activities in the online world. Users will often make use of social media to report or comment on real-world events. Indeed, it has been previously shown that news can break first on social media (Osborne et al., 2013) and therefore monitoring social media can give a real-time insight of interest in current events.
Social media analysis is already being used by HMG across a range of use-cases, including monitoring disinformation online and whether online conversations spill over to attending real-world events. Interviews with event organisers during the Breadth phase highlighted the potential for social media to help track event attendance given how social media users post about their attendance (and interest) at cultural and sporting events. However, given the changing cultural norms and trends for how (at least) a subset of the population publicly post their location and attendance at events, a comprehensive analysis is required to investigate the viability of social media data for estimating attendance at un-ticketed events.
It is also worth noting the ‘noise’ to which this form of social sensing is particularly susceptible. This can take various forms, such as:
- Implicit messaging - Not every real-world event is reflected on social media, and not every participant necessarily reflects their activity on social media. A significant volume of social media use is implicit, in that users read, but do not explicitly post/comment.
- Variable locationality - Not every user explicitly commenting on a real-world event is physically present at that event.
- Platform bias - Different social media platforms have their own potential bias in terms of users’ populations and demographics, with the usage of social media platforms varying by age group, socio-economic background, location, etc., as well as their specific aims and intricacies (e.g. photo oriented, short text). This means that any social media analysis will have limitations on how far results generalise to the wider population. For example, the Ofcom Adults’ Online Behaviours and Attitudes Survey (2023) (Ofcom, 2023) – shown in Figure 2 below – portrays the differences in age profiles by platform.
- Misuse - Platforms can be affected by spam posts made by bots, as well as misinformation and disinformation spread by hostile agents and state actors. We note that this is less of an issue for cultural events, as the events of focus are less likely to be targeted by such bots.
Figure 2 Distribution of Adult Users using each Social Media Platform – Source (Ofcom Adults’ Online Behaviours and Attitudes Survey 2023 (Ofcom, 2023))
2.2 Prioritisation of Pulsar data
Social media platforms offer paid access to the posts of their users (depending on privacy settings) through programmatic APIs. However, the APIs vary in complexity and functionality. Instead, social media monitoring companies, such as Pulsar, offer a unified way to access social media posts from several platforms, such as Facebook, X (previously known as Twitter) and Instagram.
Of these, X has the widest data accessibility. Posts on the platform are public, and can be identified by searching for:
- Keywords, either individually (e.g. Hogg) or in combinations using Boolean conjunctions (e.g. Scotland AND rugby)
- User account names (@ScottishTeam)
- Hashtags (#SCOvNZL).
In contrast, Facebook and Instagram posts can only be identified when they are made searchable by the addition of hashtags.
The formulation of queries to identify social media posts is best performed by iteratively adding terms, and measuring the impact in terms of numbers of posts identified and the coverage of the target event. In doing so, the aim is to increase the Recall of event-related posts (trying to find as many event-related posts as possible), while not decreasing the Precision (identifying too many posts unrelated to the event).
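The recall/precision trade-off described above can be illustrated with a minimal sketch. The `labelled` posts and the keyword-matching rule below are hypothetical stand-ins, not data from this study; Pulsar's real query engine is richer, but the arithmetic of the two measures is the same.

```python
# Sketch of the recall/precision trade-off when refining a query.
# `labelled` is a hypothetical list of (post_text, is_event_related) pairs;
# a query is simplified here to "match if any keyword appears in the text".

def matches(post: str, keywords: list[str]) -> bool:
    text = post.lower()
    return any(k.lower() in text for k in keywords)

def precision_recall(labelled, keywords):
    retrieved = [(post, rel) for post, rel in labelled if matches(post, keywords)]
    relevant = sum(1 for _, rel in labelled if rel)
    true_pos = sum(1 for _, rel in retrieved if rel)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall

labelled = [
    ("Amazing flying display at the air show today", True),
    ("Farnborough traffic is terrible this morning", False),
    ("Watching #FIA2024 from the runway fence", True),
    ("Just booked a holiday to Spain", False),
]

# A narrow query misses event posts (high precision, low recall); a broad
# one picks up unrelated posts (high recall, lower precision).
print(precision_recall(labelled, ["FIA2024"]))                             # narrow
print(precision_recall(labelled, ["Farnborough", "air show", "FIA2024"]))  # broad
```

Iterating between these two extremes, while watching the estimated post counts, is exactly the refinement loop used in Section 3.1.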
Instagram access is also facilitated by the Pulsar platform. Instagram is primarily driven by visual media (photos, videos), and is used more for personal usage, compared to Twitter/X in recent years. For this reason, adding Instagram to the scope of the study was of interest. However, as mentioned above, on Instagram (like Facebook), posts by individuals are more likely to be private, since they are only searchable when hashtags are included in the content of posts.
Instagram, like Facebook, is owned by Meta, which has tightened restrictions on access to its platforms since the Cambridge Analytica scandal. As a result, to access Instagram, Pulsar requires an API key, obtained from an Instagram Pro account with an associated Facebook page. As Pulsar is primarily concerned with brand monitoring for companies, the expectation is that the clients of Pulsar have existing Facebook and Instagram pages. Attempts to set up a new Instagram Pro account and associated Facebook page required using a new phone number not associated with any existing Instagram accounts, and led to the account being suspended. Three attempts, using three different phone numbers not previously used with Instagram, led to the same outcome. This is a failing in the Meta bot detection algorithms, and not through any misbehaviour within this study – indeed the workflow implemented here was not distinctly different from that of any new company wishing to set up Facebook and Instagram presences.
The contract with Pulsar was £7500 for a 6-month duration with a limit of 200,000 mentions a month, i.e. up to 1.2 million mentions. This agreement was a bespoke one, negotiated by the project team with Pulsar, who ordinarily insist on a longer-term arrangement (1 year or longer). No other social media monitoring companies were considered as alternatives to Pulsar, due to Pulsar’s existing relationship with DCMS and consortium member Faculty AI. Pulsar’s costs are also favourable compared to purchasing the Twitter API directly, which costs $5000/month for ingesting 1,000,000 posts per month.
2.3 Pulsar data: Content and format
In general, all forms of social media provide three types of data:
- Metadata – the user who wrote the post, the timestamp, some engagement statistics (likes, reposts/shares etc), and sometimes geo-location information about the post.
- Media – images or videos included in the post (often the URL).
- Text – any natural language comment associated to the post.
When the data is fetched from Pulsar, it is obtained in the form of a JSON object. An example is shown in Figure 3 below. This is easily parsed.
Figure 3 Example output from Pulsar API for a Twitter social media post.
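As a sketch of how such a JSON object is parsed, the snippet below walks a hypothetical post. The field names (`content`, `createdAt`, `engagement`, `location`) are illustrative assumptions, not the actual Pulsar schema shown in Figure 3.

```python
import json

# Hypothetical example of parsing one post returned from a social media API.
# Field names are illustrative; the actual Pulsar schema may differ.
raw = """
{
  "id": "1591234567890",
  "content": "Great first half at Murrayfield! #AutumnNationsSeries",
  "createdAt": "2022-11-13T15:42:00Z",
  "engagement": {"likes": 12, "reposts": 3},
  "location": {"city": "Edinburgh", "country": "GB"}
}
"""

post = json.loads(raw)
text = post["content"]
likes = post["engagement"]["likes"]
city = post.get("location", {}).get("city")  # location may be absent on some posts
print(text, likes, city)
```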
Setting up Pulsar to search for posts about a new event is relatively easy. First, the social media source is identified (see Figure 4 below). Next, the search query is formulated – the method used to formulate queries is deliberately described in detail in Section 3 below. Finally, Pulsar provides an overview of the posts that were returned, as well as some statistics (see Figure 5 and Figure 6 below).
Figure 4: Setting up Pulsar for a new Search
Figure 5: Pulsar Overview of Posts for a Query
Figure 6: Pulsar Overview of Statistics
2.4 Overview: Use of Pulsar data
Figure 7 Overview of Pulsar Data Collection and Processing
The Strand 1 experiments address Data Collection and Data Processing, as illustrated in Figure 7. Data Collection is concerned with curating a dataset of event posts, and then manually labelling them to serve as training and evaluation data for data processing strategies. In Section 3, further details are provided on the first two steps in our analytical process which concern the Data Collection Methodology, specifically:
- Curate an event posts dataset: Use the Pulsar API to access social media data, process this data into a consolidated dataset.
- Label a sub-set of social media posts, in terms of “relevant to the event” and “physically attended the event” to generate ‘true positives’.
Social media analytics involves leveraging various techniques to understand and interpret data generated on social media platforms. In Section 4, a description is provided of the third stage in our process, the Data Processing Methodology, which addresses supervised, 0-shot, and k-shot methods. These methods were built upon those previously developed by the University of Glasgow, described by de Lira et al. (2019). The methods were further extended using more modern AI-based techniques (including using an LLM as a classifier).
- Automatically identifying attendance from social media posts: Use the labelled dataset to train multiple models and compare results.
Finally, in Section 5, the results obtained are described for:
- Quantifying attendance: Use the best performing model to label the entire dataset of social media posts to identify attendance from social media data.
- Analysing attendance: Conduct analysis to understand how many people engaged with the event despite not attending in person, and the geographic locations the posts came from.
3. Data Collection Methodology
A methodology was developed to predict attendance at three sporting and cultural events using social media data. These test events were:
- Scotland vs New Zealand Autumn International Rugby Fixture, Murrayfield, Edinburgh, 13th November 2022.
- Farnborough International Airshow, Farnborough International Exhibition & Conference Centre, 22nd July – 26th July 2024.
- Lewes Bonfire, Lewes, 5th November 2024.
These events were selected to:
- address both sporting and cultural interests;
- address events of interest to the general public (e.g. sport or fireworks) versus those with more industrial interest (c.f. the tradeshow elements of Farnborough International Airshow);
- test differing levels of virtual vs. physical participation – for instance, social media engagement with Farnborough International Airshow was anticipated to be by attendees, while the rugby would have significant social media commentary from fans that did not physically attend the game (e.g. watched on TV);
- test an un-ticketed event (Lewes Bonfire Night).
An overview of the data collection methodology developed to estimate attendance at these events is shown below in Figure 8. Sections 3.1 to 3.4 will describe each step of the methodology in sequence.
Figure 8: Overview of data collection from Pulsar.
3.1 Querying Pulsar (Key Word Search)
As highlighted in the overview (c.f. Pulsar Data Overview in Section 2.1), the Pulsar platform offers a querying interface that allows the user to develop queries combining keywords with Boolean operators such as AND and OR, as well as date and location constraints. Figure 9 below shows an example of a query constructed for Farnborough International Airshow.
As mentioned before, an iterative process of refinement was followed when constructing the queries. The general aim was to construct a query that would strike an optimal balance between recall and precision, with recall accounting for event-related relevant posts (not necessarily focussing on attendees), and precision being focused on reducing non-relevant posts. At each stage, the estimated number of posts that a query would obtain was checked.
The process was initiated by a search for related posts on social media, where relevant words or phrases that attendees may use (e.g. ‘Farnborough’, ‘air show’) were identified. Hashtags were also added (e.g. #FIA2024) to ensure that posts using these hashtags were identified; the # symbol need not be used in the query. Similarly, @redarrows was added, but the @ symbol was not prepended. This ensured that any post by @redarrows, as well as replies to or mentions of @redarrows, was included in the query output. Figure 9 shows the final query (a), as well as the estimates for different query formulations (b).
Figure 9 Query and estimates for Farnborough International Airshow. (a) Example of final query constructed for Farnborough International Airshow. (b) Estimated number of posts identified by different formulations.
Table 1 shows an example of iterative refinement for the Scottish rugby event. At each stage, the estimated number of posts is recorded. The general aim of each iteration is to ensure that words or combinations of words that are too broad (i.e. not discriminative enough) are excluded or refined. For instance, at iteration 3, Scotland alone was added, which brought in too many posts unrelated to the rugby event. Instead, for iteration 4, Scotland could only appear in a rugby context. Moreover, “AND LOCATION GB” was removed, as it was found to restrict results to users that declared a hometown location that Pulsar could recognise as being in the UK.
Table 1: Methodical formulation of query for Scottish rugby event.
| Stage | Number of Posts Recorded | Search Terms | Comments |
|---|---|---|---|
| 1 | 1.5k | (rugby AND scotland OR murrayfield OR “scotland vs new zealand” OR sconz OR @scotlandteam) AND (LOCATION GB) AND NOT RETWEET | |
| 2 | 37k | (rugby OR scotland OR murrayfield OR “scotland vs new zealand” OR sconz OR @scotlandteam) AND (LOCATION GB) AND NOT RETWEET | Compared to (1), don’t require rugby to appear. This picks up anything Scottish |
| 3 | 141k | (rugby OR scotland OR murrayfield OR “scotland vs new zealand” OR sconz OR @scotlandteam) AND NOT RETWEET | Don’t filter for GB. Too wide – any Scottish tweets |
| 4 | 2.8k | ((rugby OR 🏉) AND (scotland OR 🏴)) AND NOT RETWEET | Scotland but only in a rugby context |
| 5 | 8k | @scotlandteam OR murrayfield OR MurrayfieldStad OR “scotland v new zealand” OR “scotland vs new zealand” OR scovnz OR scovnzl OR @scotlandteam OR asone OR autumninternationals OR AutumnNationsSeries OR ScottishRugby AND NOT RETWEET | All known Scottish rugby-related handles/hashtags; autumninternationals OR AutumnNationsSeries would likely bring in other events that weekend. No location filter – definitely rugby, but could be other countries. |
| 6 | 3k | autumninternationals OR AutumnNationsSeries AND NOT RETWEET | Not all tweets would mention Scotland. |
| 7 | 3k | (rugby OR 🏉 OR autumninternationals OR AutumnNationsSeries) AND (scotland OR 🏴) AND NOT RETWEET | Combines (6) and (4). |
| 8 | 4.7k | @scotlandteam OR murrayfield OR MurrayfieldStad OR “scotland v new zealand” OR “scotland vs new zealand” OR scovnz OR scovnzl OR scotlandteam OR asone OR ScottishRugby AND NOT RETWEET | This means that from the 8k in (5) above, 3.3k was indeed not specifically related to Scotland. |
| 9 | 8.6k | (((rugby OR 🏉 OR autumninternationals OR AutumnNationsSeries) AND (scotland OR 🏴) ) OR (scotlandteam OR murrayfield OR MurrayfieldStad OR “scotland v new zealand” OR “scotland vs new zealand” OR scovnz OR scovnzl OR @scotlandteam OR asone OR ScottishRugby)) AND NOT RETWEET | Union of (7) & (8). |
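Pulsar evaluates these queries server-side, but the Boolean structure of the final formulation (iteration 9 in Table 1) can be sketched as a local matcher. The term lists below are a simplified subset of the real query, and the matching rule (case-insensitive substring search) is an assumption for illustration only.

```python
# Simplified local illustration of the final rugby query's structure:
# (rugby-context AND scotland-context) OR known-handle, AND NOT RETWEET.
# Term lists are a subset of the real query in Table 1.

RUGBY_TERMS = ["rugby", "🏉", "autumninternationals", "autumnnationsseries"]
SCOTLAND_TERMS = ["scotland", "🏴"]
HANDLE_TERMS = ["scotlandteam", "murrayfield", "scovnz", "scovnzl",
                "scottishrugby", "scotland v new zealand",
                "scotland vs new zealand"]

def matches_query(post: str, is_retweet: bool) -> bool:
    text = post.lower()
    has = lambda terms: any(t in text for t in terms)
    in_rugby_context = has(RUGBY_TERMS) and has(SCOTLAND_TERMS)
    mentions_handle = has(HANDLE_TERMS)
    return (in_rugby_context or mentions_handle) and not is_retweet

print(matches_query("What a try for Scotland! #rugby", is_retweet=False))  # matches
print(matches_query("Beautiful day in Scotland", is_retweet=False))        # too generic
```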
3.2 Ingestion and Data Anonymisation
Having collected the social media posts from the queries via the Pulsar platform, the posts were then ingested into the Faculty.AI secure platform. The final number of social media posts obtained for each event is discussed in Section 5.1.
It is of note that only posts in the English language from the ingested datasets were selected. Language detection was carried out by the Pulsar platform.
To ensure the privacy of users, data was downloaded from Pulsar in a programmatic manner, through the Pulsar API, without viewing it. This programmatic download also applied a one-way anonymisation process, aiming to remove personally identifiable information, specifically the user handles. These were replaced by an “irreversible hash” of the content (such that the same user handle would get the same hash). Mentions of user handles of users other than those listed in the query were also hashed. An example of a user hash is @ee5378db529d0a579b6368d080eeca5d698c66ad.
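The anonymisation step can be sketched as follows. The report does not state the exact algorithm; a keyed HMAC-SHA1 is assumed here because the 40-hex-character example hash above is consistent with a SHA-1 digest, and for simplicity this sketch hashes every mention rather than distinguishing query-listed handles.

```python
import hashlib
import hmac
import re

# Sketch of one-way handle anonymisation. HMAC-SHA1 with a secret key is an
# assumption, not the documented method; the key must be kept secret and then
# discarded so that the handle-to-hash mapping cannot be reversed or rebuilt.
SECRET_KEY = b"replace-with-a-random-key"  # hypothetical

def anonymise_handle(handle: str) -> str:
    digest = hmac.new(SECRET_KEY, handle.lower().encode(), hashlib.sha1)
    return "@" + digest.hexdigest()

def anonymise_post(text: str) -> str:
    # Replace every @mention with its irreversible hash. The same handle
    # always maps to the same hash, so per-user analysis remains possible.
    return re.sub(r"@(\w+)", lambda m: anonymise_handle(m.group(1)), text)

print(anonymise_post("Thanks @redarrows for the display! cc @someuser"))
```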
3.3 Data Sampling Approach for Labelling
For evaluating the models (and seeing the benefit of their training), labelled subsets of the data were required, i.e. known relevant posts (about the event) where the user attended the event. In de Lira et al. (2019), a reasonable ground truth was obtained by using geolocations alongside human assessment of a selection of posts. However, accurate geolocation is no longer available on X (the functionality was disabled by Twitter in 2019) – posts can contain city-level location information at best.
As there were thousands of posts for events, a pooling strategy (Voorhees & Harman, 2005) was adopted for identifying a representative sample of posts to assess, which have a chance of being relevant. This is a standard method widely used, for example, by the US National Institute of Standards and Technology in its TREC search engine ‘coopetitions’. Pooling assumes that there are already some pre-existing methods that are likely to correctly identify some attendees’ posts. Firstly, the posts predicted by each model were ranked from most confident to least confident. To form the assessment pool, the top k posts from each method were then selected. The sampling method was further stratified through the assessment of some posts further down the rankings (Yilmaz et al., 2008).
For each of the three events, a labelled dataset of 200 English-language social media posts was created. Two initial classification models (discussed more in Section 4.1) were employed to assign preliminary attendance probabilities to the posts, specifically a gradient boosting classifier and a zero-shot BART model. Then, posts were sampled based on the following probability thresholds:
- 50 posts where both models predicted a >85% likelihood of attendance.
- 50 posts where both models predicted a <15% likelihood of attendance.
- 30 posts where either model predicted a >85% likelihood of attendance.
- 30 posts where either model predicted a <15% likelihood of attendance.
- 40 posts where both models predicted a likelihood between 15% and 85%.
This stratified pooled sampling approach ensures a good use of assessors’ efforts to identify positive posts, while also providing a reasonably well-distributed dataset for model training.
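The five sampling rules above can be sketched in code. Because the rules overlap (a post where both models are confident also satisfies the "either model" rule), a precedence order with the both-model buckets first is assumed here; the model probabilities and post identifiers are synthetic.

```python
import random

# Sketch of the stratified pooling rules from Section 3.3. Each post carries
# attendance probabilities from the two preliminary models; bucket precedence
# (both-model rules before either-model rules) is an assumption.

def bucket(p1: float, p2: float) -> str:
    if p1 > 0.85 and p2 > 0.85:
        return "both_high"    # 50 posts
    if p1 < 0.15 and p2 < 0.15:
        return "both_low"     # 50 posts
    if p1 > 0.85 or p2 > 0.85:
        return "either_high"  # 30 posts
    if p1 < 0.15 or p2 < 0.15:
        return "either_low"   # 30 posts
    return "both_mid"         # 40 posts

QUOTAS = {"both_high": 50, "both_low": 50, "either_high": 30,
          "either_low": 30, "both_mid": 40}

def sample_for_labelling(posts, seed=0):
    """posts: list of (post_id, p_model1, p_model2). Returns sampled ids."""
    rng = random.Random(seed)
    buckets = {name: [] for name in QUOTAS}
    for post_id, p1, p2 in posts:
        buckets[bucket(p1, p2)].append(post_id)
    sample = []
    for name, quota in QUOTAS.items():
        sample += rng.sample(buckets[name], min(quota, len(buckets[name])))
    return sample
```

With full buckets this yields the 200-post labelling set; sparse buckets simply contribute fewer posts.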
3.4 Labelling Training Dataset
Two human assessors (labellers) independently reviewed each dataset, with a 40-post overlap to assess concordance of judgements. Each post was labelled in two categories:
- Whether the post was about the event.
- Whether the post indicated that the person attended the event.
Any discordance between the labels was decided upon by a third reviewer. The statistics of the obtained datasets and discordance between reviewers is discussed later in Section 5.1.
Finally, for each of the three labelled event datasets, the data was split into training (80%) and testing (20%) sets. Section 4 includes discussion of the approaches developed for event attendance classification.
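The two mechanics just described, measuring labeller concordance on the 40-post overlap and splitting 80/20, can be sketched as below. The report does not name an agreement statistic; Cohen's kappa is a common choice and is assumed here purely for illustration.

```python
import random

# Sketch of (a) Cohen's kappa for agreement between two labellers, and
# (b) the 80/20 train/test split. Kappa is an assumed choice of statistic.

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each labeller's marginal positive rate.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

def train_test_split(items, train_frac=0.8, seed=0):
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(200)))
print(len(train), len(test))  # 160 and 40
```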
4. Data Processing Methodology
This section describes the techniques applied to address the aims identified in Section 1, namely:
- Machine Learning for Event Attendance (Section 4.1)
- Quantify how many people attended the event (Section 4.2)
- Analyses of Event Attendance and Engagement (Section 4.3)
4.1 Aim 1: Machine Learning for Event Attendance
Figure 10: Overview of Classification Approaches
The goal was to train a classifier to label social media posts based on whether the post’s author attended the event. Three classification approaches were explored:
- Gradient Boosting Classifier – a traditional supervised machine learning classifier
- Setfit – a supervised language model classifier
- Large Language Model (LLM) – an instruction-tuned model (like ChatGPT) that can be asked to perform tasks without further supervised training.
Figure 10 above provides an overview of the methodology applied for classification.
4.1.1 Gradient Boosting Classifier
Overview
This approach leveraged the existing technology of de Lira et al. (2019). Following their experiments, the core machine learning algorithm used is a gradient boosting model: an ensemble model in which each new tree is trained to address the errors (gradients of the loss function) of the previous models. An example tree is shown in Figure 11 below.
Figure 11: An example tree. A gradient boosted model learns many such trees, where the nth tree aims to address the errors of the previous tree(s). The data X is split at each node based on the value of a given feature; for the root (first) node in the diagram, the split X7 > 2 partitions the data according to whether the 7th feature is greater than 2.
The gradient boosting model was configured using two parameters:
- Max tree depth: the number of nodes or splits from the root node to the final leaf node. A max tree depth of 3 was used.
- Number of trees: the number of individual trees trained to form the gradient boosting model. 100 trees were used (Friedman, 1999).
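The boosting principle, in which each new tree fits the residual errors of the ensemble so far, can be illustrated with a toy pure-Python sketch using depth-1 “stump” trees and squared loss. This is a pedagogical illustration, not the project’s classifier, which used full trees of depth 3 and 100 estimators.

```python
def best_stump(xs, residuals):
    """Find the threshold split on 1-D data minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=100, lr=0.1):
    """Gradient boosting with stumps: each stump fits the residuals
    (negative gradients of squared loss) of the current ensemble."""
    stumps = []
    pred = [0.0] * len(xs)
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(ys, pred)]
        s = best_stump(xs, resid)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)
```

With a learning rate of 0.1, each round removes a fraction of the remaining residual, so the ensemble converges towards the training targets as trees are added.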
Text processing and vectorisation
To prepare the text data for the machine learning model, two approaches were used to transform the text into numerical representations that captured semantic information about the original text:
- Bag of words: This method created a count of each word or pair of words that appeared in a post. To reduce sparsity, the words underwent lemmatisation to standardise different variations of words; for example, eats, eaten and eating are transformed to eat. The challenge with this representation is that the wording of positive (i.e. attendee) posts in the training data must be similar to that in the unseen (test) data for the model to learn to accurately identify positive examples.
- Word2vec: This approach represented each word as a vector in an embedding space, such that semantically related words are positioned closer together (Mikolov et al., 2013). For each post, the output was an n x 300 array, where n was the number of words. This was then aggregated into a 1 x 300 array by taking the maximum value across the word vectors. This provides a numerical representation of the posts with more semantic information encoded compared to the bag of words approach. The embedding approach is explored further later in this section by means of a sentence transformer – an approach that creates an embedding from the entire sentence rather than an aggregated embedding of each word in the sentence (Barken et al., 2020). For further definitions see Appendix 7.1.
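The max-aggregation step described above can be sketched as follows (illustrative only, using small vectors in place of the 300-dimensional word2vec embeddings):

```python
def aggregate_max(word_vectors):
    """Collapse an n x d list (one d-dimensional vector per word) into a
    single 1 x d post vector by taking the element-wise maximum."""
    dims = len(word_vectors[0])
    return [max(vec[i] for vec in word_vectors) for i in range(dims)]

# Two 3-dimensional "word vectors" collapse into one post vector
post_vector = aggregate_max([[1.0, 0.0, 3.0], [2.0, 5.0, 0.0]])
```

Other pooling choices (mean, sum) are possible; the element-wise maximum is the aggregation the report describes.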
Additional variables
In addition to the text-based vector representations of text, the following features were included in the model:
- Number of URLs – may indicate photos (indicating in-person attendance), or links to longer articles.
- Number of hashtags – large numbers may indicate industry/official accounts.
- Number of emojis – may indicate opinion about an event.
- Number of followers – may indicate industry/news accounts rather than attendees.
- Number of friends – as above.
- Ratio of friends and followers – as above.
- Indicator of links to other social media sites (Instagram, Facebook, YouTube and Foursquare).
- Days between the post and the event – the prior expectation that a post is about an event decreases as the time between the post and the event date increases.
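A hypothetical helper computing these additional variables from a post might look like the following. The function name, the URL and hashtag patterns, and the crude Unicode-range emoji heuristic are all illustrative assumptions, not the project’s implementation.

```python
import re
from datetime import date

def post_features(text, followers, friends, post_date, event_date):
    """Compute the handcrafted features listed above for a single post.
    The emoji check is a rough Unicode-range heuristic (assumption)."""
    urls = len(re.findall(r"https?://\S+", text))
    hashtags = len(re.findall(r"#\w+", text))
    emojis = sum(1 for ch in text if 0x1F300 <= ord(ch) <= 0x1FAFF)
    ratio = followers / friends if friends else 0.0
    days = abs((post_date - event_date).days)
    return {"n_urls": urls, "n_hashtags": hashtags, "n_emojis": emojis,
            "followers": followers, "friends": friends,
            "follower_friend_ratio": ratio, "days_from_event": days}
```

These numeric features would simply be concatenated with the text-vector representation before training the gradient boosting model.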
Training
The machine learning model was trained using the labelled 80% training dataset, leveraging the best-performing hyperparameters from the previous study by de Lira et al (2019).
4.1.2 Setfit
Overview
A Setfit methodology was used for fine-tuning transformers. Sentence transformers, such as SBERT, are similar to word2vec in that they embed, or numerically represent, text content in a way that conserves semantic meaning (Barken et al., 2020). However, they build on this by using a transformer architecture to embed whole sentences. These transformers can be fine-tuned for specific NLP tasks with small amounts of data using contrastive learning. In practice, this means that rather than learning to predict a label for a given input, the model takes two inputs and alters the embedding space to pull them closer together if they share a label and push them further apart if they do not. Setfit has three main advantages:
- It does not require prompt tuning (the development and refinement of instructions, or prompts, used as input to an LLM to produce an optimal output).
- It requires very small amounts of data to be fine-tuned (roughly 8 samples).
- It achieves high performance given its size (Zhou et al., 2024).
Sentence transformers
Setfit uses a pre-trained sentence transformer. For this task, the MPNET Base V2 sentence transformer was used, as this model has shown high performance on natural language tasks compared to other sentence transformer models such as RoBERTa (Siino, 2024).
Training
The model was trained to predict attendance for each event using the training set, which was split into 80:20 training and validation sets for hyperparameter optimisation. During training, the following hyperparameter options were explored:
- Learning rate: how much the model updates the weights after each training iteration (options: 1e-6, 1e-3).
- Number of epochs: the number of times the entire dataset is cycled through in training (options: 1-3).
- Batch size: the number of learning samples seen before the model parameters are updated (options: 16, 32, 64).
The best performing run is determined using accuracy on the validation set, and the optimal hyperparameters are in turn used to train the Setfit model.
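The search described above amounts to an exhaustive grid over the three option sets, scored by validation accuracy. A minimal sketch follows, assuming stand-in `train_fn` and `eval_fn` callables rather than the real Setfit training and evaluation calls:

```python
from itertools import product

def grid_search(train_fn, eval_fn):
    """Exhaustive search over the hyperparameter grid described above.

    train_fn(lr, epochs, batch) returns a trained model;
    eval_fn(model) returns validation accuracy.
    Returns the best configuration and its accuracy.
    """
    grid = product([1e-6, 1e-3],   # learning rates
                   [1, 2, 3],      # epochs
                   [16, 32, 64])   # batch sizes
    best_acc, best_cfg = -1.0, None
    for lr, epochs, batch in grid:
        acc = eval_fn(train_fn(lr, epochs, batch))
        if acc > best_acc:
            best_acc, best_cfg = acc, (lr, epochs, batch)
    return best_cfg, best_acc
```

With 2 × 3 × 3 = 18 configurations, exhaustive search is cheap here; larger grids would motivate random or Bayesian search instead.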
4.1.3 Large Language Model
Overview
Large Language Models (LLMs) are a family of models based on a transformer neural network architecture (Zhou et al., 2024), which allows for better comprehension of text by interpreting individual words in the context of the broader sentence. Transformer architectures also enable more efficient training, facilitating the generation of much larger models (Vaswani, 2017). Current state-of-the-art LLMs can perform a wide array of natural language processing tasks, such as classification, text generation and summarisation, driven by an input prompt alone, without the need for further fine-tuning (Wei et al., 2022).
Models
For this task, the Llama 3.1 8-billion-parameter Instruct model was selected. The Llama 3 family of models, developed by Meta, performs strongly on NLP tasks for its size (Dubey et al., 2024) and is also ‘open source’ (although Meta does not disclose the corpora used to train the models).
Prompt tuning
As noted above, LLMs do not require fine-tuning or training on a classification task. Their performance on a given task depends upon the input prompt. For this reason, prompts are iteratively improved to achieve better results (the prompts used in this investigation are provided in Appendix 7.2). Two different prompting methodologies were explored:
- Zero-shot prompt: a prompt including only the instructions for the task to be completed, e.g. “Below you will see a social media post, your task is to classify it as yes or no if the person attended the event”. No examples are included.
- Few-shot prompt: this approach builds on the zero-shot prompt by including examples of social media posts and their corresponding labels in the prompt. This deploys the in-context learning ability of LLMs – an ability to recognise patterns in labelled examples included in the prompt and apply them to an unseen example (Mann et al., 2020).
Prompts were manually tuned and updated using the training dataset and evaluated using the test dataset.
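The two prompting styles can be sketched as simple string builders. The wording below is hypothetical, illustrating only the structural difference between the styles; the actual prompts used are given in Appendix 7.2.

```python
def zero_shot_prompt(post, event):
    """Zero-shot prompt: task instructions only, no examples."""
    return (f"Below you will see a social media post. Your task is to "
            f"classify it as yes or no: did the author attend {event}?\n\n"
            f"Post: {post}\nAnswer:")

def few_shot_prompt(post, event, examples):
    """Few-shot prompt: prepend labelled (post, label) examples so the
    model can use in-context learning."""
    shots = "\n\n".join(f"Post: {p}\nAnswer: {lbl}" for p, lbl in examples)
    return (f"Classify each post as yes or no: did the author attend "
            f"{event}?\n\n{shots}\n\nPost: {post}\nAnswer:")
```

Both prompts end with `Answer:` so that the model’s next token can be read directly as the classification.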
4.1.4 Transfer Learning
It was necessary to examine how well models trained on one event could be applied to different events. To achieve this, the original gradient boosting classifier model, trained on the original Creamfields festival data (de Lira et al. 2019), was applied to the three new events. As it is likely that the way users communicate their attendance on social media differs between festival goers and those attending the test events, the expectation was that the models would not perform as well as those trained using data from the same event.
4.1.5 Model Evaluation
To determine the best model approach, the three models were applied to the test data and several measures of classification performance were examined:
- Accuracy: proportion of posts accurately labelled. (Number correct / total)
- Precision: proportion of posts labelled as positive that are true positives. (True positives / (true positives + false positives))
- Recall: proportion of all positive examples that were labelled as positive. (True positives / (true positives + false negatives))
- F1 score: harmonic mean of precision and recall. ((2 × precision × recall) / (precision + recall))
- Balanced Accuracy: arithmetic mean of recall and specificity, where specificity = true negatives / (true negatives + false positives), i.e. the proportion of actual negatives that the model correctly identified. Balanced Accuracy is easier to quickly interpret than other metrics, as 0.5 is akin to a “random classifier”, and any attained performance greater than 0.5 is thus an improvement over random. ((recall + specificity) / 2)
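All five measures can be computed directly from confusion-matrix counts; a minimal sketch:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five evaluation measures above from the counts of
    true/false positives and true/false negatives."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    balanced_accuracy = (recall + specificity) / 2
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1,
            "balanced_accuracy": balanced_accuracy}
```

On an imbalanced dataset such as the Scottish Rugby labels, plain accuracy can be high while recall is zero, which is exactly why Balanced Accuracy is the headline measure in Table 4.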
4.2. Aim 2: Quantify how many people attended the event.
This aim is concerned with using the output of classifiers to quantify attendance. Figure 12 provides an overview of this process.
Figure 12: Overview of Quantification
4.2.1 Overview
As previously noted, the over-arching aim of this study was to estimate the number of people represented in the social media datasets who attended each event. To achieve this, it was critical to implement a quantification approach—a recent supervised machine learning technique that directly estimates the relative frequencies of classes (e.g., “attended” vs. “not attended”) in unlabelled datasets with the help of supervised learning, rather than labelling each individual post (Gonzalez et al., 2017). In general, quantification can be applied to any task that deals with data points whose membership in a class is uncertain, i.e., would require classification via supervised machine learning, and where the goal is to estimate not which class an individual data point belongs to, but how many data points belong to a given class. Quantification has been shown to consistently outperform the standard “Classify and Count” (CC) method (including when state-of-the-art classifiers are used) when dealing with aggregate data objectives. The standard CC method can be inaccurate, mainly because of (1) the classifier’s possible bias and (2) the presence of dataset shift (Alaiz-Rodríguez and Japkowicz, 2008). Quantification has been shown to be more robust against classifier bias, as well as against dataset shift, since it does not rely solely on class probabilities.
Two quantification methods were investigated:
- Simple count: directly counts the raw classifier’s output. This corresponds to the standard CC method discussed above and was used by de Lira et al. (2019).
- Adjusted count (Forman, 2005): adjusts the classifier count based on the learned true positive and false positive rates. This method aims to correct the bias in the ‘Simple count’ method. It integrates classification and counting to estimate the class quantities in a dataset: a classification step first assigns class labels to data points, followed by a counting step that estimates class counts.
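The adjusted count correction can be sketched directly from Forman’s (2005) formula. The observed positive rate is approximately tpr × p + fpr × (1 − p), where p is the true prevalence, so p can be recovered as (observed rate − fpr) / (tpr − fpr), clipped to a valid proportion:

```python
def adjusted_count(n_predicted_positive, n_total, tpr, fpr):
    """Forman's (2005) adjusted count: correct the raw classify-and-count
    estimate using the true/false positive rates measured on held-out
    labelled data. Returns the adjusted number of positives."""
    if tpr <= fpr:
        raise ValueError("classifier must beat chance (tpr > fpr)")
    observed_rate = n_predicted_positive / n_total
    adjusted_rate = (observed_rate - fpr) / (tpr - fpr)
    adjusted_rate = min(1.0, max(0.0, adjusted_rate))  # clip to [0, 1]
    return adjusted_rate * n_total
```

For a perfect classifier (tpr = 1, fpr = 0) the correction leaves the raw count unchanged; for an imperfect one it inflates or deflates the count towards the true prevalence, which is the behaviour seen in the final column of Table 6.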
4.2.2 Approaches
Count using the best-performing classification model
Using the best-performing classification model from previous steps, the remaining unlabelled datasets were labelled to estimate the total number of attendees. First, the model was applied to classify each post in the full dataset, which determined how many posts indicate attendance. Then, unique user IDs were identified to avoid counting multiple posts from the same individual.
Adjusted count Quantification
Furthermore, an adjusted count quantifier was applied based on the gradient boosting classifier used in Aim 1. As described above, this method leverages the true positive and false positive rates learned during model training and evaluation to adjust the raw count provided by the classifier, in order to provide a more accurate estimate of attendance. The adjusted count quantifier is trained and evaluated on the test dataset before being applied to the entire dataset.
4.3 Aim 3: Analyses of Event Attendance and Engagement
Engagement vs Attendance
To determine the relationship between the number of users posting about the event and the number of users indicating event attendance, the best-performing model from Aim 2 is used. This model classifies whether each post in the remaining dataset relates specifically to the event. The predictions then provide the data needed to analyse the ratio of users discussing the event versus those predicted to have attended it.
Location analysis
The Pulsar data includes country and city information for a subset of users. Counts were identified for each country with English language posts, and each city within the UK. This geographical data was plotted to visualise the distribution of users who engaged with the event, as well as those predicted to have attended. This analysis offered insights into regional interest and participation levels for the event.
Analysis of Followers and Friends
The Pulsar data includes the number of other X accounts each user in the dataset is followed by (followers), as well as the number of accounts that the user is following (friends).
In previous research, the ratio between these followers and friends (followers/friends) enabled the categorisation of the types of users in the dataset into three main groups (Peng and Liling, 2024; Weiwei et al., 2018):
- User accounts with a high ratio of followers to friends, referred to as information sharers or sources;
- User accounts with a roughly equal number of friends and followers, referred to as friend users;
- User accounts with a significantly lower number of followers than friends, referred to as information seekers.
Categorising accounts this way aided a better understanding of the users in the dataset who engaged with or participated in the selected events. For example, a higher follower/friend ratio could indicate that a given account belongs to a company or large organisation. To explore this concept, the total number of followers for each user who produced a post labelled as having attended the event was identified, as well as their follower/friend ratio. Finally, a Mann-Whitney U test was implemented to analyse whether there is any statistical difference in the representation of these three user types across the three test events.
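The three-way categorisation can be sketched as below. The cut-off of a 2:1 ratio is an illustrative assumption for this sketch; the cited studies choose their own thresholds.

```python
def user_type(followers, friends, band=2.0):
    """Categorise an account by its follower/friend ratio.

    band: accounts with ratio > band are 'information sharers',
    ratio < 1/band are 'information seekers', and the rest are
    'friend users'. The default of 2.0 is an illustrative choice.
    """
    ratio = followers / friends if friends else float("inf")
    if ratio > band:
        return "information sharer"
    if ratio < 1.0 / band:
        return "information seeker"
    return "friend user"
```

Applying this over all predicted attendees yields the per-event distributions that the Mann-Whitney U test then compares.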
5. Results
In this section, the collected social media posts and the subset of labelled data (following the methodology described in Section 3) are reported. Thereafter, the results and analysis of the three research aims are described.
5.1 Collected and Labelled Data for Each Event
Table 2 below describes the number of posts collected for each of the three events. Among the events, Lewes Bonfire Night was notably the smallest. In all cases, a few users posted multiple times about an event, while most users posted once. The distribution of the number of posts per user for each event is provided in Appendix 7.3.
Table 2: Statistics on collected social media posts for each event
| Event | Number of Posts | Number of English Language Posts | Number of Unique Users that Posted | Number of Posts Selected for Labelling |
|---|---|---|---|---|
| Farnborough International Airshow | 20255 | 16163 (79.8%) | 8285 | 200 |
| Scottish Rugby Autumn Internationals | 13462 | 11114 (82.6%) | 6356 | 200 |
| Lewes Bonfire Night | 2044 | 1878 (91.9%) | 1277 | 200 |
As described before, for each event, 200 posts were selected to be labelled by two human assessors, following the pooling methodology described in Section 3.4. Table 3 below reports the statistics of the obtained labels. For each event, each assessor judged 120 posts; 40 of these were labelled by both assessors.
Table 3: Statistics of the 200 labelled social media posts for each event.
| Event | Posts About Event | Posts Attended Event | Total Discordance Between Assessors % | About Event Discordance Between Assessors % |
|---|---|---|---|---|
| Farnborough International Airshow | 107 (54%) | 53 (26%) | 2 (2.5%) | 1 (2.5%) |
| Scottish Rugby Autumn Internationals | 98 (49%) | 15 (8%) | 2 (2.5%) | 2 (5.0%) |
| Lewes Bonfire Night | 126 (63%) | 32 (16%) | 3 (3.75%) | 2 (5.0%) |
On inspection of Table 3, the following observations can be drawn. Around half of the posts were unrelated to the events – for instance, a women’s rugby event in New Zealand was also taking place around the same time. Some other selected tweets were observed to be concerned with issues related to race; for example, some tweets used the phrase “all blacks” to discuss issues related to race rather than the long-standing nickname for the New Zealand men’s rugby team. The presence of posts not relating to the events is not concerning, as the manual queries produced in Section 3.1 were intended for recall rather than precision.
The high virtual participation on X for the Scottish rugby event is also noticeable, with only 15% (15/98) of event-related posts clearly being from match attendees, in contrast to ~25% (32/126) for Lewes and ~50% (53/107) for the Farnborough International Airshow.
The final two columns concern the agreement between assessors. In general, the assessors agreed on both event-related and attendee judgements. Such minor levels of disagreement are unlikely to affect the use of this labelled data for choosing the most effective classifiers (Voorhees, 2000).
Having built the labelled datasets, we partitioned these into train and test sub-datasets (80% training, 20% test).
5.2 Aim 1 – Comparison of Event Attendance Classifiers.
Table 4 below provides the classification accuracies for the three Strand 1 test events, in terms of event attendees: Farnborough International Airshow (2024), Scotland vs New Zealand (2022) and Lewes Bonfire Night (2024). Classification accuracies are reported in terms of Precision and Recall, as well as combination measures (F1, Accuracy, Balanced Accuracy). These are measured using the labelled datasets.
Table 4.a-c: Comparison of classification accuracies for the three events
a. Farnborough International Airshow
| Model | F1 | Precision | Recall | Accuracy | Balanced Accuracy |
|---|---|---|---|---|---|
| Gradient Boosting Classifier - Transfer Learning | 0.59 | 0.83 | 0.45 | 0.83 | 0.71 |
| Gradient Boosting Classifier | 0.59 | 0.83 | 0.45 | 0.83 | 0.71 |
| Setfit | 0.62 | 0.53 | 0.73 | 0.75 | 0.74 |
| Llama3 8B Instruct zero shot | 0.72 | 0.64 | 0.82 | 0.83 | 0.82 |
| Llama3 8B Instruct few shot | 0.54 | 0.38 | 0.91 | 0.58 | 0.68 |
b. Scottish Rugby
| Model | F1 | Precision | Recall | Accuracy | Balanced Accuracy |
|---|---|---|---|---|---|
| Gradient Boosting Classifier - Transfer Learning | 0 | 0 | 0 | 0.90 | 0.49 |
| Gradient Boosting Classifier | 0 | 0 | 0 | 0.90 | 0.49 |
| Setfit | 0.21 | 0.12 | 1 | 0.43 | 0.69 |
| Llama3 8B Instruct zero shot | 0.67 | 0.50 | 1 | 0.93 | 0.96 |
| Llama3 8B Instruct few shot | 0.21 | 0.16 | 1 | 0.43 | 0.69 |
c. Lewes Bonfire Night
| Model | F1 | Precision | Recall | Accuracy | Balanced Accuracy |
|---|---|---|---|---|---|
| Gradient Boosting Classifier - Transfer Learning | 0.22 | 0.50 | 0.14 | 0.83 | 0.58 |
| Gradient Boosting Classifier | 0.40 | 0.67 | 0.29 | 0.85 | 0.57 |
| Setfit | 0.35 | 0.22 | 0.80 | 0.63 | 0.70 |
| Llama3 8B Instruct zero shot | 0.44 | 0.50 | 0.40 | 0.88 | 0.67 |
| Llama3 8B Instruct few shot | 0.26 | 0.15 | 1 | 0.30 | 0.60 |
On analysing the table, the focus was primarily on Balanced Accuracy, referring to other measures where appropriate. Firstly, the performance of the gradient boosted classifiers was considered. These were uniformly among the lowest performing classifiers; in particular, they did not identify any of the attendees in the test dataset for the Scottish Rugby event. Performance was better for the other events, but the highest recall of event attendees was only 29% for the Lewes event and 45% for the Farnborough International Airshow (notably, precision was reasonable for the latter). Accuracy using the original model (transferred from the festival training data of de Lira et al. (2019)) showed no difference from training with the new event-specific data.
Next, the Setfit language model classifier was considered: it was generally better than the gradient boosting classifiers in terms of Balanced Accuracy (providing the highest performance for Lewes). For instance, for the Farnborough International Airshow, it identified 73% of all the attendees’ tweets in the labelled data, but in doing so 47% of the tweets it flagged were from non-attendees (i.e. 1.0 – Precision).
Finally, the variants of the LLM classifier were examined. The zero-shot classifier is generally the best event attendance classifier of all those examined (highest Balanced Accuracy on two events, and only marginally lower than Setfit on the Lewes labelled data). Comparing the zero- vs. few-shot Llama3 instantiations, it was a surprise to observe that the zero-shot classifier provided the highest accuracy: in-context learning for LLMs is a widely used and effective technique in the recent literature, and its lower accuracy here was unexpected.
Some of the failures of the classifiers were identified. For instance, the classifier may struggle to separate the Scotland-New Zealand (men’s) rugby event from the women’s rugby event taking place around the same time – for example, “Congratulations, to [name] who represented Scotland, in the final of the Women’s Rugby World Cup in New Zealand, this morning. She had a great game” was classified positively. For the Farnborough International Airshow, false positives may hint at attendance by others rather than the poster themselves – e.g. “[company name] will be attending the Farnborough International Airshow, designed to pioneer the commercial space age, starting July 22nd. Our leaders, including our CEO, @[handle], and our COO, @[handle], will be present at the event and are looking forward to meeting you.”. Finally, for Lewes, false negatives were observed, in that the classifier might not have enough evidence from the text of the posts; had it analysed the associated media (images or videos) – which the human labellers had access to – the classifier might have been able to make a correct positive prediction. An example false negative in this category was “Amazing work as always Lewes 👏🎇🔥🎉 https://t.co/XXXX” (which had photos of fireworks and effigies that were burnt on the bonfire).
Overall, the Llama3 zero-shot classifier, which exhibited the highest accuracy across the datasets, was recommended.
5.3 Aim 2: Estimate how many People Attended the Event
To achieve Aim 2, the total number of attendees identified was examined, using the Llama3 zero-shot classifier applied to the complete datasets of all English-language social media posts obtained from Pulsar.
Table 5 details the number of attendees at each event, as identified from third-party sources, as well as the total number of social media posts and the raw number of users predicted to have attended, based on the output of the best classifier from Section 5.2 (Llama3 zero-shot). Moreover, the predicted social media attendees are expressed as a percentage of the known attendance.
Table 5 Attendee Predictions. The footnotes link to the 3rd party sources for the number of attendees at each event.
| Event | Attendees | Total Posts Identified from X | Of which Predicted Attendees (Llama3 zero-shot) | Attendee Prevalence from Social Media |
|---|---|---|---|---|
| Farnborough International Airshow (ticketed) | 100,385[footnote 1] | 8,285 | 2,768 | 2.76% |
| Scottish Rugby (ticketed) | 67,144[footnote 2] | 6,356 | 361 | 0.53% |
| Lewes Firework (unticketed) | 40,000[footnote 3] | 1,277 | 220 | 0.55% |
On examining the table, it can be observed that the strongest signal was for the Farnborough International Airshow: predicted attendees on X were equivalent to almost 3% of the known attendance, compared with around 0.5% for both the Scottish Rugby and Lewes Firework events. The likely reasons for the distinct properties of Farnborough International Airshow attendees are discussed in Section 5.4.1 below.
Learning-to-quantify (aka quantification), discussed in Section 4.2, is one approach to improving such estimates. At its heart, quantification (in this case the adjusted count method) allows for more accurate total counts of attendance by accounting for the error in the initial classifier. These adjusted classifiers can then be used to obtain more accurate counts on the larger dataset.
Unfortunately, learning-to-quantify has only been tested on traditional supervised machine learning techniques. It should be adaptable to those based on language models (e.g. Setfit); however, obtaining posteriors from a generative LLM is more challenging[footnote 4]. This research challenge will be investigated in future work; instead, the learning-to-quantify methods were applied to the gradient boosted (GB) classifier, even though it is less accurate than the LLM.
Table 6 below reports the results of the quantification experiments. Firstly, the number of labelled attendees (and the prevalence of attendees in the training data) is reported. Then, for the complete datasets (i.e. all English-language social media posts collected for each event), the number of posts is shown, as well as the number of predicted attendees for each event, according to firstly the LLM zero-shot classifier and secondly the GB classifier – these follow the traditional “classify-and-count” paradigm. The final column reports the estimates of the GB classifier after they have been adjusted by the adjusted count quantification method. Notably, while the GB classifier produces lower estimates (expected, as its Recall is lower, see Table 4), applying the quantification method leads to higher estimates, more aligned with the ground truth number of attendees. It can be concluded that adjusted count quantification has promise in adjusting estimates obtained using classifiers, but more R&D is required to apply it to the latest accurate LLM-based classifiers.
Table 6 Attendance quantification estimates – in terms of number of users – for each of the three events. Numbers in parenthesis are prevalence in the corresponding dataset.
| Event | True Counts (Training Data) | True Proportion (Training Data) | Number of Posts (Complete Dataset) | LLM Predicted Classify-and-Count | GB Classifier Classify-and-Count | GB Classifier Adjusted Count |
|---|---|---|---|---|---|---|
| Farnborough International Airshow | 11 (29%) | 29% | 8,285 | 2,768 (33%) | 1496 (18%) | 3976 (48%) |
| Scottish Rugby Autumn Internationals | 3 (7%) | 7% | 6,356 | 361 (5%) | 187 (3%) | 445 (7%) |
| Lewes Bonfire Night | 5 (13%) | 13% | 1,277 | 220 (17%) | 83 (6%) | 178 (14%) |
Finally, the concordance between the classifiers is analysed. As previously noted, in the labelled datasets, labels were produced for both “aboutness” (i.e. the post was about the event) and “attended”. For a post to be labelled as an attendee, it had to be about the event. These two levels of human labelling allow the development of independent classifiers for About and Attended, and the application of these to all obtained social media posts; the concordance between these classifiers can therefore be considered. Results are reported in Table 7 below – specifically, the number and percentage of unique users for “About”, “Attended”, and both “About and Attended”. From the table, the prevalence of About-only posts for each event was about 23-26%; this emphasises the usefulness of classifying posts for relevance, rather than relying on over-tuning the queries used for selecting social media posts through Pulsar.
However, there is also a proportion of posts predicted as ‘attended only’, which have not also been predicted as ‘about event’. This is most likely due to false positive predictions on posts predicted as ‘attended event’ - such as those detailed above with Scottish rugby - or false negative predictions from the classifier on posts predicted as ‘about event’.
The values for the LLM-predicted results in Table 6 are roughly equal to the combined totals of ‘about and attended’ and ‘attended only’ shown in Table 7, although the Table 6 values are slightly lower. Both tables count unique users, but Table 6 counts unique users authoring any post classified as attending, whereas Table 7 splits this group into posts classified as both ‘about and attended’ and as ‘attended only’. This indicates that a subset of users wrote multiple posts about the event, some classified as ‘about and attended’ and others as ‘attended only’; such users are double counted in Table 7.
Table 7 Number of unique users about, and number of attending unique users, as well as the intersection (LLM zero-shot classifier).
| Event | About Only | About Only % | About and Attended | About and Attended % | Attended Only | Attended Only % |
|---|---|---|---|---|---|---|
| Farnborough International Airshow | 2133 | 25.75 | 1719 | 20.75 | 1388 | 16.75 |
| Scottish Rugby Autumn Internationals | 1471 | 23.14 | 185 | 2.91 | 193 | 3.04 |
| Lewes Bonfire Night | 328 | 25.69 | 124 | 9.71 | 106 | 8.3 |
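The Table 7 breakdown amounts to set operations over the unique user ids flagged by the two classifiers; a minimal sketch:

```python
def concordance(about_users, attended_users):
    """Partition unique users into About-only, About-and-Attended and
    Attended-only groups, mirroring the Table 7 breakdown.

    Accepts any iterables of user ids; returns a dict of sets.
    """
    about, attended = set(about_users), set(attended_users)
    return {"about_only": about - attended,
            "about_and_attended": about & attended,
            "attended_only": attended - about}
```

Because the inputs are de-duplicated into sets, a user with several posts is counted once per group they fall into, which is exactly the double-counting behaviour between Tables 6 and 7 described above.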
5.4 Aim 3 – Analysis of Estimated Engagement
This section details an analysis for each of the three events, demonstrating some of the value of using social media for attendance monitoring. This includes analysis of the follower/friend ratio of those posting, and their home location.
5.4.1 Farnborough International Airshow
Figure 13 shows the distribution of home countries from which the tweets originated, based on the stated location of the user in their X profile. While the UK is most frequent, the US also accounts for a high number of predicted attendees. The full list of UK locations stated by social media attendees is included in Appendix 7.3.
To explain the high number of attendees from the US, it was hypothesised that the airshow is not just a public event, but also an industrial tradeshow, with many industrial attendees. Indeed, Farnborough International Airshow reports[footnote 5] that it had visitors from 114 countries, members of the media from 56 countries, and exhibitors from 41 countries.
With the industrial nature of the event in mind, the follower/friend ratio of the predicted attendees was examined and compared across all three events. This is shown in Figure 14. From the figure, it can be seen that most attendees have approximately the same number of followers as friends, and that more accounts with large numbers of followers were predicted as attendees of the Farnborough International Airshow. The difference in the follower/friend ratio between the Farnborough International Airshow and both the Scottish rugby internationals and Lewes bonfire night events was found to be significant (p < 0.001 for both comparisons), whereas the difference between the Scottish rugby internationals and Lewes bonfire night was not (p = 0.025, above the Bonferroni-corrected threshold). This emphasises the likelihood of many industrial attendees, who will have accounts with large numbers of followers.
Figure 13: Distribution of Predicted Attendees by Country - Farnborough International Airshow
Figure 14: Follower Friend Ratios for predicted attendees for each event.
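The report does not name the significance test used for the follower/friend comparisons; a minimal stdlib-only sketch, assuming a rank-based (Mann-Whitney U) test with the Bonferroni correction described above, might look like:

```python
import math

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.
    Assumes no ties (reasonable for continuous follower/friend ratios)."""
    n1, n2 = len(x), len(y)
    pooled = sorted((v, i) for i, v in enumerate(x + y))
    # sum of (1-based) ranks held by the first sample
    rank_sum_x = sum(rank + 1 for rank, (_, idx) in enumerate(pooled) if idx < n1)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

def bonferroni_significant(p, n_comparisons=3, alpha=0.05):
    """Bonferroni correction: compare p against alpha divided by the
    number of pairwise comparisons (three event pairs here)."""
    return p < alpha / n_comparisons
```

Under this correction, the threshold for three comparisons is 0.05/3 ≈ 0.0167, which is why a raw p of 0.025 is reported as not significant.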
5.4.2 Scottish Rugby
Figure 15 below shows the country distribution of the predicted attendees at the Scotland–New Zealand rugby game. As expected, the majority of posts are from the UK; the next most frequent country was the United States, followed by New Zealand (17 predicted attendees). The latter is not unexpected, as these are likely players or fans who travelled to attend the game.
Figure 15: Number of Posts indicating Attendance at Scottish Rugby Autumn International, by Country
Table 8 shows the distribution of locations of predicted attendees within the UK for the Scottish rugby event; only locations mentioned by more than one user are shown, and a further 39 locations were mentioned only once. Edinburgh[footnote 6] (the location of the game) appears most frequently (44), followed by Glasgow (12). Other Scottish towns include some with good transport links to Edinburgh (e.g. Aberdeen, Falkirk, Inverkeithing, Dundee, Dunblane, Dunbar, St Andrews) as well as others that are quite distant from Edinburgh (Peterhead, Stranraer, Ayr and Prestwick). Towns known for their rugby focus in the Scottish Borders are also mentioned (Hawick, Melrose). In other cases, unitary council areas or regions are mentioned (e.g. South Ayrshire, Scottish Borders, Moray, Highland). Finally, referring back to Figure 14, the distribution of followers/friends is similar to that of Lewes rather than the Farnborough International Airshow, suggesting that most accounts are personal in nature.
Table 8: All locations mentioned more than once in the predicted Scottish rugby attendees
| Location | Number of Posts About Event | Number of Posts Attended Event | Location | Number of Posts About Event | Number of Posts Attended Event |
|---|---|---|---|---|---|
| No City Information | 373 | 56 | Newcastle upon Tyne | 8 | 3 |
| Edinburgh | 125 | 39 | Alva | 2 | 2 |
| Glasgow | 78 | 12 | East Lindsey | 1 | 2 |
| London | 91 | 9 | Goring | 3 | 2 |
| Aberdeen | 18 | 6 | Hawick | 3 | 2 |
| Cardiff | 23 | 6 | Kirkwall | 11 | 2 |
| Dundee | 8 | 6 | Melrose | 1 | 2 |
| Moray | 3 | 6 | Monifieth | 1 | 2 |
| Scotland | 5 | 6 | Oxfordshire | 2 | 2 |
| City of Edinburgh | 26 | 5 | Paisley | 1 | 2 |
| County Durham | 5 | 5 | Perth | 4 | 2 |
| Fife | 7 | 5 | Saint Andrews | 1 | 2 |
| Birmingham | 7 | 3 | South Shields | 2 | 2 |
| Inverness | 2 | 3 | | | |
5.4.3 Lewes Bonfire 2024
Table 9 below provides information about the predicted attendees, as obtained from the users’ profile information. The most frequent location is Lewes itself, followed by London (1 hour by train) and Brighton (10 miles, 17 minutes by train). Again, from Figure 14, it was observed that the distribution of followers/friends is distinct from the industry-focussed Farnborough International Airshow event.
Table 9: UK locations of predicted attendees at Lewes fireworks.
| Location | Number of Posts About Event | Number of Posts Attended Event | Location | Number of Posts About Event | Number of Posts Attended Event |
|---|---|---|---|---|---|
| Bath | 1 | 1 | Leeds | 2 | 1 |
| Belfast | 2 | 2 | Lewes | 51 | 23 |
| Brighton | 29 | 20 | Lewisham | 1 | 1 |
| Bristol | 2 | 2 | London | 39 | 21 |
| City of London | 2 | 1 | Margate | 1 | 1 |
| East Sussex | 9 | 3 | No City Information | 83 | 46 |
| Eastbourne | 3 | 1 | Poynings | 29 | 8 |
| Eccleshall | 1 | 1 | Staffordshire | 2 | 2 |
| Hertfordshire | 1 | 1 | Wealden | 4 | 2 |
| Hove | 1 | 1 | West Sussex | 6 | 2 |
6. Conclusions
This work was concerned with using the Pulsar social media aggregator platform to analyse social media posts on X (formerly Twitter) about three events. These posts were analysed by classifier models to predict whether they were actually about the event and from actual attendees. A labelled dataset was developed for each event (around 200 posts per event) to determine the accuracy of the classification approaches, which were then used to estimate the number of attendees present on social media and to perform further analysis. Below is a summary of the achievements and findings (Section 6.1), limitations (Section 6.2), and considerations for future iterations (Section 6.3).
6.1 Achievements and Findings
As detailed in Section 3, datasets of social media posts from X were acquired for three events, totalling 35,761 posts across all three event spaces. Subsets totalling 600 tweets were labelled, with excellent agreement between assessors. Furthermore, the event attendance methodology proposed by de Lira et al. (2019) was applied and updated in order to investigate how well the methodology performed when applied to events other than music festivals, as well as to utilise recent state-of-the-art classifiers (SetFit, a Transformer language model, and Llama3, a large language model). Finally, a recent learning-to-quantify approach from the literature was detailed, which can be used in combination with classifiers to refine estimates of quantities. It should be noted that throughout these processes all social media data was anonymised on ingestion, to reduce the risk of users being identifiable.
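The agreement statistic used between assessors is not named here; for two assessors producing binary attended/not-attended labels, Cohen's kappa is a standard choice (an assumption, not necessarily the measure the team used). A minimal sketch:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors' binary (0/1) labels:
    observed agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pos_a = sum(labels_a) / n  # assessor A's rate of 'attended' labels
    pos_b = sum(labels_b) / n
    chance = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    return (observed - chance) / (1 - chance)
```

Values near 1 indicate near-perfect agreement beyond chance; values near 0 indicate agreement no better than chance.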
The findings with respect to the three experimental aims identified in Section 1.5 are detailed below.
Aim 1 was concerned with using machine learning to predict event attendance, following the work of de Lira et al. (2019). The key finding was that the best classifier of attendance was the LLM zero-shot classifier, which used a state-of-the-art instruction-tuned model.
Aim 2 was concerned with applying the best models to identify predicted attendees. Approximately 0.5%–2.5% of attendees were detected across the three events. Learning-to-quantify was further applied to the data (though not to the LLM zero-shot classifier), which produced higher and more accurate estimates of attendees.
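The learning-to-quantify refinement can be illustrated with the adjusted classify-and-count method of Forman (2005), cited in the references; this is a sketch of the general technique under stated assumptions, not necessarily the exact variant applied in this study:

```python
def adjusted_classify_and_count(predictions, tpr, fpr):
    """Adjusted classify-and-count (Forman, 2005): correct the raw
    classify-and-count prevalence using the classifier's true and false
    positive rates, measured on held-out labelled data."""
    raw = sum(predictions) / len(predictions)  # naive prevalence estimate
    adjusted = (raw - fpr) / (tpr - fpr)       # invert the misclassification
    return min(1.0, max(0.0, adjusted))        # clip to a valid proportion
```

For a low-recall attendance classifier (tpr well below 1), the adjustment raises the raw estimate, consistent with the higher estimates reported above: e.g. a raw 2% positive rate with tpr 0.5 and fpr 0.005 adjusts to roughly 3%.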
For Aim 3, further analysis was performed on the predicted attendees of the events. Many social media users declare a home location, which enables estimation of where the attendees come from. The international nature of the Farnborough International Airshow was clearly visible from the social media data; travelling supporters for the Scottish rugby game were also identified, including both supporters from New Zealand following the team on tour and Scotland supporters from known rugby hotspots. Further observation showed that the industrial nature of the Farnborough International Airshow was visible in the data collected, in that more of the attendee accounts had a large number of followers, and more followers than friends. Lastly, it was possible to qualitatively examine the anonymised social media posts to increase understanding; this is a particular advantage of social media over, say, mobile app data, as explored in other case studies of this project.
Finally, a reflection on the project in terms of the research assessment criteria detailed in Section 1.2:
- Accuracy: A labelled test dataset was developed to determine the accuracy of the attendance classifiers (Section 5.2). Moreover, the prevalence of attendance was estimated in the obtained datasets for each event, and compared to attendance figures based on ticket sales or official estimates (Section 5.3).
- Biases: The biases of social media data were discussed in Section 2.1, and the biases of the data for particular events were noted (e.g. industry vs. personal participation in Section 5.4.1).
- Ethics: There is a high level of awareness concerning discomfort around social media monitoring. The project undertook a Data Protection Impact Assessment and notified the ICO before commencing this work (see Section 1.4). The public have also been informed of this work through the publication of a privacy notice on Gov.uk. Moreover, all social media posts were anonymised on ingestion to reduce the risk of identifying individuals.
- Deliverability: There were clear challenges in obtaining a new Instagram account with the correct privileges, despite assurances from Pulsar that this was possible.
- Cost: Pulsar's costs for obtaining social media access were reasonable compared to accessing the X API directly. Expertise was needed in machine learning; however, it is notable that the Llama3 LLM approach would require less expertise in the longer term.
6.2 Limitations
Social Media Data Sources: The intention was to analyse both X and Instagram, but Instagram access could not be obtained (see Section 6.1). Notably, the nature of use of X has evolved over the years, and it is now less likely to be used for posting about personal events. Intuitively, Instagram is more likely to be used when attending personal events – e.g. posting photos of attending rugby or fireworks, or photos of planes at an airshow (Johnson, 2024).
Attendance Counts: Although low attendance counts were observed for the personal events, statistical sampling methods, such as the finite population correction (FPC), can be applied to ensure that sample sizes and statistical estimates are accurate and efficient when dealing with finite populations. This adjustment prevents overestimation of the required sample size and is particularly important when the sample size exceeds 5% of the total population. Initial analysis suggests statistically significant attendance counts for the Farnborough International Airshow event.
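The finite population correction mentioned above can be sketched as follows; `proportion_se` is an illustrative helper (not part of the project's codebase) showing the standard error of a sample proportion, with the FPC applied only when the sampling fraction exceeds the 5% rule of thumb:

```python
import math

def fpc(n, N):
    """Finite population correction factor for a sample of n drawn
    from a finite population of N."""
    return math.sqrt((N - n) / (N - 1))

def proportion_se(p, n, N=None):
    """Standard error of a sample proportion; the FPC shrinks the SE
    when the sampling fraction n/N exceeds 5%."""
    se = math.sqrt(p * (1 - p) / n)
    if N is not None and n / N > 0.05:
        se *= fpc(n, N)
    return se
```

For example, sampling half of a population of 100 shrinks the naive standard error by a factor of roughly 0.71.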
Modelling: The developed classification models almost exclusively use the text of the post (one exception – the GB classifier had additional hand-engineered features). This ignores any information coming from other media attached to the social media posts, such as images or videos.
Labelling: In the original work by de Lira et al. (2019), the ground truth for training the classifiers was formed by looking for users who made geo-tagged posts from the festival. As this granularity of geolocation is no longer possible on X, this study relied on human labellers for all training data. Human labellers had to make decisions about their interpretation of social media posts; it is likely that some users who were attendees were not identifiable, even by human assessors.
Querying social media using Pulsar: Queries were generated manually for identifying social media posts; this followed an iterative procedure, where the counts of selected posts were inspected. This iterative counting process was done outside of Pulsar to avoid consuming the limited (and expensive) post collection budget. However, use was made of the X search engine to examine what kinds of posts were retrieved by various keywords[footnote 7]. It is possible that the queries were not wide enough to recall all relevant posts by event attendees; anecdotally, the iterative refinement process gave reasonable coverage of attendees.
6.3 Considerations for Further Iterations
Social Media Data Sources:

- The issues regarding Instagram access should be surmountable; other possibilities include alternative data sources offered by Pulsar, such as Facebook public pages, which may provide access to other demographics.
- Other online sources could also be explored, for example Tripadvisor, which includes reviews of places and events.
Classifiers:
- A small language model-based classifier, SetFit, already demonstrated improved attendance classification accuracy over the gradient boosting classifiers of de Lira et al. (2019). More classifier models could have been attempted, and other hyperparameters of the model(s) could have been better tuned. However, with the limited amount of training data available for each event (this study was limited to 200 posts per event), a better direction for future work would be learning a general classifier that works across events, by combining all human-labelled annotations available from de Lira et al. (2019) and the three events here.
- In predicting event attendance, it is evident that being able to make inferences using the media (photos/images) attached to social media posts may improve classification accuracy. The potential of this kind of processing has been enhanced in recent years by the advent of multi-modal LLMs such as LLaVA (Large Language and Vision Assistant)[footnote 8]. A simple integration may be to append to social media posts a textual description of the attached media (e.g. “photo of fireworks”), such that the classification LLM can take this information into account.
- The limited amount of training data means that LLM-based approaches, which require no training per se, are more promising. These can be improved through longer development of the input prompt for the LLM (known as prompt engineering). This can be carried out manually or through DSPy[footnote 9], a promising tool that can automatically enhance prompts using some labelled training data.
- The low accuracy of the few-shot (in-context learning) LLM was not expected; it is possible that better examples (Margatina et al., 2023; McKechnie et al., 2024) or a larger LLM would have improved performance.
Learning-to-quantify:
- It was not possible to apply learning-to-quantify to the LLM classifier at this stage, as this is more challenging than with the conventional classifiers. In this regard, it may be possible to reformulate the LLM classifier to give a posterior likelihood for each prediction.
- It would be interesting to apply learning-to-quantify for attendance prediction across a number of events, to give better estimates of prevalence.
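On reformulating the LLM classifier to give a posterior: one route noted in the footnotes is to examine the logits of the tokens the model considered generating. A sketch of turning the logits for the "Yes" and "No" answer tokens into a posterior probability via a two-way softmax (illustrative only, assuming such logits can be extracted from the model):

```python
import math

def yes_no_posterior(logit_yes, logit_no):
    """Convert the LLM's logits for the 'Yes' and 'No' answer tokens
    into a posterior probability of attendance (two-way softmax)."""
    m = max(logit_yes, logit_no)       # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - m)
    exp_no = math.exp(logit_no - m)
    return exp_yes / (exp_yes + exp_no)
```

Such a posterior is exactly what quantification methods need, since they operate on per-item confidence scores rather than hard labels.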
7. References
Alaiz-Rodríguez and Japkowicz 2008. Alaiz-Rodríguez, R, Japkowicz, N. Assessing the impact of changing environments on classifier performance, in: Proceedings of the Canadian Society for Computational Studies of Intelligence, 21st Conference on Advances in Artificial Intelligence, Canadian AI ‘08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 13–24. https://link.springer.com/content/pdf/10.1007/978-3-540-68825-9_2.pdf
Barkan, Oren, et al. “Scalable attentive sentence pair modeling via distilled sentence embedding.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020. https://ojs.aaai.org/index.php/AAAI/article/view/5722/5578
De Lira et al. (2019). Vinicius Monteiro de Lira, Craig Macdonald, Iadh Ounis, Raffaele Perego, Chiara Renso, Valeria Cesario Times, Event attendance classification in social media, Information Processing & Management, Volume 56, Issue 3, 2019, https://doi.org/10.1016/j.ipm.2018.11.001.
Dubey, Abhimanyu, et al. “The Llama 3 herd of models.” arXiv preprint arXiv:2407.21783 (2024). https://arxiv.org/abs/2407.21783
Feng, Jinyuan et al. (2024). Jinyuan Feng, Zaiqiao Meng and Craig Macdonald. TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation. In Proceedings of EMNLP 2024. https://arxiv.org/abs/2406.11460
Forman, George. (2005). Counting positives accurately despite inaccurate classification. In Proceedings of the European Conference on Machine Learning (ECML’05). 564–575. https://shiftleft.com/mirrors/www.hpl.hp.com/techreports/2005/HPL-2005-96R1.pdf
Friedman (1999). Jerome H. Friedman. Stochastic Gradient Boosting. https://web.archive.org/web/20140801033113/http:/statweb.stanford.edu/\~jhf/ftp/stobst.pdf
González, Pablo, et al. “A review on quantification learning.” ACM Computing Surveys (CSUR) 50.5 (2017): 1-40.
Johnson, Brianna (2024). Twitter vs Instagram: Which Platform Is Better? https://penji.co/twitter-vs-instagram-better-business/
Mann, Ben, et al. “Language models are few-shot learners.” arXiv preprint arXiv:2005.14165 1 (2020). https://arxiv.org/abs/2005.14165
Margatina et al. (2023). Katerina Margatina, Timo Schick, Nikolaos Aletras, Jane Dwivedi-Yu. Active Learning Principles for In-Context Learning with Large Language Models. https://arxiv.org/abs/2305.14264
McKechnie et al. (2024). Jack McKechnie, Craig Macdonald, Graham McDonald. Context Example Selection For LLM Generated Relevance Assessments. Under review.
Mikolov et al. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 3781. https://arxiv.org/pdf/1301.3781
Ofcom, 2023. Ofcom, Adults’ Media Use and Attitudes report 2023, 2023. https://www.ofcom.org.uk/siteassets/resources/documents/research-and-data/media-literacy-research/adults/adults-media-use-and-attitudes-2023/adults-media-use-and-attitudes-report-2023.pdf?v=329409
Peng, Yi, and Liling Lu. “Untangling influence: The effect of follower-followee comparison on social media engagement.” Journal of Retailing and Consumer Services 78 (2024): 103747. https://www.sciencedirect.com/science/article/abs/pii/S0969698924000432
Siino (2024). All-MPNet at SemEval-2024 Task 1: Application of MPNet for Evaluating Semantic Textual Relatedness. Marco Siino. Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 379–384 https://aclanthology.org/2024.semeval-1.59.pdf
Vaswani, A. “Attention is all you need.” Advances in Neural Information Processing Systems (2017) https://arxiv.org/pdf/1706.03762
Voorhees, Ellen M., and Donna K. Harman. “The text retrieval conference.” TREC: Experiment and evaluation in information retrieval (2005): 3-19.
Ellen Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5):697–716, September 2000. https://www.sciencedirect.com/science/article/pii/S0306457300000108
Yilmaz et al. (2008). Emine Yilmaz, Evangelos Kanoulas, and Javed A. Aslam. 2008. A simple and efficient sampling method for estimating AP and NDCG. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR ‘08). Association for Computing Machinery, New York, NY, USA, 603–610. https://doi.org/10.1145/1390334.1390437
Wei, Jason, et al. “Emergent abilities of large language models.” arXiv preprint arXiv:2206.07682 (2022) https://arxiv.org/abs/2206.07682
Yan, Weiwei, Yin Zhang, and Wendy Bromfield. “Analyzing the follower–followee ratio to determine user characteristics and institutional participation differences among research universities on ResearchGate.” Scientometrics 115.1 (2018): 299-316. https://link.springer.com/article/10.1007/s11192-018-2637-6
Zhao et al. (2024) A Survey of Large Language Models. https://arxiv.org/abs/2303.18223
8. Appendix
8.1 Definitions
Vector: A list of numbers that represents a point in multi-dimensional space. In this context, the vector represents a word or sentence. This vector should encode some semantic meaning of the word or sentence it represents with respect to other vectors.
Array: A list of multiple vectors.
Embedding space: A multi-dimensional space in which vectors are represented.
Transformer model: A type of machine learning model used for natural language processing. It improves on previous models by not processing each word strictly in sequence, but instead attending to the relationships between all words in a sentence to extract greater context.
Sentence Transformer model: A machine learning model that builds upon transformers. It finds a numerical representation of a sentence as a whole, not just individual words.
Appendix Figure 1: How words are embedded in an embedding space to conserve semantic meaning
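The idea of vectors conserving semantic meaning can be made concrete with cosine similarity, the usual way to compare embedding vectors. The three-dimensional vectors below are toy values for illustration (real sentence embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: 1.0 for the
    same direction, 0.0 for orthogonal, -1.0 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy "embeddings": semantically close sentences get nearby vectors,
# so their cosine similarity is higher than that of unrelated ones
close = cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
distant = cosine_similarity([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])
```

Sentence Transformer models such as SetFit's backbone rely on exactly this property when comparing posts.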
8.2 LLM Prompts
```
Your task is to determine whether a social media posts suggests that the author or another person attended a particular event.

{event} is {event_description}

Instructions:

- Yes: Consider the event description provided. Classify as "Yes" if it includes mention or implication that the author participated in or will participate in, or was physically present at the event.
- No: Classify as "No" if the social media post does not suggest the author attended the event, it is about the event but does not indicate past or future attendance.

Ensure the post references the specific event and indicates the poster being physically present
Return only either "Yes" or "No"

Post: {post}
Response:
```
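As an illustration of how such a prompt template might be filled and the model's reply handled, the helpers below are hypothetical (the actual Llama3 invocation is not shown); the strict parse reflects the instruction to return only "Yes" or "No", routing anything else to a human labeller:

```python
def build_prompt(template, event, event_description, post):
    """Fill the {event}, {event_description} and {post} placeholders
    of the classification prompt template."""
    return template.format(event=event,
                           event_description=event_description,
                           post=post)

def parse_attendance(response):
    """Map the model's reply onto a strict Yes/No label.
    Returns True/False, or None for an unparseable reply."""
    answer = response.strip().strip('"').strip().rstrip('.').lower()
    if answer == "yes":
        return True
    if answer == "no":
        return False
    return None  # unparseable: route to a human labeller for review
```

A strict equality check is deliberately used instead of substring matching, so replies like "Not sure" are flagged rather than silently treated as "No".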
8.3 Distribution of Number of Posts per User
Appendix Figure 2: The distribution of the number of posts per user made about each event. In all cases, a long-tail distribution is obtained: many users make one post, while a few users make more than one. Farnborough International Airshow has one account that made by far the largest number of posts (250). (a) Scottish Rugby International. (b) Farnborough International Airshow. (c) Lewes Bonfire Night.
(a) Scottish Rugby International
(b) Farnborough International Airshow
(c) Lewes Bonfire Night
8.4 UK Locations of Farnborough International Airshow Attendees
Table 10: Distribution of UK Locations of Attendees of Farnborough International Airshow
| Location | Number of Posts About Event | Number of Posts Attended Event | Location | Number of Posts About Event | Number of Posts Attended Event |
|---|---|---|---|---|---|
| Abingdon | 7 | 6 | Lancaster | 1 | 1 |
| Aldershot | 3 | 1 | Leeds | 3 | 3 |
| Amber Valley | 1 | 1 | Leicester | 5 | 3 |
| Ash | 1 | 1 | Lewes | 1 | 1 |
| Basildon | 1 | 1 | Lincoln | 3 | 6 |
| Basingstoke | 3 | 2 | Lincolnshire | 3 | 1 |
| Bath | 3 | 1 | Liverpool | 20 | 21 |
| Belfast | 8 | 6 | London | 678 | 435 |
| Birmingham | 15 | 11 | Macclesfield | 2 | 2 |
| Bishop’s Stortford | 1 | 1 | Manchester | 23 | 17 |
| Blackburn | 3 | 1 | Merseyside | 2 | 2 |
| Blackpool | 1 | 1 | Newbury | 1 | 1 |
| Bolton | 2 | 1 | Newcastle upon Tyne | 1 | 2 |
| Bordon | 2 | 1 | Newport | 1 | 4 |
| Bournemouth | 4 | 4 | No City Information | 735 | 622 |
| Bracknell | 1 | 1 | North Shields | 1 | 1 |
| Bristol | 24 | 26 | North West | 1 | 1 |
| Burnley | 3 | 1 | Northamptonshire | 4 | 4 |
| Cambridge | 15 | 11 | Norwich | 3 | 4 |
| Canterbury | 2 | 4 | Nottingham | 6 | 4 |
| Cardiff | 2 | 3 | Oxford | 19 | 12 |
| Cheshire | 1 | 1 | Oxfordshire | 4 | 9 |
| Chester | 2 | 2 | Peterborough | 166 | 70 |
| Chesterfield | 2 | 3 | Plymouth | 1 | 1 |
| Chichester | 2 | 2 | Portsmouth | 6 | 6 |
| Chippenham | 1 | 1 | Poynton | 11 | 18 |
| Christchurch | 3 | 5 | Preston | 8 | 4 |
| City of London | 1 | 1 | Purton | 1 | 1 |
| Clarencefield | 2 | 2 | Reading | 16 | 8 |
| Clarkston | 1 | 1 | Redditch | 1 | 2 |
| Coatbridge | 1 | 1 | Ribble Valley | 1 | 2 |
| Coniston | 1 | 1 | Richmond | 1 | 2 |
| Cornwall | 3 | 2 | Rotherham | 2 | 1 |
| Cotswold District | 1 | 1 | Saint Helens | 1 | 1 |
| County Durham | 1 | 2 | Saintfield | 1 | 1 |
| Coventry | 10 | 8 | Sandown | 1 | 1 |
| Cranfield | 28 | 27 | Sheffield | 7 | 9 |
| Crawley | 2 | 1 | Shrewsbury | 1 | 1 |
| Crewe | 1 | 1 | Shrivenham | 1 | 1 |
| Cumbria | 1 | 1 | Shropshire | 3 | 3 |
| Deal | 4 | 3 | Slough | 1 | 2 |
| Denham | 2 | 3 | South Yorkshire | 2 | 2 |
| Derbyshire | 3 | 1 | Southampton | 4 | 2 |
| Devon | 2 | 2 | Southport | 5 | 11 |
| Doncaster | 2 | 1 | Staffordshire | 2 | 5 |
| Dorset | 1 | 1 | Stevenage | 1 | 1 |
| East Lindsey | 1 | 1 | Stourbridge | 2 | 1 |
| Eastbourne | 5 | 3 | Surrey | 6 | 6 |
| Edinburgh | 10 | 4 | Surrey Heath | 11 | 4 |
| Egham | 1 | 1 | Swansea | 2 | 3 |
| Erewash | 1 | 1 | Swindon | 3 | 7 |
| Essex | 3 | 2 | Test Valley | 1 | 3 |
| Exeter | 3 | 4 | Tewkesbury | 4 | 2 |
| Fairford | 1 | 1 | Thetford | 1 | 4 |
| Farnborough | 64 | 45 | Thornton | 1 | 1 |
| Farnham | 2 | 2 | Torquay | 3 | 4 |
| Fife | 1 | 1 | Torridge | 3 | 1 |
| Gillingham | 1 | 1 | Uxbridge | 1 | 1 |
| Glasgow | 19 | 13 | Vale of White Horse | 3 | 3 |
| Gloucester | 3 | 3 | Wakefield | 2 | 1 |
| Gloucestershire | 2 | 5 | Waverley | 8 | 10 |
| Greater London | 1 | 2 | West Berkshire | 3 | 1 |
| Guildford | 12 | 6 | West Lindsey | 1 | 1 |
| Halesowen | 1 | 1 | West Midlands | 2 | 2 |
| Hampshire | 28 | 18 | Weybridge | 5 | 4 |
| Harrogate | 2 | 2 | Winchester | 2 | 4 |
| Hart | 6 | 7 | Witney | 1 | 1 |
| Hereford | 1 | 1 | Woking | 7 | 2 |
| Holt | 1 | 2 | Wokingham | 1 | 2 |
| Huntingdonshire | 1 | 1 | Wolverhampton | 1 | 1 |
| Ipswich | 2 | 1 | Yeovil | 2 | 5 |
| Kent | 5 | 3 | York | 1 | 1 |
- https://www.farnboroughinternational.org/what-we-do/farnborough-airshow/ ↩
- https://www.autumn-internationals.co.uk/2022/scotland-v-new-zealand.html ↩
- In particular, the LLM classifier does not produce a confidence/posterior, which is required by the quantification methods. It is possible to ask the LLM for a single token and examine the logits of the tokens it considered generating (as used by Feng et al. (2024)), but this requires further code changes as well as empirical validation on learning-to-quantify datasets. ↩
- https://www.farnboroughinternational.org/what-we-do/farnborough-airshow/ ↩
- Interestingly, this appears in the data as Edinburgh and City of Edinburgh. There is no such duplication for other cities such as Glasgow. ↩
- The X search engine only has limited coverage of historical events. ↩