Research and analysis

Data landscape review report 2024 to 2025

Published 26 March 2025

1. Executive summary

1.1. Overview of the report

The Online Safety Act, which received Royal Assent in 2023, marks a major step in improving online safety outcomes in the United Kingdom (UK). Alongside the legislation, the UK Government has focused on supporting innovation and the growth of a dynamic ecosystem of technology startups and innovators. The UK’s world-leading safety tech sector is at the forefront of tackling online harms. Central to this innovation is high-quality data, which is essential for training, testing, and improving Artificial Intelligence (AI) and machine learning models, products, and services.

In January 2021, the Department for Digital, Culture, Media and Sport (DCMS) launched the Online Safety Data Initiative (OSDI) to explore how to improve access to high-quality data for online safety technologies. The initiative identified key barriers to data access and sharing, which were further examined in the 2022 Data Landscape Review. That review confirmed that online safety tech providers continue to face major data challenges. Since then, the sector’s needs have shifted: the emphasis has moved from data quantity to data quality, with an increasing demand for high-quality, well-labelled, and representative datasets that reflect real-world scenarios.

Commissioned by the Department for Science, Innovation and Technology (DSIT), this report builds on those findings and examines how data access needs have evolved since January 2023. It assesses the current data challenges faced by online safety tech providers, reviews recent data sharing initiatives, and provides a broad context for these findings within the current policy and market landscape.

This report focuses specifically on data access and sharing challenges related to online harms and the online safety tech sector, including the technologies and initiatives used to address them, as opposed to general issues with data sharing across the economy or sectors.

1.2. Key findings

1.2.1. Barriers to data access and sharing

Barriers to data access and to data sharing can be grouped into five categories:

  • Competition and commercial barriers

  • Legal and regulatory barriers

  • Data availability barriers

  • Data quality and standardisation barriers

  • Ethical and cultural barriers

The key barriers summarised below are not exhaustive but reflect the most prominent themes identified through desk research and stakeholder engagement, and are explored in more detail throughout the report.

1.2.1.1. Barriers to data access

Competition and commercial

  • Online safety tech providers are concerned about the high costs and resources required to acquire and license high-quality datasets, as well as the reluctance of online platforms to share data and maintain transparency.

Legal and regulatory / Ethical and cultural barriers

  • The legal and ethical acquisition and usage of data is a concern for online safety tech providers. They face challenges in acquiring third-party closed datasets due to data protection and intellectual property laws, alongside legal and reputational risks when using openly accessed data with unclear provenance, particularly for sensitive harm types. While these restrictions reflect necessary safeguards - grounded in data protection, the European Convention on Human Rights (ECHR) and equalities legislation - they can create complexity in accessing and using data responsibly. Ongoing government initiatives aim to clarify and support responsible data sharing in this context.

Data availability barriers

  • Despite a strong appetite, access to government and law enforcement databases remains a challenge for safety tech providers, likely due to legal, policy, and security considerations.

1.2.1.2. Barriers to data sharing

Legal and regulatory barriers

  • Secure, sensitive data sharing remains a major concern for safety tech providers, primarily due to legal and ethical concerns such as the risk of breaches or misuse; while these safeguards are recognised as important, they can still present practical challenges. This is especially true for sensitive data on child sexual exploitation and abuse (CSEA) and data involving minors.

  • Safety tech providers also face limitations on combining datasets for expanded usage due to data protection laws and contractual obligations, obstructing cross-platform data pooling and innovation.

Data availability / Data quality and standardisation barriers

  • The lack of data standardisation limits data sharing, particularly among online safety tech providers using content classification and hashing techniques, making it harder to share, integrate, and compare information across platforms.

1.2.2. Evolving data needs

  • Advancements in AI, particularly large language models (LLMs), have reduced reliance on large-scale, real-world datasets for certain online safety tech applications, opening new opportunities to use synthetic data - especially when it reflects the complexity of real-world online harms.

  • Online safety tech providers have highlighted the need for multimodal data - datasets that combine different content types such as text, audio, images, and video - which are currently limited. Access to this type of data is essential for developing multimodal solutions in response to emerging threats.

  • It is particularly crucial that training data reflects real-world online harms, including ‘noisy data’ - imperfect or low-quality content like blurry images or poor lighting - that mirrors the variability of online environments and helps train AI models to perform accurately in real-world conditions.

  • Industry stakeholders highlight the growing importance of real-time data, enabling more immediate and accurate threat detection and response.

1.2.3. Solutions to overcome barriers

  • Synthetic data has emerged as one of the most prominent technologies for addressing data scarcity and enabling more flexible data usage, though ethical concerns, such as the accidental creation of child sexual abuse material (CSAM), remain.

  • Privacy Enhancing Technologies (PETs) - tools that can help maximise the use of data by reducing risks inherent to data use[footnote 1] - are increasingly explored by online safety technology providers to analyse and share sensitive data while maintaining privacy. One example which has gained particular attention from academia and some providers is federated learning, which enables AI models to be trained across different organisations without the centralised collection of training data.[footnote 2]

  • The online safety tech sector has seen a noticeable increase in data access and sharing collaborations. These include collaboration among safety tech providers, partnerships with non-governmental organisations (NGOs), engagement with public authorities such as regulatory sandboxes, cross-sector data sharing, and bilateral or multilateral data sharing partnerships.

2. Introduction

2.1. Background

In January 2021, the DCMS launched the OSDI programme to test methodologies to facilitate better access to higher quality data and resources for developing and testing online safety technologies.

The following definition is used by the UK Government: ‘Safety tech providers develop technologies or solutions to facilitate safer online experiences, and protect users from harmful content, contact or conduct.’

As part of Phase 1 of the OSDI programme, a Data Landscape Review was commissioned, which uncovered a range of data access and sharing barriers faced by online safety tech providers. Key barriers included:

  • Substantial data gaps for some harm types

  • Inconsistent documentation

  • Lack of high-quality datasets, with datasets typically containing only text

  • Use of different taxonomies

  • Commercial use is not always allowed

  • Closed data are held by different parties including online platforms, government, online safety tech providers, researchers and civil society organisations

  • Safety and ethics

  • Cost, time and logistics

  • Laws and regulations

Despite this, the safety tech sector has continued to mature rapidly, both in the UK and internationally. In this context, the 2022 Data Landscape Review was commissioned to establish areas of change and continuity in data access and sharing, revisiting the above barriers. It highlighted that while acquiring and labelling high-quality data is essential, significant barriers to data sharing remain. Nonetheless, government initiatives are driving positive change.

2.2. Objectives of the report

Since the 2022 Data Landscape Review, the online safety landscape has been evolving. Over the past two years, bad actors have increasingly exploited emerging technologies and new digital environments. In response, the online safety tech sector has continued to evolve at a rapid pace, developing innovations to combat these threats. In this context, this report (2025 Data Landscape Review) was commissioned by the DSIT to update the understanding of data access needs and barriers faced by safety tech providers since January 2023.

Building on the 2021 Data Landscape Review and the 2022 Data Landscape Review, the report aims to:

  • Establish areas of continuity and change compared to findings from the 2022 Data Landscape Review;

  • Explore and highlight new data sharing initiatives and mechanisms relevant to the online safety tech sector;

  • Understand how these findings fit into the wider policy, market and ecosystem context.

As part of the 2025 Data Landscape Review, PUBLIC has:

  • Conducted 13 one-to-one stakeholder interviews with a representative cross-section of the online safety tech sector, including UK and international providers, industry associations, non-profit organisations, and regulators.

  • Issued an online survey, targeting a diverse range of online safety tech providers, with 9 responses received.

  • Reviewed and analysed 55+ reports relevant to this research.

For details on the methodology, see the Appendix.

This report presents key findings from the 2025 Data Landscape Review, integrating insights from the 2021 and 2022 Data Landscape Reviews to provide a comprehensive, up-to-date view of the evolving data landscape.

3. The Policy and Safety Tech Market Context

Since January 2023, the online safety tech sector has continued to grow, despite a challenging global investment climate and persistent data access barriers.[footnote 3] Investor interest remains steady, particularly in emerging solutions leveraging generative AI (GenAI). However, as an AI-driven industry, access to high-quality, representative training data remains a critical constraint.

Globally, new online safety legislation is driving the expansion and globalisation of the safety tech market. Key developments include the UK’s Online Safety Act (OSA), the EU’s Digital Services Act (DSA), the US Kids Online Safety and Privacy Act (KOSPA), India’s forthcoming Digital India Act, and Australia’s Online Safety Amendment (Social Media Minimum Age) Act. These regulations are increasing the demand for standardised, high-quality datasets, crucial for advancing safety tech solutions. Yet, while these efforts are pushing the market forward, their long-term impact on data access and sharing remains uncertain as their practical implications continue to unfold.

In the UK, the OSA, which received Royal Assent in October 2023, marks a significant step in regulating online safety by introducing new duties on social media companies and search services to reduce the risk of illegal activity on their platforms and remove illegal content once identified.[footnote 4] While the OSA is seen as a key driver for the UK safety tech sector, its true impact will largely depend on how effectively Ofcom, the regulator leading its implementation, exercises its powers. According to industry stakeholders interviewed during this research, there is a prevailing ‘wait and see’ attitude toward its implementation. While the OSA clarifies legal frameworks, its enforcement and practical application remain uncertain, leaving the industry unsure of its long-term impact.

Interviews with online safety tech providers reveal cautious optimism, though many expressed doubts about how far online safety legislation will deliver meaningful change. In particular, the pace of regulatory development and the cost of compliance are seen as barriers to progress. Despite these challenges, there is hope that clear guidance and strong enforcement will ultimately allow the sector to grow and develop more effective solutions.

Alongside the OSA, the Data (Use and Access) Bill, introduced in Parliament in October 2024, aims to modernise the UK’s data protection framework. It includes new rules to support responsible data sharing where there is a clear public interest - such as safeguarding vulnerable people or preventing crime - providing organisations with greater legal clarity while maintaining strong protections for individuals’ rights. The Bill sits alongside complementary frameworks such as the UK Digital Identity and Attributes Trust Framework, which establishes rules for creating secure, reliable digital identity services.

New technologies are influencing both the threat landscape and data access solutions in the safety tech sector. GenAI introduces both new risks and commercial opportunities. It enables the rapid creation and widespread distribution of harmful content, with low technical entry barriers facilitating its proliferation. The realism of AI-generated material makes it increasingly hard to distinguish from authentic material, and the growing volume of extreme and harmful content raises concerns about desensitisation and the normalisation of harmful behaviours.[footnote 5] Fine-tuning models with biased data further amplifies these risks, complicating safety efforts.[footnote 6]

However, synthetic data is emerging as a promising solution. By creating artificial datasets that simulate real-world scenarios, synthetic data is gaining increasing interest in safety tech. It helps to overcome barriers in data collection, storage, and sharing, offering a viable and innovative alternative to traditional data sources and accelerating the development of safety tech solutions.[footnote 7]

3.1. Relevant UK Government Interventions

Between 2021 and 2024, the UK Government launched a number of flagship initiatives to support the development of the online safety tech sector, with some specifically aimed at improving data access and sharing.

  • Online Safety Data Initiative (OSDI)

Launched by the then DCMS in January 2021, the OSDI aimed to facilitate secure and ethical access to online harms data. It focused on designing approaches that enable trusted entities to access data needed to develop and improve safety tech solutions.[footnote 8]

As part of this, PUBLIC reviewed the key barriers to data access and sharing faced by the safety tech sector over two years (2021 and 2022), producing two internal reports that helped DCMS better understand safety tech providers’ pain points and needs, while also identifying new data sharing initiatives and mechanisms relevant to the sector.

  • Safety Tech Challenge Fund (STCF) Round 1 and 2

The Safety Tech Challenge Fund was launched to support the development of innovative technologies to address key online safety challenges. Round 1 focused on protecting children in end-to-end encrypted (E2EE) environments,[footnote 9] while Round 2 funded projects tackling the detection and disruption of CSAM links.[footnote 10] The Fund reflects the UK Government’s broader commitment to advancing technical solutions in the safety tech sector.

  • Privacy Enhancing Technologies (PETs) Prize Challenges

Launched in 2022, the PETs Prize Challenges were a joint initiative between the UK and US governments to drive innovation in PETs. The challenges supported the development of solutions that enabled secure data collaboration while protecting privacy and intellectual property, addressing global issues such as financial crime and public health emergencies. These efforts demonstrated the potential of federated learning to unlock the benefits of data collaboration, especially as global data privacy regulations evolved. The initiative highlighted the value of international collaboration between governments, regulatory bodies, and innovators.[footnote 11]

  • Deepfake Detection Challenge

Launched in May 2024, the Deepfake Detection Challenge was a joint effort between the Home Office, the DSIT, the Alan Turing Institute, and the Accelerated Capability Environment (ACE) to develop innovative and practical solutions focused on detecting fake media. The Challenge highlighted that creating a dataset that was more representative of real-world operational scenarios would have been helpful. As part of developing next-step recommendations, ACE created a reusable ‘gold standard’ dataset. This dataset was designed to effectively test detection models, including those targeting CSAM.[footnote 12]

4. Data Access Landscape

4.1. Why is data important for online safety tech providers?

Safety tech providers rely on data to develop tools that detect and mitigate harmful online content and behaviour. These tools often utilise AI, which requires extensive and diverse datasets to train and test models effectively. AI systems are inherently data-dependent; they learn from training data, which acts as educational material for the model. If this data is inadequate, incomplete, or inaccurate, the resulting AI models may produce unreliable or incorrect outcomes.[footnote 13] The performance of these systems is therefore directly linked to the quality of the data they are trained on. Access to high-quality data enables providers to improve the accuracy and reliability of their AI systems, ensuring they can identify and respond to online threats appropriately.

In recent years, the focus in AI development has shifted from primarily enhancing algorithms (a model-centric approach) to emphasising the quality and diversity of the data used (a data-centric approach). This shift reflects the principle that ‘garbage in, garbage out’ - highlighting the importance of how data is acquired, labelled, and used in building effective AI systems.[footnote 14] Experts consistently emphasise the critical role of high-quality data in building robust classifiers.[footnote 15] Moreover, as interviewees pointed out, data is essential not only for training but also for testing, requiring larger and more diverse datasets. Using the same data for both training and testing would result in overly optimistic (and unrealistic) outcomes. Without sufficient and varied data, enhancing the performance of AI models becomes a significant challenge.
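To illustrate why held-out test data matters, the following minimal Python sketch trains a toy text classifier and scores it on both its training data and a held-out test set. The toy dataset and scikit-learn pipeline are illustrative assumptions, not any provider’s actual setup.

    # Minimal sketch: why training and test data must be kept separate.
    # The toy dataset below is a placeholder; real safety tech pipelines
    # use far larger, representative datasets.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    texts = ["buy followers now", "lovely holiday photos", "claim your prize",
             "dinner with friends", "click this scam link", "morning run today"] * 10
    labels = [1, 0, 1, 0, 1, 0] * 10  # 1 = harmful, 0 = benign (toy labels)

    # Hold out data the model never sees during training.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42)

    vectoriser = TfidfVectorizer()
    model = LogisticRegression().fit(vectoriser.fit_transform(X_train), y_train)

    # Training accuracy overstates performance; the held-out test set
    # gives the realistic estimate.
    print("train:", accuracy_score(y_train, model.predict(vectoriser.transform(X_train))))
    print("test:", accuracy_score(y_test, model.predict(vectoriser.transform(X_test))))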

For more details on how safety tech providers use data, and to explore the key questions around dataset features, data acquisition, and labelling processes, see the 2022 report in the Appendix.

Despite the clear importance of data, safety tech providers face substantial barriers to data access that hinder their ability to develop and refine these AI systems. Understanding these challenges is key to unlocking the full potential of safety tech innovation.

4.2. Barriers to data access and sharing

The online safety tech sector faces several barriers to growth. The most pressing concerns, as highlighted by most interviewees and survey responses, are market demand and awareness. However, data access and sharing remain a persistent challenge that has seen little change over the past two years and continues to hinder innovation and sectoral growth.

Barriers can be grouped into five categories (* denotes a new barrier; the rest are persisting barriers):

a. Competition and commercial barriers:

Barriers to data access:

  • Low perceived value

  • High cost and resource requirements

Barriers to data sharing:

  • Platforms’ unwillingness to share data

b. Legal and regulatory barriers:

Barriers to data access:

  • Shifting regulatory landscape and enforcement uncertainty

  • Legal access constraints

Barriers to data sharing:

  • Security and privacy concerns

c. Data availability barriers:

Barriers to data access:

  • Data discoverability

  • Difficult access to government data

  • Data transparency

  • Access to multimodal data*

Barriers to data sharing:

  • Data silos

d. Data quality and standardisation barriers:

Barriers to data access:

  • Quality of data

  • Need for ‘noise’ in synthetic data*

Barriers to data sharing:

  • Differences in classification standards

e. Ethical and cultural barriers:

Barriers to data access:

  • Data provenance and ethical risks

  • Ethical concerns surrounding synthetic data*

4.2.1. Current barriers to data access and sharing

Barriers to data access

Online safety tech providers continue to face significant barriers to data acquisition, many of which have persisted since previous reviews. These barriers can be grouped into five categories:

a. Competition and commercial

  • Low perceived value: Many customers struggle to recognise the value offered by safety tech solutions, particularly as long-term benefits are often difficult to quantify. This makes it difficult for providers to finance the costs of data acquisition - especially given that developing or acquiring high-quality datasets is both costly and time-consuming. As a result, adoption is slowed and commercial incentives to invest in better data are reduced, creating a barrier to data access.

  • High cost and resource requirements: Acquiring more data requires significant effort, time, and financial investment, even before the labelling process begins. For small or early-stage providers, this can divert key staff away from other critical tasks. With budgets shrinking, many providers, particularly startups, have pivoted from areas like child protection to more commercially viable areas such as brand protection. The barriers to entry remain high, and as a result, many startups initially rely on open data sources but quickly encounter limitations, unable to access the higher-quality, proprietary data needed to scale their solutions.

b. Legal and regulatory

  • Shifting regulatory landscape and enforcement uncertainty: Providers have highlighted the importance of complying with local laws and regulations, which can vary by geography, content type, and its origins.[footnote 16] While recent regulatory developments, such as the OSA, are seen as a positive step for online safety, their impact depends on effective implementation. The OSA and accompanying regulatory frameworks are still in the early stages, which may lead to some uncertainty among providers about its long-term effects and enforcement consistency.

  • Legal access constraints: Data access remains impacted by legal frameworks, including privacy laws and intellectual property protections, which - while recognised as necessary - can create challenges to access. Online platforms retain control over their data, often limiting external access unless through direct partnerships. For example, one interviewee noted that to access data for a solution aimed at protecting children online, they had to sign licence agreements and pay to access adult content for model training, due to commercial IP concerns. Even small modifications, such as blurred images, can complicate the legal and copyright position, further restricting data access. Ongoing government initiatives, such as the new National Data Library (NDL) project,[footnote 17] aim to improve secure access to public sector data and enable more efficient, responsible data use across government and the wider economy.

c. Data availability

  • Data discoverability: Accessing relevant datasets remains a challenge for online safety tech providers. Even open-source datasets can be difficult to find, and one survey respondent noted that accessing data from social media platforms - whether through official programmes or scraping - has become more difficult, making it harder to understand evolving harms and develop accurate detection tools.

  • Difficult access to government data: Access to government data is difficult for online safety tech providers. While some noted limited collaboration - one provider specifically mentioned working with the police - interviewees generally agreed that accessing government-held data remains a significant challenge. One initiative addressing this is the Child Abuse Image Database (CAID) (see below), which provides a secure, controlled environment for law enforcement and relevant agencies to access critical data. Some providers expressed cautious optimism that initiatives like the UK Digital Identity and Attributes Trust Framework may improve access.

Child Abuse Image Database (CAID)

The Child Abuse Image Database (CAID) is a secure, national UK repository of illegal Child Sexual Abuse Material (CSAM), including images, videos, and associated metadata. Populated with material acquired by UK police and the National Crime Agency (NCA), CAID facilitates collaboration across UK law enforcement and plays a crucial role in combating CSEA. Developed by the Home Office in partnership with the police, industry, and international technology companies, CAID enhances investigative efficiency, supports indecent image removal, aids victim identification, and assists in apprehending offenders. It also contributes to global efforts to combat CSAM, partnering with organisations such as the Internet Watch Foundation (IWF) and international law enforcement agencies.

  • Data transparency: Data transparency also remains a significant challenge for online safety tech providers. Platforms often hold prevalence data on the scale of online harms but are reluctant to share this information. This lack of transparency makes it difficult for safety tech providers to fully understand the extent of specific online threats and impedes the development of effective solutions. The OSA introduces mandatory transparency reporting for certain categorised services, requiring annual disclosures on harmful content and safety measures. Although still pending implementation, this marks a positive step toward improving transparency.[footnote 18]

  • Access to multimodal data: While AI’s ability to analyse combined text, audio, and video offers powerful potential for online safety, access to this data is limited. One interviewee noted that while AI models are now technically capable of identifying nuanced harms like grooming, without access to diverse, multimodal datasets, their potential remains unrealised. In addition, high-quality, labelled datasets that include multiple modalities are scarce, and collecting and annotating them is time-consuming and expensive. Inconsistent data quality across modalities further affects the performance of multimodal systems.[footnote 19]

d. Data quality and standardisation

  • Quality of data: Online safety tech providers require high-quality datasets to train and test effective AI models. These datasets should be sufficiently large, consistently labelled, diverse, and representative. In particular, these datasets need to reflect real-world conditions; otherwise, the AI will not perform as designed and expected. Real-world data is often ‘noisy,’ with issues like low resolution, poor lighting, and inconsistent quality. As several interviewees noted, it is therefore important for training and testing datasets to include harmful content that reflects this noise. A dataset with only ‘perfect’ data is not truly high quality - it lacks the diversity and complexity needed to train accurate models.

  • Need for ‘noise’ in synthetic data: AI models require exposure to ‘noisy data’ - content that includes imperfections like blurry images, background clutter, or poor lighting - to train effectively for real-world conditions. This challenge is amplified by the increasing use of synthetic data, which, while valuable for addressing data scarcity, often lacks the imperfections of real-world data. As interviewees noted, synthetic datasets can be ‘too perfect,’ missing the variability and noise present in actual online environments. This makes it harder to build models that accurately reflect the complexities of online harms.

e. Ethical and cultural

  • Data provenance and ethical risks: Providers face both reputational and legal risks when using data with unclear provenance, particularly for sensitive harm types. Datasets sourced from clients or openly accessed may lack transparency regarding their origins, complicating compliance and raising concerns about the legality and ethical use of the data. Data protection regulations, such as the UK General Data Protection Regulation (GDPR), require that personal data be obtained lawfully and fairly, which can be challenging when the source of the data is uncertain. While legal exemptions exist - such as those under Article 14 of the UK GDPR - providers appeared especially concerned about ethical risks, including uncertainty around the integrity of training data, potential data poisoning, and opaque data labelling practices, all of which could affect data quality.

  • Ethical concerns surrounding synthetic data: While synthetic data offers new alternatives for data acquisition, there are risks, such as the accidental creation of CSAM when generating content using GenAI. To mitigate these risks, tools have been developed to prevent the unintended creation of harmful content, yet the ethical implications remain a significant concern for providers, as highlighted in both interviews and survey responses.

Barriers to data sharing

While many barriers to data access persist, challenges to data sharing add an additional layer of complexity. Technical, legal, and organisational obstacles continue to impede effective collaboration across platforms. These barriers can be grouped into four categories:

a. Competition and commercial

  • Platforms’ unwillingness to share data: Many large platforms are reluctant to share data, viewing it as a key competitive asset. With limited incentives to make their data accessible, these platforms only share it when it aligns with their financial or regulatory interests. As one respondent noted, platforms are often hesitant to make changes without clear regulatory guidance. As a result, safety tech providers often struggle with accessing enough data to develop effective solutions. Platforms continue to prioritise retaining control over their data for commercial advantage, further hindering progress in the sector.

b. Legal and regulatory

  • Security and privacy concerns: Interviewees recognised that strong safeguards and compliance with data protection laws - such as the UK GDPR and the Data Protection Act 2018 - are essential when handling highly sensitive data, especially CSEA material and data involving minors. At the same time, secure data sharing remains a significant challenge for safety tech providers, primarily due to privacy concerns and the complexity of implementing those safeguards in practice. These challenges are further heightened when sharing data across borders. In a project involving cross-national data sharing, one interviewee described the significant challenges posed by the lengthy vetting process. They detailed legal reviews, data protection impact assessments, and multiple access agreements required before data transfer, which severely complicated their collaborative work.

c. Data availability

  • Data silos: Safety tech providers face constraints combining datasets for expanded usage due to data protection laws and contracts. This limitation obstructs cross-platform data pooling, impacting risk intelligence systems. Although providers may have agreements to use data from multiple platforms for specific purposes, aggregation for broader use is prohibited for legal and commercial reasons. An interviewee noted that these constraints limit intra-platform data sharing for model training across different tools, preventing the use of readily available data within a platform, despite its clear potential.

d. Data quality and standardisation

  • Differences in classification standards: The lack of standardisation in classification methods and hashing techniques creates cross-platform challenges, making effective data sharing, integration, and comparison difficult. However, initiatives like the Lantern Program are working to overcome these challenges by creating standards that enable secure and responsible sharing of signals across platforms.

Lantern Program

The Lantern Program, launched in late 2023 by the Tech Coalition, brings together technology companies and select financial institutions to share signals related to online CSEA. These signals - such as CSAM hashes, email addresses, usernames, and keywords related to grooming or purchasing CSAM - serve as indicators of harmful activity, enabling platforms to identify and act on threats in real time, while also strengthening reporting to the National Center for Missing and Exploited Children (NCMEC) and law enforcement, supporting investigations into illegal activity.[footnote 20] By establishing clear guidelines for data sharing, Lantern promotes cross-platform collaboration while ensuring privacy and ethical standards are met. Built with safety and privacy by design, the program undergoes regular reviews and stakeholder engagement, representing a significant step forward in improving data sharing practices within the online safety sector.[footnote 21]

4.2.2. What barriers to data access and sharing have been resolved?

Despite the persistent challenges of data access and sharing in the safety tech sector, recent advancements are beginning to address some of these barriers. Through technological innovations and evolving industry practices, providers are finding new ways to reduce reliance on large datasets and overcome data scarcity. Key resolved barriers include:

  • Reduced dependence on large datasets through AI advancements: Recent advancements in AI and LLMs, such as data augmentation, transfer learning,[footnote 22] and few-shot prompting,[footnote 23] have started to alleviate some of the reliance on large datasets for certain online safety tech applications. As one interviewee noted, LLMs now enable content classification with less training data while still developing effective models. This reduces the need for extensive datasets to build accurate classifiers. This innovation is especially valuable in safety tech solutions where acquiring large, labelled datasets has traditionally been a major challenge.

  • Overcoming data scarcity by leveraging external data: As noted in interviews, some providers have found alternative solutions to data scarcity by leveraging data from other providers, enabling them to develop and deploy safety tech solutions without the need to classify or manage the content themselves. Additionally, many platforms are shifting from data-driven to service-driven approaches, which enhances collaboration and improves the sharing of information across platforms. These advancements are helping to reduce the barriers posed by data acquisition, making it easier for safety tech providers to scale their solutions while minimising the need for extensive proprietary datasets, all while ensuring privacy is maintained.

5. Evolving Data Needs & Emerging Technologies

5.1. New data needs

Advancements in AI, particularly with LLMs and multimodal AI systems, have reduced reliance on large-scale datasets for certain online safety tech applications. The ability of some LLMs to perform content classification using zero-shot and few-shot learning capabilities reduces the need for extensive task-specific datasets.[footnote 24] This development enables safety tech providers leveraging AI to focus on the refinement and improvement of their models, rather than the acquisition and management of vast amounts of raw data.
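As a concrete illustration of few-shot classification, the sketch below builds a prompt containing a handful of labelled examples and delegates the final labelling to a model. Here, call_llm is a hypothetical stand-in for whichever hosted or on-device model interface a provider actually uses; the prompt pattern, rather than any specific API, is the point.

    # Minimal sketch of few-shot content classification with an LLM.
    # `call_llm` is a hypothetical placeholder for a real model API;
    # the in-prompt examples replace a large task-specific training set.

    FEW_SHOT_PROMPT = """You are a content safety classifier.
    Label each message as SAFE or HARMFUL.

    Message: "You're brilliant, great work on the project!"
    Label: SAFE

    Message: "Send me photos or I'll share your address with everyone."
    Label: HARMFUL

    Message: "{message}"
    Label:"""

    def classify(message: str, call_llm) -> str:
        """Return 'SAFE' or 'HARMFUL' using only the examples in the prompt."""
        response = call_llm(FEW_SHOT_PROMPT.format(message=message))
        return response.strip().upper()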

Despite these advancements, high-quality data remains essential for safety tech. It is particularly important that data reflects real-world online harms, including ‘noisy data’ that mirrors the variability and imperfections of online environments and is representative of criminal imagery. As an interviewee noted, ‘noisy data’, such as low-resolution or poorly lit images, is crucial for training AI models to identify harmful content accurately.

Moreover, the importance of real-time data has increased in the evolving landscape of online safety, as recognised by industry stakeholders. The immediate analysis of live data streams enables faster decision-making and more proactive threat detection.[footnote 25] The quicker a potential threat is identified, the sooner it can be contained,[footnote 26] improving contextual awareness and preventing escalation. However, using real-time data requires robust data protection measures to ensure privacy and maintain user trust.[footnote 27]

5.2. New technologies in addressing data needs

Faced with persistent data access and sharing barriers, the online safety sector is exploring innovative technologies to address critical data needs. This was a key focus of the research, with stakeholders across the sector discussing workarounds employed to tackle insufficient datasets. Synthetic data emerged as a prominent and impactful technology for addressing data scarcity and enabling more flexible data usage. Desk research and stakeholder input from both interviews and the survey further highlighted the following promising technologies for addressing these challenges.

5.2.1. Synthetic data

Synthetic data refers to data which is generated programmatically by mimicking real-world phenomena. There has been an increase in the use of synthetic data by online safety tech providers to address data scarcity, offering a feasible alternative for training AI models when large, real-world datasets are unavailable. This trend reflects growing interest since the last report, as highlighted in both interviews and survey responses.

As the collection and management of datasets for online safety becomes more complex - due to privacy concerns, the sensitivity of digital trace data, and potential harms to both annotators and data subjects - synthetic data offers a promising solution.[footnote 28] By generating novel, fabricated data that mimics real-world datasets, synthetic data helps augment existing datasets, increasing both their volume and diversity. It also enables the creation of hypothetical scenarios that enhance system robustness and reduce bias by representing underrepresented groups. While primarily used for testing and validation, synthetic data alleviates the reliance on customer-supplied examples, providing an alternative for generating training datasets in specific use cases.

Stakeholder interviews highlight the value of synthetic data in anonymisation and its role in facilitating compliance with privacy regulations, such as the UK GDPR. By augmenting datasets with synthetic data, like neutral images or noise, providers can improve model accuracy and reduce false positives, particularly in content moderation.
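By way of illustration, the following minimal Python sketch generates ‘noisy’ variants of a clean image of the kind described above. It assumes the Pillow and NumPy libraries, and the blur radius, brightness factor, and noise level are illustrative values rather than tuned parameters.

    # Minimal sketch: degrading clean images to mimic real-world 'noise'
    # (blur, poor lighting, sensor grain). Parameters are illustrative.
    import numpy as np
    from PIL import Image, ImageEnhance, ImageFilter

    def add_noise_variants(image: Image.Image) -> list[Image.Image]:
        variants = []
        # Simulate an out-of-focus or low-quality camera.
        variants.append(image.filter(ImageFilter.GaussianBlur(radius=2)))
        # Simulate poor lighting.
        variants.append(ImageEnhance.Brightness(image).enhance(0.4))
        # Simulate sensor grain with additive Gaussian noise.
        arr = np.asarray(image).astype(np.float32)
        noisy = np.clip(arr + np.random.normal(0, 15, arr.shape), 0, 255)
        variants.append(Image.fromarray(noisy.astype(np.uint8)))
        return variants

    # Usage: expand a clean training set with degraded variants.
    # clean = Image.open("neutral_example.png")
    # augmented = add_noise_variants(clean)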

While synthetic data holds promise, its practical application is limited by ethical concerns and the challenge of generating realistic data for sensitive domains like CSAM. One of the interviewed providers noted that synthetic data can still enhance machine learning training for neutral content to detect CSAM, but concerns remain. Surveyed providers also expressed uncertainty about the quality and real-world relevance of synthetic data, as well as the lack of understanding of data provenance, which could lead to potential legal and reputational risks.

Case study 1: Enhancing Machine Learning (ML) with synthetic data

Organisations involved: SafeToNet

Type of initiative: Use of synthetic data

Geography: UK

Specific data access / sharing barriers addressed: Limited real-world image/video data for training ML models

Type of data used: Synthetic data

SafeToNet generates synthetic data to improve ML training for neutral content, addressing data scarcity challenges in Safety Tech. Their AI-generated imagery and video replicate real-world conditions such as poor lighting, low-quality cameras, and blurring, ensuring more accurate content moderation while reducing false positives that could disrupt user experience.

To maintain ethical AI practices, SafeToNet has implemented strict safeguards to prevent the accidental creation of synthetic CSAM. These controls ensure their generative AI only produces safe, neutral content, supporting responsible AI development.

By augmenting datasets with relevant noise and distortions, their synthetic data enhances model robustness, allowing AI systems to perform effectively under real-world conditions without relying solely on sensitive or difficult-to-source real imagery. Their ability to generate synthetic variations provides additional training diversity, supplementing real-world datasets where access is limited.

This initiative helps overcome data access barriers by providing a scalable and privacy-conscious way to expand training datasets without full dependence on proprietary or user-generated data. By integrating synthetic data, SafeToNet ensures more consistent data availability, enabling improved AI performance while reducing reliance on external sources and mitigating ethical and legal risks.

Building on the increasing recognition of synthetic data’s potential, several initiatives are leveraging this technology to address data access challenges. For example, in the financial sector, where access to sensitive data is highly controlled, synthetic data offers a compliant alternative, helping to overcome data scarcity, improve model accuracy, and support effective risk detection.

Case study 2: Leveraging synthetic data for AI-driven financial crime detection[footnote 29]

Organisations involved: The Alan Turing Institute, Plenitude Consulting, Napier AI, and the Financial Conduct Authority (FCA)

Type of initiative: Use of synthetic data

Geography: UK

Specific data access / sharing barriers addressed: Restricted access to real-world financial transaction data for AI-driven financial crime detection

Type of data used: Synthetic data

A collaboration between Napier AI, Plenitude Consulting, The Alan Turing Institute, and the Financial Conduct Authority (FCA) will leverage the FCA’s Digital Sandbox to develop a fully synthetic dataset of anonymised financial transactions, enriched with a wide range of money laundering typologies. This initiative addresses data access barriers by providing a privacy-compliant environment where firms can train, test, and validate AI models for money laundering detection without relying on restricted real-world financial data.

Limited access to sufficiently realistic financial data has been a significant barrier to advancing AI-driven money laundering detection. By generating synthetic data that accurately reflects financial crime patterns, this initiative enables institutions to rigorously assess AI models’ effectiveness in a controlled setting, while also evaluating the dataset’s ability to support effective risk detection.

This project will help financial institutions enhance crime detection while ensuring regulatory compliance. It sets a precedent for broader industry adoption of privacy-preserving techniques, showing how synthetic data can address data access challenges and enable AI-driven safety technologies across regulated sectors without compromising sensitive information.

5.2.2. Use of open-source intelligence (OSINT)

OSINT techniques refer to the collection and analysis of publicly available data. By leveraging information from sources like social media posts, news articles, and public records, organisations can identify emerging risks and trends without relying on proprietary datasets. This approach can help overcome data limitations while ensuring compliance with legal and regulatory requirements, as well as platform policies. In response to legal and privacy restrictions to data access, analysts are developing innovative OSINT techniques, such as ‘group of interest’ analysis, which examines aggregated data and indirect connections to infer details about individuals.[footnote 30]

While OSINT methods are established and have proven useful in various contexts, there appears to be less enthusiasm to adopt this technique in the safety tech sector, despite its potential. This could be due to a variety of factors, including the evolving nature of online threats and the perceived complexities of implementation.

Case study 3: Integrating OSINT to address data access gaps in threat detection[footnote 31]

Organisations involved: Resolver and LifeRaft

Type of initiative: Partnership for OSINT integration

Geography: UK and Canada

Specific data access / sharing barriers addressed: Limited access to proprietary datasets

Type of data used: Open data

Resolver partnered with an OSINT provider, LifeRaft, to address data access limitations in threat intelligence. LifeRaft’s Navigator platform - which collects real-time data from social media, forums, and the deep and dark web - enhances Resolver’s incident response capabilities by offering real-time monitoring of publicly available online content. Through this integration, Resolver clients gain improved visibility into potential risks without relying on proprietary data sources.

By combining OSINT with digital risk management tools, the partnership supports proactive threat detection across the public online ecosystem, enabling faster and more informed responses. This model expands access to high-quality threat intelligence while remaining compliant with data privacy regulations.

The collaboration shows how OSINT solutions can address intelligence gaps by offering scalable, timely risk detection without requiring direct access to sensitive or restricted datasets.

5.2.3. Privacy Enhancing Technologies (PETs)

PETs refer to technical methods that protect the privacy or confidentiality of sensitive information.[footnote 32] These technologies aim to reconcile the need for data access with the imperative to protect personal information, facilitating compliance with regulations such as the UK GDPR.

While PETs offer considerable benefits, their adoption across the online safety tech sector appears to be limited. Providers interviewed and surveyed generally acknowledge the potential of PETs but do not consistently prioritise or highlight them as impactful solutions for addressing data challenges. This may arise from technical complexities, a lack of widespread awareness, or perceived integration costs. Nonetheless, as data privacy concerns intensify, the strategic deployment of PETs could significantly enhance data sharing practices while safeguarding user privacy.

Federated learning

Among the various PETs, federated learning emerges as a particularly promising technology in online safety. It enables models to be trained locally on users’ devices, sharing only aggregated updates. Unlike traditional machine learning, which requires sending raw data to a central server, federated learning brings the model to the data, ensuring sensitive information remains on the user’s device.[footnote 33] This approach minimises the need for direct data sharing, enhancing privacy compliance and aligning with data minimisation principles. Particularly valuable for sensitive applications like content moderation, federated learning is generating growing interest, with a few providers and researchers in academia exploring its use.
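The core aggregation step can be illustrated in a few lines. The sketch below implements a simplified federated averaging round in Python with NumPy; the ‘model’ is a bare weight vector and the local update rule is a placeholder, so this shows the data flow (weights move, data does not) rather than a production system.

    # Minimal sketch of one federated averaging (FedAvg) round.
    # Only model weights leave each client; raw data never does.
    import numpy as np

    def local_update(weights, local_data, lr=0.1):
        """Runs on the client's own device. Placeholder update rule:
        one gradient-like step toward the local data mean."""
        return weights - lr * (weights - local_data.mean(axis=0))

    def federated_round(global_weights, client_datasets):
        """Server-side: average the clients' updated weights,
        never touching their underlying data."""
        client_weights = [local_update(global_weights.copy(), data)
                          for data in client_datasets]
        return np.mean(client_weights, axis=0)

    # Toy usage: three clients, each holding private local data.
    rng = np.random.default_rng(0)
    clients = [rng.normal(loc=i, size=(20, 4)) for i in range(3)]
    weights = np.zeros(4)
    for _ in range(5):
        weights = federated_round(weights, clients)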

Case study 4: Federated learning for privacy-preserving content moderation[footnote 34]

Type of initiative: Use of federated learning

Geography: Applicable globally across social media platforms

Specific data access / sharing barriers addressed: Limitations on accessing user data for AI moderation due to privacy regulations and platform restrictions on data sharing

Type of data used: User-generated text data from social media (simulated using Twitter datasets)

Harmful content is a persistent challenge on social media, but traditional machine learning models require centralised data collection, creating barriers due to privacy regulations and platform policies. This research explores Differentially Private Federated Learning (DP-FL) as a privacy-preserving alternative for harmful content detection. Instead of requiring direct access to user data, DP-FL trains models locally on users’ devices, ensuring sensitive information remains private while still enabling effective moderation.

The study simulates harmful text classification using data from Twitter (now X), demonstrating that DP-FL can achieve accuracy comparable to centralised models. Even when trained with fewer users or smaller datasets, the system remained effective in detecting harmful content across different types of online risks, showing strong potential for real-world implementation. The research also found that local model training required minimal computing power, making it feasible to deploy at scale without slowing down devices or demanding extensive technical resources.

By applying federated learning to content moderation, this research helps address data access barriers by enabling AI model development without centralised user data collection. It demonstrates how online safety tech providers could leverage decentralised AI to enhance online safety while maintaining compliance with privacy laws and reducing the risks associated with data sharing.
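The ‘differentially private’ element of DP-FL comes from perturbing each client’s contribution before it is shared. The fragment below sketches that step under simple assumptions (NumPy, a fixed clipping norm and noise scale); real deployments calibrate these parameters against a formal privacy budget.

    # Minimal sketch: making a client's model update differentially private
    # before sharing. Clipping bounds any individual's influence; Gaussian
    # noise masks what remains. Parameter values are illustrative only.
    import numpy as np

    def privatise_update(update, clip_norm=1.0, noise_scale=0.5):
        # Clip the update so no single client can dominate the average.
        norm = np.linalg.norm(update)
        clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
        # Add calibrated noise before the update leaves the device.
        return clipped + np.random.normal(0, noise_scale * clip_norm, update.shape)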

6. Overcoming Barriers: Data Sharing Initiatives, Collaborations, and Interventions

6.1. Collaborations to overcome data barriers

There has been a notable increase in collaborative efforts related to data sharing within the online safety ecosystem. These collaborations are unlocking new opportunities for improved data availability, ensuring a more robust and scalable response to online harms. These data sharing initiatives and partnerships can be summarised into five categories.

6.1.1. Collaboration among safety tech providers

In the safety tech sector, there appears to be a shift from data-driven approaches to service-driven models, particularly in the way data is shared. Instead of exchanging raw datasets, many safety tech providers now engage in ‘signal sharing’ collaborations, where processed insights - such as content classification labels or authenticity checks - are shared to improve threat detection and classification. This service-driven model allows providers to leverage key insights and AI-powered solutions without the need for extensive proprietary datasets.

These collaborations allow companies to provide targeted services and solutions that scale more effectively, without relying on direct access to sensitive or large datasets. This shift towards using shared signals, rather than raw data, promotes more agile and efficient service delivery. As a result, safety tech providers can better address emerging online harms, navigate data access barriers, and maintain privacy compliance, while fostering greater collaboration within the industry.
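To make the contrast with raw-data exchange concrete, the sketch below shows the kind of payload a signal-sharing arrangement might carry. The field names are hypothetical; the point is that a label and a content fingerprint travel between providers while the underlying content does not.

    # Minimal sketch of a shared 'signal': a processed insight about a piece
    # of content, exchanged without the content itself. Field names are
    # illustrative, not any provider's actual schema.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ModerationSignal:
        content_hash: str   # fingerprint of the item, not the item itself
        label: str          # e.g. "deepfake", "hate_speech", "benign"
        confidence: float   # classifier confidence in [0, 1]
        source: str         # which provider produced the signal

    signal = ModerationSignal(
        content_hash="b1946ac92492d2347c6235b4d2611184",
        label="deepfake",
        confidence=0.97,
        source="upstream-classifier",
    )
    # A receiving platform can act on the signal (block, queue for review,
    # report) without ever holding the raw media.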

Case study 5: Signal sharing in AI moderation[footnote 35]

Organisations involved: Checkstep and Sightengine

Type of initiative: Signal sharing

Geography: UK / France

Specific data access / sharing barriers addressed: Lack of direct access to large, proprietary datasets for AI moderation

Type of data used: Proprietary data

Checkstep and Sightengine collaborate to enhance AI-driven content moderation by integrating GenAI and deepfake detection signals into Checkstep’s workflows. Instead of traditional data sharing, this initiative relies on signal sharing, where Sightengine provides content classification outputs without transferring raw data. This enables Checkstep to implement AI moderation at scale without requiring direct access to proprietary datasets.

By aggregating moderation signals, Checkstep automates content enforcement, supporting policy compliance and scalable risk mitigation. This service-driven approach allows platforms to leverage AI-powered moderation without needing extensive in-house data resources.

Sightengine’s pre-trained AI models detect deepfakes, AI-generated content, nudity, hate speech, and other harmful material. Checkstep integrates these insights into its workflows, enabling real-time moderation while ensuring data privacy and regulatory compliance.

By shifting from data sharing to signal sharing, this initiative helps platforms moderate harmful content effectively, addressing data access challenges while maintaining privacy-conscious, scalable enforcement.

6.1.2. Partnerships with NGOs

Partnerships with NGOs have emerged as an essential strategy for safety tech providers to address data access challenges. Global NGOs, such as the IWF, the NCMEC for CSAM, and the Global Internet Forum to Counter Terrorism (GIFCT), host online harm databases, which are crucial for training and testing content moderation and identification tools. Collaboration with these NGOs enables online safety tech providers to access curated data without directly handling sensitive materials, mitigating legal and ethical concerns.

The IWF, in particular, plays a critical role in combating online CSAM by curating and sharing valuable datasets with its members. Utilising various data collection methods, including URL lists,[footnote 36] image hashes,[footnote 37] and keyword lists,[footnote 38] the IWF provides these datasets to support content moderation efforts. Approximately 200 organisations worldwide access these datasets through agreements with the IWF.

Case study 6: Using hash lists to overcome data access barriers

Organisations involved: Videntifier

Type of initiative: Hash list integration for content moderation

Geography: Global

Specific data access / sharing barriers addressed: Legal and ethical restrictions on storing or accessing harmful content

Type of data used: Curated database (pre-classified hashes of known harmful content)

Videntifier’s video and image identification platform detects and removes harmful content without storing, processing, or manually reviewing raw visual material. Instead of requiring direct access to explicit content, Videntifier utilises hash lists - unique digital fingerprints assigned to known harmful images and videos - to match newly uploaded media against reference databases. These databases include pre-classified hashes from trusted sources such as IWF, NCMEC, and the Canadian Centre for Child Protection (for CSAM) and StopNCII (for non-consensual intimate images), enabling swift detection without dataset ownership or exposing platforms and moderators to sensitive material. The platform is continuously evolving and will incorporate hash databases for additional content types, such as terrorist content and deepfakes.

Hash-sharing initiatives help overcome legal and ethical barriers to data access by allowing platforms to collaborate on content moderation without directly handling harmful material. By integrating these resources, Videntifier enables large-scale detection and removal of illegal content while addressing dataset access restrictions. Additionally, its approach enhances collaboration by providing different organisations with access to a shared repository of verified hashes, strengthening collective efforts to detect, monitor, and act against harmful content.

By leveraging pre-classified, shareable hashes, Videntifier addresses key data access and sharing barriers that often hinder content moderation. Its privacy-preserving detection model ensures scalable, accurate, and legally compliant moderation, demonstrating how hash-based solutions support safer digital environments while upholding regulatory requirements.
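In its simplest form, hash-list matching reduces to comparing a fingerprint of each upload against a set of known-bad fingerprints, as sketched below in Python. An exact SHA-256 hash is used for illustration; systems like those described in this case study typically also rely on perceptual hashes, which tolerate minor alterations such as resizing or re-encoding.

    # Minimal sketch of hash-list matching at upload time.
    # Exact SHA-256 hashing is used for illustration; production systems
    # typically add perceptual hashing to catch slightly altered copies.
    import hashlib

    # Hypothetical hash list, as it might be supplied by a trusted source.
    KNOWN_HARMFUL_HASHES = {
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    }

    def is_known_harmful(upload: bytes) -> bool:
        """Match an upload against the hash list without ever storing
        or viewing the original harmful material."""
        return hashlib.sha256(upload).hexdigest() in KNOWN_HARMFUL_HASHES

    # Usage: screen content at ingestion.
    assert is_known_harmful(b"test")          # matches the example hash
    assert not is_known_harmful(b"benign")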

6.1.3. Engagement with public authorities

Collaborations with public authorities offer another valuable route to addressing data access barriers. While direct access to government-held data remains limited, regulatory engagement and partnerships provide alternative solutions.

These collaborations often involve guidance on compliance, clarification of legal frameworks, and the development of tools and best practices for data governance. Such partnerships can help safety tech providers navigate regulatory landscapes, build trust with the public, and create industry-wide precedents that improve data usage and sharing practices.

Case study 7: Regulatory backing for facial age estimation before the Age Appropriate Design Code

Organisations involved: Information Commissioner’s Office (ICO) and Yoti

Type of initiative: ICO Regulatory Sandbox

Geography: UK

Specific data access / sharing barriers addressed: UK GDPR compliance when using sensitive data

Type of data used: In-house data and opt-in parental data collection

Yoti joined the ICO Regulatory Sandbox ahead of the Age Appropriate Design Code to explore expanding its facial age estimation technology for children aged 6–13. While the ICO did not provide data per se, its regulatory guidance helped Yoti navigate compliance challenges related to biometric data, develop education materials, and launch a voluntary opt-in campaign for parents to provide data.

A key outcome was the ICO’s End of Sandbox exit report and subsequent update to its biometric data guidance, distinguishing facial age estimation from facial recognition and clarifying its classification under UK data protection laws. This set industry precedents, improving clarity around data use for biometric age assurance solutions and supporting the adoption of similar approaches beyond the UK.

Following its work with the ICO, Yoti expanded regulatory engagement abroad, including with Germany’s FSM and KJM, strengthening its position in age assurance. Through the sandbox, it created accessible material for young people, parents, educators, and policymakers. It also supported the ICO’s Age Appropriate Design Code audit, volunteered for a self-audit, and advanced understanding through roundtables, panels, and research - showing how collaboration can improve data governance.

Beyond this, the Sandbox strengthened Yoti’s engagement with Ofcom and international regulators shaping online safety for minors. By participating and helping clarify the role of facial analysis and biometric data, Yoti addressed data-related barriers while promoting responsible AI development.

6.1.4. Cross-sector collaborations for data sharing

Cross-sector collaborations for data sharing have emerged as a critical solution for overcoming data access barriers in the safety tech sector. By bringing together stakeholders from different industries, including government agencies, research institutions, and professional associations, these partnerships enable the sharing of essential data while navigating regulatory, financial, and logistical challenges.

These collaborations drive innovation and scalability by providing access to a broader range of datasets, including proprietary and research-only data, all while maintaining privacy compliance and addressing ethical concerns. They are especially valuable in areas such as age estimation and CSAM detection, where high-quality datasets are often limited. These partnerships demonstrate the potential of collective action to drive meaningful solutions to online harms, ensuring safer, more inclusive digital environments.

The following case study explores a collaboration in the UK involving a safety tech provider, a community-owned, non-profit photo journal website, and an NGO, working together to improve access to CSAM detection data for smaller platforms.

Case study 7: Expanding access to IWF’s CSAM hash list for small platforms[footnote 39]

Organisations involved: Cyacomb, IWF, and Blipfoto

Type of initiative: Privacy-preserving access to CSAM detection data for small platforms

Geography: UK

Specific data access / sharing barriers addressed: Cybersecurity requirements and financial constraints when accessing CSAM database

Type of data used: Curated database (IWF hash list)

The IWF plays a critical role in combating CSAM by providing a hash list of known illegal content to help platforms detect and remove harmful material. However, smaller platforms, such as Blipfoto, often struggle to access this data due to cybersecurity requirements and financial constraints. To bridge this gap, IWF and Cyacomb partnered with Blipfoto to pilot a privacy-preserving solution that allows small platforms to benefit from the IWF hash list without handling or storing it directly.

Cyacomb’s Safety solution acts as a secure intermediary, enabling platforms to integrate CSAM detection without requiring direct access to sensitive data. The tool operates within the platform’s infrastructure, ensuring compliance with security standards while reducing operational complexity. By preserving privacy and allowing seamless implementation, it enables small platforms to deploy CSAM detection without extensive technical expertise or security risks.
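
Cyacomb's implementation is not described in this report; purely as an illustration of how a platform might test content against a hash list it never directly holds, the Python sketch below uses a Bloom filter, a standard probabilistic structure for privacy-preserving membership tests. The class and names are hypothetical assumptions, not a description of Cyacomb Safety.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter. Queries return 'possibly present' or
    'definitely absent', and the bit array does not reveal the underlying
    hash list - so a platform can hold the filter without holding the
    raw CSAM hashes."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# The intermediary builds the filter from the hash list and ships only the
# filter; positive matches are escalated for confirmation, since a Bloom
# filter can produce (rare) false positives but never false negatives.
```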

Through this initiative, Blipfoto gained access to the IWF hash list, equipping it with the same high-level CSAM detection capabilities as larger platforms while overcoming financial and technical barriers. The pilot not only strengthened Blipfoto’s ability to detect and remove harmful content but also demonstrated how scalable, cost-effective solutions can extend CSAM detection to smaller platforms. By enabling secure, controlled access to critical safety data, this project removes barriers for smaller platforms, demonstrating how broader industry participation can strengthen online safety efforts.

The next initiative highlights a cross-border and cross-sector collaboration between the UK and Switzerland, bringing together a safety tech provider, a professional association, a certification body, and a research institute to address data access challenges in age estimation and safeguard against generative AI threats.

Case study 8: Collaborative data access for safeguarding age estimation

Organisations involved: Privately, the Age Verification Providers Association (AVPA), Age Check Certification Scheme (ACCS), and Idiap Research Institute

Type of initiative: Cross-sector data collaboration

Geography: UK and Switzerland

Specific data access / sharing barriers addressed: Legal, regulatory, and logistical restrictions on accessing and sharing biometric datasets for AI model training

Type of data used: Proprietary data, research datasets, synthetic data

A collaboration between Privately, Age Check Certification Scheme (ACCS), Idiap Research Institute, and the Age Verification Providers Association (AVPA) is developing defences against generative AI threats to age estimation systems. Ensuring AI models can detect and counter manipulation attempts is crucial, but access to high-quality training data remains a challenge, as datasets featuring minors are governed by strict privacy, legal, and ethical constraints. Funded by Innovate UK and Innosuisse, this initiative brings together a research institution, an age assurance provider, an industry association, and a certification body to overcome these barriers.

To address data access challenges, the project established structured data sharing agreements, ensuring that each partner accesses only the data essential to their role while remaining compliant with UK GDPR and Swiss data protection laws. Research institutions such as Idiap handle the most sensitive datasets under secure conditions, while Privately relies on proprietary data for model optimisation.

To further address data access constraints, the project leverages synthetic data generation and will explore secure enclaves and federated learning to minimise reliance on direct dataset exchanges. These techniques enable AI model training without centralising sensitive biometric information, allowing models to learn from diverse data sources while maintaining regulatory compliance.
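
As a generic sketch of the federated learning pattern mentioned above (an illustrative assumption, not the project's actual stack), the Python example below performs federated averaging on simulated data: each partner computes a model update on data that never leaves it, and only the weights are averaged centrally.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One gradient-descent step on a partner's local least-squares loss."""
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, partners):
    """Each partner trains locally; the server averages only the weights."""
    updates = [local_update(global_weights, X, y) for X, y in partners]
    return np.mean(updates, axis=0)

# Three simulated partners, each holding private (X, y) data.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
partners = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    partners.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(3)
for _ in range(50):            # 50 federated rounds
    w = federated_round(w, partners)
print(w)                       # converges towards true_w; no raw data shared
```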

This initiative highlights how cross-sector collaboration can help address data access challenges in highly regulated industries. It offers a practical approach to developing AI-driven safety technologies that balance innovation, security, and regulatory compliance.

6.1.5. Customer-focused data sharing agreements

Customer-focused data sharing agreements represent a common and direct approach to addressing data needs within the safety tech sector. These agreements enable the sharing of customer data between platforms and providers, facilitating targeted training and testing of content moderation models. This direct access is particularly valuable for developing specialised tools tailored to specific challenges.

However, this approach raises important concerns regarding data privacy, security, and potential misuse, requiring robust contractual safeguards and clear ethical guidelines. While these agreements provide a general framework for data sharing, the specific terms will vary depending on the companies involved and the legal jurisdiction.

6.2. Impact of UK Government interventions

The UK Government has introduced several initiatives aimed at addressing data access and sharing challenges within the safety tech sector, including the Online Safety Data Initiative (OSDI), the Safety Tech Challenge Fund (STCF), the Privacy Enhancing Technologies (PETs) Prize Challenges, and the Deepfake Detection Challenge.

These initiatives are generally recognised by the sector, with stakeholders appreciating the government’s efforts to address key barriers. However, some safety tech providers believe that factors such as post-COVID budget constraints and unclear regulatory expectations have contributed to a cautious ‘wait-and-see’ approach within the industry, potentially slowing the progress of these initiatives.

While government support is highly valued and desired, there is a growing sense that momentum has slowed, particularly in the absence of new initiatives in recent years. As a result, many stakeholders believe that, without clear regulatory guidance and enforcement, meaningful improvements in data access are more likely to arise from commercial incentives.

7. Conclusion

High-quality, representative datasets are critical to safety technology providers working on products and services to tackle online harms. These include, amongst others, content classifiers, full content moderation services, and age assurance and age verification solutions. Facilitating better access to higher-quality data in the online safety tech sector, where appropriate, is essential and has the potential to improve the effectiveness of these products and services.

This research has sought to analyse barriers, to understand the areas of continuity and change from early 2022 onwards in relation to online safety tech providers’ access to data, and to highlight new data sharing initiatives, technologies and mechanisms that can contribute to accessing higher-quality data.

Through a combination of desk research, one-to-one stakeholder interviews and an online safety tech industry survey, we have found that data barriers continue to be a major obstacle to the evolution of the sector. Compared with early 2022, our research has identified a number of areas of progress:

  • Evolving needs. As the landscape of online safety technology continues to evolve, there is a clear shift from focusing on data quantity to data quality. High-quality, well-labelled, and representative datasets are in growing demand: the sector needs data that not only provides volume but also mirrors the complexity and diversity of real-world scenarios.

  • New technologies. Synthetic data has emerged as a new approach to addressing data scarcity. However, concerns persist around its ethical implications, particularly the risk of inadvertently creating harmful content such as CSAM. PETs are being explored to allow the analysis and sharing of sensitive data without compromising privacy, with federated learning a key focus of research and early-stage development.

  • Collaboration and partnerships. The online safety tech sector has seen a marked increase in collaboration and data sharing initiatives, including partnerships among safety tech providers, collaboration with NGOs, engagement with public authorities through regulatory sandboxes, and cross-sector agreements.

7.1. Areas of change compared to 2022

Increase
  • Data sharing initiatives: Safety tech providers have reported an increasing interest in data sharing and collaboration initiatives.

  • New technologies: Emerging technologies (e.g. synthetic data, OSINT, and federated learning) are being adopted by safety tech providers.

  • Laws and regulations: The Online Safety Act (OSA) has come into force, and the Data (Use and Access) Bill has progressed through Parliament.

  • Access and networks: The safety tech sector has benefited from partnerships and pilots with civil society organisations.

No change
  • Availability of high-quality datasets: Safety tech providers continue to struggle to access and acquire high-quality, labelled data to develop and test their products.

  • Awareness: Limited awareness beyond the safety tech sector of the available online safety technologies and how they operate.

  • Cost: Accessing data remains a time-consuming and costly exercise.

  • Government-backed initiatives: Beyond the Deepfake Detection Challenge, launched by the UK Government in 2024 to support better data for detecting child abuse deepfakes, few new government interventions have been introduced.

Appendix

2022 Data Landscape Review Key Takeaways

In 2022, the Department for Digital, Culture, Media and Sport (DCMS) commissioned PUBLIC to conduct an updated analysis of data access barriers faced by safety tech providers, building upon previous findings and examining new developments. This review identified key takeaways, including:

Barriers to data sharing

  • Online safety tech providers are still struggling to access and acquire high-quality, labelled data to develop and test their products.

  • In general, most providers had seen little change in data barriers over the previous 18 months. “The inability to access relevant, high quality, and substantial data sets” identified in the initial Data Landscape Review (March 2021) had changed little. As in 2021, interviews with online safety technology providers confirmed that data access and sharing barriers continued to be the biggest problem facing innovators, typically broken into: a) Commercial and competition barriers; b) Legal and regulatory requirements; c) Data standards barriers; and d) Ethical and cultural barriers.

Data access and acquisition

  • As industry adoption of online safety tech solutions increased, there was a gradual shift from providers relying on open data to greater use of closed data. However, this also created barriers to new entrants.

  • The sector had evolved over the previous 18 months. The need was not just for “more data”, but for more of the right type of data from platforms, broken down by content/harm type and supported by good data sharing mechanisms. While each provider was likely to have slightly different data gaps and needs, this was particularly true for firms working on harmful content below the illegal threshold.

  • While there remained appetite for agreed data classification, the majority of innovators had developed in-house taxonomies, which risked perpetuating the problem of a lack of common language and shared interpretation around online harms.

  • Some innovators continued to rely on open-source datasets, primarily from academic research or from platforms whose terms of service were more permissive of data access (e.g. Reddit). Open-source datasets continued to be of low quality, requiring laborious relabelling to be used effectively.

  • Access to closed databases, especially those containing Child Sexual Abuse Material (CSAM), remained difficult for online safety tech innovators, primarily due to legal and policy barriers. When access was granted, training models tended to be labour intensive and costly, primarily due to security barriers.

Data sharing initiatives

  • The UK Government had launched a number of flagship interventions (Safety Tech Challenge Fund, Online Safety Data Initiative, Privacy Enhancing Technologies (PETs) Prize Challenges).

  • Online safety tech providers were beginning to collaborate with each other to overcome data barriers and combine capabilities, driven by external initiatives including the UK Government’s Safety Tech Challenge Fund and proactive collaboration initiatives.

  • 78% of survey respondents (December 2022) were not participating in data sharing initiatives at the time, while 56% were interested in participating, suggesting unmet demand for new and deeper data sharing initiatives.

  • The UK Government-funded Safety Tech Innovation Network (STIN) and the Online Safety Technology Industry Association (OSTIA) continued to play a key role in fostering sectoral collaboration and driving investor awareness.

Solutions to overcome barriers

  • Research has further validated intervention opportunities around privacy-enhancing technologies (PETs) as well as generative AI to help facilitate data sharing.

  • Generative AI to create synthetic online harms data was seen as most valuable for earlier-stage companies working on the most sensitive online harms, both to increase data access for basic machine learning training and to reduce the wellbeing impact of human labelling. However, there were performance and ethical challenges to employing generative AI in online safety tech, and data outputs using generative AI did not compensate for access to real-world data.

The full 2022 Data Landscape Review report is available for those interested in further information.

Research Methodology

1. Desk Research:

Key Actions

  • Reviewed 55+ sources across relevant academic, industry, Government, and civil society reports, policy papers, and journal articles.

  • Sources were identified through targeted keyword searches (e.g. “data access”, “online safety tech”) across academic databases (e.g. Google Scholar), official websites of UK public authorities (e.g. GOV.UK, Ofcom, ICO), and relevant industry and third-sector platforms (e.g. IWF, Alan Turing Institute).

  • Inclusion criteria: Publications dated 2021 onwards that directly addressed data access, data sharing, or the use of emerging technologies in the safety tech sector.

  • Research findings were analysed against the five project questions and grouped into three themes: (i) policy, market, and regulatory context; (ii) data access barriers and evolving sector needs; and (iii) initiatives addressing those barriers. This helped identify key trends, policy gaps, and changes since the 2022 Data Landscape Review.

Outputs:

  • Key insights and trends identified from desk research were integrated with broader research findings throughout the report to ensure robustness and relevance.

2. Expert Interviews:

Key Actions:

  • Targeted a mixed set of safety tech providers, prioritised by geography (UK and international) and size (from large companies to small and medium-sized enterprises). We also interviewed regulators, non-profit organisations, and industry associations.

  • Conducted 13 one-to-one stakeholder interviews with a representative cross-section of the safety tech sector between 28 January 2025 and 5 March 2025.

  • Captured insights from interviews in an affinity map and identified emerging themes, challenges, and areas of continuity and change over the last two years.

Outputs:

  • Developed a stakeholder tracker with a breakdown of the type of safety tech provider, geography, and rationale for inclusion. The tracker also included a prioritisation for engagement, with justification.

  • An interview script for each interview, tailored to the specific organisation engaged.

  • An affinity map of key findings from stakeholder interviews, which was used to inform this report.

3. Industry Survey 2025:

Key Actions:

  • Developed a short survey in Google Forms alongside Perspective Economics in February 2025 to explore company performance and growth expectations, as well as data access and sharing challenges, and potential solutions.

  • Up to 31 DSIT-approved questions (some conditional) were asked in various formats, including unstructured responses, yes/no questions, multiple-choice questions, Likert scales (0-7), and ranked options.

Outputs:

  • 9 responses were received, of which 6 were relevant to this workstream.

  • Key insights and trends identified from the survey were integrated with broader user research findings throughout the report and were compared with 2022 survey results where appropriate.

Acknowledgements

We would like to thank the following organisations for their support on expert interviews, survey responses, and broader research inputs:

  • Arwen AI Ltd

  • Age Verification Providers Association (AVPA)

  • Checkstep Ltd

  • Cyacomb Ltd

  • Information Commissioner’s Office (ICO)

  • Internet Watch Foundation (IWF)

  • Office of Communications (Ofcom)

  • Online Safety Tech Industry Association (OSTIA)

  • Privately SA

  • Reality Defender Inc.

  • Resolver Ltd

  • SafeToNet Ltd

  • Streamshield Ltd

  • Securium Ltd

  • VerifyLabs.AI

  • Videntifier Technologies Ehf.

  • Yoti Ltd

  1. The Royal Society, Privacy Enhancing Technologies 

  2. Responsible Technology Adoption Unit Blog, Privacy-Preserving Federated Learning: Understanding the Costs and Benefits, 2024 

  3. Paladin Capital Group, PUBLIC, Perspective Economics, Aiken, M., International State of Safety Tech 2024, 2024 

  4. Department for Science, Innovation & Technology, Guidance: Online Safety Act: explainer, 2025 

  5. Internet Watch Foundation (IWF), What has changed in the AI CSAM landscape?, 2024 

  6. Paladin Capital Group, PUBLIC, Perspective Economics, Aiken, M., International State of Safety Tech 2024, 2024 

  7. The Alan Turing Institute, Exploring responsible applications of Synthetic Data to advance Online Safety Research and Development, 2023 

  8. Gov.uk Blog, [Online Safety Data Initiative](https://onlinesafetydata.blog.gov.uk/) 

  9. Rephrain, Towards a Framework for Evaluating CSAM Prevention and Detection Tools in the Context of End-to-end encryption Environments: a Case Study, 2023 

  10. PUBLIC, Evaluation of Safety Tech Challenge Fund Round 2, 2024 

  11. UK Research and Innovation, Privacy Enhancing Technologies (PETs) Prize Challenges winners, 2023 

  12. Accelerated Capability Environment, Innovating to detect deepfakes and protect the public, 2025 

  13. Forbes, The Importance Of Getting Data Right Before Using AI In Your Business, 2024 

  14. Centre for Emerging Technology and Security and The Alan Turing Institute, The Future of Online Safety: A data-centric approach, 2022 

  15. Soni, Atul, et al., Evaluating the Impact of Data Quality on Machine Learning Model Performance, 2023 

  16. Department for Science, Innovation & Technology (DSIT), Faculty, PUBLIC, Online Safety Tech Industry Association (OSTIA), OSDI Data Landscape Review 2022, 2022 

  17. Tony Blair Institute for Global Change, Governing in the Age of AI: Building Britain’s National Data Library, 2025 

  18. Office of Communications (Ofcom), Nowhere to hide for tech firms on online safety: Ofcom publishes draft industry guidance on transparency reporting and information gathering, 2024 

  19. European Data Protection Supervisor, Multimodal artificial intelligence 

  20. INHOPE, What is Lantern?, 2024 

  21. Tech Coalition, Announcing Lantern: The First Child Safety Cross-Platform Signal Sharing Program, 2023 

  22. Teradata, LLM Training Costs and ROI 

  23. IBM, What is few shot prompting?, 2024 

  24. Meshkin, H., et al., Harnessing large language models’ zero-shot and few-shot learning capabilities for regulatory research, 2024 

  25. RT Insights, Real-time Data Processing at Scale for Mission-critical Applications, 2024 

  26. TrueFort, Why Real-Time Behavior Analytics is Critical, 2024 

  27. AP News, Schools use AI to monitor kids, hoping to prevent violence. Our investigation found security risks, 2025 

  28. The Alan Turing Institute, Exploring responsible applications of Synthetic Data to advance Online Safety Research and Development, 2023 

  29. The Alan Turing Institute, New data science project uses synthetic data to address the main barriers to innovation in the field of money laundering detection, 2024 

  30. Penlink, From Social Media to Open-Source Data: The Layers of Digital Investigation, 2025 

  31. Resolver, LifeRaft Partners with Resolver to Enhance Threat Protection, 2023 

  32. Department for Science, Innovation and Technology, Privacy Enhancing Technologies: cost-benefit awareness tool, 2024 

  33. Open Data Institute, What is federated learning?, 2022 

  34. Leonidou, P., et al., Privacy-Preserving Online Content Moderation: A Federated Learning Use Case, 2023 

  35. Sightengine, The Ultimate Guide to GenAI Moderation — Part 2, 2025 

  36. Internet Watch Foundation, URL List 

  37. Internet Watch Foundation, Image Hash List 

  38. Internet Watch Foundation, Keywords List 

  39. Cyacomb, Cyacomb, Internet Watch Foundation and Blipfoto go live with pilot project enabling small platforms to block known Child Sexual Abuse Material, 2024