Research and analysis

Report: Enabling responsible access to demographic data to make AI systems fairer

Published 14 June 2023

1. Executive summary

The use of artificial intelligence (AI), and broader data-driven systems, is becoming increasingly commonplace across a variety of public and commercial services.[footnote 1] With this, the risks associated with bias in these systems have become a growing concern. Organisations deploying such technologies have both legal and ethical obligations to consider these risks. The White Paper on AI Regulation, published in March 2023, reinforced the importance of addressing these risks by including fairness as one of five proposed key regulatory principles to guide and inform the responsible development and use of AI.

Many approaches to detecting and mitigating bias require access to demographic data. This includes characteristics that are protected under the Equality Act 2010, such as age, sex, and race, as well as other socioeconomic attributes.[footnote 2]

However, many organisations building or deploying AI systems struggle to access the demographic data they need. Organisations face a number of practical, ethical, and regulatory challenges when seeking to collect demographic data for bias monitoring themselves, and must ensure that collecting or using such data does not create new risks for the individuals that the data refers to.

There is growing interest in the potential of novel approaches to overcome some of these challenges. These include techniques to generate synthetic training data that is more representative of the demographics of the overall population, as well as a variety of governance or technical interventions to enable more responsible data access.

Access to demographic data to address bias is important for those working across the AI lifecycle, including organisations developing, deploying and regulating AI. This report primarily explores approaches with the potential to assist service providers, i.e. those who are deploying data-driven systems (including AI) to offer a service, to responsibly access data on the demographics of their users to assess for potential bias. This has led us to focus on two contrasting sets of promising data access solutions: data intermediaries and proxies. Of course, these approaches may have relevance to other parties. However, we have not considered in detail techniques such as synthetic generation of training data, which are specifically relevant to developers.

Data intermediary is a broad term that covers a range of different activities and governance models for organisations that facilitate greater access to or sharing of data.[footnote 3] The National Data Strategy identified data intermediaries as a promising area to enable greater use and sharing of data, and CDEI has previously published a report exploring the opportunities they present.

There is potential for various forms of data intermediary to help service providers collect, manage and/or use demographic data. Intermediaries could help organisations navigate regulatory complexity, better protect user autonomy and privacy, and improve user experience and data governance standards. However, the overall market for data intermediaries remains nascent, and to our knowledge there are currently no intermediaries offering this type of service in the UK. This gap may reflect the difficulties of being a first mover in this complex area, where demand is unclear and the risks around handling such data require careful management.

If gathering demographic data is difficult, another option is to attempt to infer it from other proxy data already held. For example, an individual’s forename gives some information about their gender, with the accuracy of the inference highly dependent on context, and the name in question. There are already some examples of service providers using proxies to detect bias in their AI systems.[footnote 4]

Proxies have the potential to offer an approach to understanding bias where direct collection of demographic data is not feasible. In some circumstances, proxies can enable service providers to infer the characteristic that is itself the source of the potential bias under investigation (for example, perceived rather than actual race), which is particularly useful for bias detection.[footnote 5] Methods that draw inferences at higher levels of aggregation could enable bias analysis without requiring service providers to process individually identifiable demographic data.

However, significant care is needed. Using proxies does not avoid the need for compliance with data protection law. Inferred demographic data (and in some cases proxy data itself) will likely fall under personal or special categories of data under the UK GDPR. Use of proxies without due care can give rise to damaging inaccuracies and pose risks to service users’ privacy and autonomy, and there are some cases in which the use of proxies is likely to be entirely inappropriate. Inferring demographic data for bias detection using proxies should therefore only be considered in certain circumstances, such as when bias can be more accurately identified using a proxy than information about an actual demographic characteristic, where inferences are drawn at a level of aggregation that means no individual is identifiable, or where no realistic better alternative exists. In addition, proxies should only be used with robust safeguards and risk mitigations in place.

In the short term, direct collection of demographic data is likely to remain the best option for many service providers seeking to understand bias. It is worth emphasising that, in most circumstances, organisations are able to legally collect most types of demographic data for bias detection provided they take relevant steps to comply with data protection law. Where this is not feasible, use of proxies may be an appropriate alternative, but significant care is needed.

However, there is an opportunity for an ecosystem to emerge that offers better options for the responsible collection and use of demographic data to improve the fairness of AI systems. In a period where algorithmic bias has been a major focus in academia and industry, approaches to data access have received relatively little attention, despite often being highlighted as a major constraint. This report aims to highlight some of the opportunities for responsible innovation in this area.

This kind of ecosystem would be characterised by increased development and deployment of a variety of data access solutions that best meet the needs of service providers and service users, such as data intermediaries. This is one area that CDEI is keen to explore further through the Fairness Innovation Challenge announced in parallel to this report.

However, this is only a partial answer to the genuine challenges in this area. Ongoing efforts by others to develop a robust data assurance ecosystem, ensure regulatory clarity, support research and development, and amplify the voices of marginalised groups are also crucial to enable a better landscape for the responsible use of demographic data.

2. Introduction

2.1 Aims of this publication

Over the last year, CDEI has been exploring the challenges around access to demographic data for detecting and mitigating bias in AI systems, and the potential of novel solutions to address these challenges. Organisations who use AI systems should be seeking to ensure that the outcomes of these systems are fair. However, many techniques for detecting and mitigating bias in AI systems rely on access to data about the demographic traits of service users, and many service providers struggle to access the data they need. Despite this, with a few notable exceptions, the topic has received relatively little attention.[footnote 6]

In this report:

  • Section 1 will set out the main barriers service providers face when seeking to collect demographic data for bias detection and mitigation.
  • Section 2 will highlight the range of potential solutions that could address these barriers, before exploring the potential benefits, challenges and opportunities associated with two promising groups of solutions: data intermediaries and proxies.
  • Section 3 will set out roles for government and other key stakeholders to create a better landscape for responsible access to demographic data.

This report has been informed by the work that CDEI has conducted over the last year, including:

  • A landscape review exploring the challenges surrounding access to demographic data for bias detection and mitigation and the potential of novel solutions to address these challenges.
  • A conjoint analysis study commissioned from Deltapoll to better understand public attitudes to data intermediaries and proxies.
  • A technical study commissioned from Frazer Nash exploring the feasibility of using proxy methods in the UK and the technical challenges they present.
  • Four workshops with legal and ethical experts to explore the risks associated with the use of proxies for bias detection and mitigation and the development of associated mitigations and safeguards.

CDEI is grateful to those who contributed to these workshops, or otherwise contributed to this work.

This report has been published alongside the announcement of CDEI’s Fairness Innovation Challenge. The challenge will provide an opportunity to test new ideas for addressing AI fairness challenges in collaboration with government and regulators. We hope that it will generate innovative approaches to addressing some of the data access challenges described here.

Disclaimer: The information in this report is not intended to constitute legal advice. If you do require legal advice on any of the topics covered by this report, you should seek out independent legal advice.

2.2 Why should service providers address bias in AI systems?

The use of data-driven systems, including AI, is becoming increasingly commonplace across a variety of public and commercial services.[footnote 7] In the public sector, AI is being used for tasks ranging from fraud detection to answering customer queries. Companies in the financial services, technology, and retail sectors also make use of AI to understand customers’ preferences and predict consumer behaviour.

When service providers make use of AI systems in their services or decision-making processes, they can have direct and significant impacts on the lives of those who use these services. As this becomes increasingly commonplace, the risks associated with bias in these systems are becoming a growing concern. Bias in AI systems can lead to unfair and potentially discriminatory outcomes for individuals. In 2020, CDEI published its Review into Bias in Algorithmic Decision-Making, which explored this topic in detail.

In March 2023, the government published the White Paper on AI regulation, which included fairness as one of five key proposed principles that might guide and inform the responsible development and use of AI. The fairness principle states that AI systems should not undermine the legal rights of individuals or organisations, discriminate unfairly against individuals or create unfair market outcomes. The fairness principle considers issues of fairness in a wider sense than exclusively in terms of algorithmic bias, but addressing bias would be a key consideration in implementing it.

In some circumstances, bias in AI systems can lead to discriminatory outcomes. The Equality Act 2010 is the key UK legislation related to discrimination. It protects individuals from discrimination, victimisation and harassment and promotes a fair and more equal society. Age, race, disability, sex, gender reassignment, marriage and civil partnership, pregnancy and maternity, religion or belief and sexual orientation are all protected characteristics under the Equality Act 2010. Where AI systems produce unfair outcomes for individuals on the basis of these protected characteristics and are used in a context in scope of the act (e.g. the provision of a service), this might result in discrimination. Even when protected characteristics are not present in the training data, AI systems still have the potential to discriminate indirectly by identifying patterns or combinations of features in the data, which enable them to infer these protected characteristics from other types of data. As noted in CDEI’s Review into Bias in Algorithmic Decision-Making, ‘fairness through unawareness’ is often not an effective approach.

Service providers must address bias in AI systems to ensure they are not acting unlawfully. Public sector service providers must also have due regard to the need to advance equality of opportunity and eliminate discrimination under the Public Sector Equality Duty (PSED). The Equality and Human Rights Commission (EHRC) has published guidance for public bodies about how the PSED applies when they are using AI, which outlines the need to monitor the impact of AI-related policies and services.

When processing personal data in relation to AI, service providers also have obligations relating to fairness under data protection law. The ICO has produced guidance on how to operationalise the fairness principle in the context of developing and using AI, as well as more targeted guidance for developers.

3. Barriers and risks

Many approaches to detecting and mitigating bias in AI systems require access to demographic data about service users. Demographic data refers to information about the attributes of individuals and groups. This includes characteristics that are protected under the Equality Act 2010, as well as other attributes such as socioeconomic status, geographic location, or other traits that might put people at risk of abuse, discrimination or disadvantage.[footnote 8]

In some cases, service user demographic data might be compared to the datasets used to train a model in order to test whether the training data is representative of the population the model is being deployed on. In other cases, service user demographic data could be used to assess performance across groups, or to standardise outcome measures, in order to identify where a model is treating individuals from different demographic groups differently. Access to good quality demographic data about a service’s users is therefore often a prerequisite to detection, mitigation, and monitoring of bias, and an important first step in the fairness lifecycle.

However, research by CDEI and others has found that service providers currently face a range of legal, ethical and practical challenges in accessing the demographic data they need to effectively detect and mitigate bias in their AI systems. Routine collection of demographic data to improve the fairness of AI is not common practice in either the public or private sectors, except in recruitment.[footnote 9] Without the ability to access demographic data about their users, service providers are severely limited in their ability to detect, mitigate, and monitor for bias, and thereby improve the fairness of their AI systems.

3.1 Barriers

Service providers are faced with a number of barriers when seeking to collect demographic data themselves.

Concerns around public trust

CDEI’s Review into Bias in Algorithmic Decision-Making found that some service providers think that the public do not want their data collected for the purpose of bias monitoring, and may be concerned about why they are being asked for it.

Evidence from public attitudes research that CDEI has conducted suggests that the public’s willingness to share their data for bias monitoring varies depending on the organisation collecting it. Our 2022 Tracker Survey found that 65% of respondents would be comfortable providing the government with demographic data about themselves in order to check if services are fair to different groups. Further research we conducted found that 77% of the public say they are not concerned about sharing their demographic data when applying for a job.

The Tracker Survey also found that individuals were most reluctant to share their data with big technology and social media companies. Some companies have highlighted this as a key challenge, suggesting that commercial organisations may need to provide additional safeguards to demonstrate their trustworthiness.

Navigating regulatory compliance

Most demographic data is also personal, and often special category, data under UK data protection legislation.[footnote 10] This data must be collected, processed and stored in a lawful, fair and transparent manner for specific, explicit and legitimate purposes only.

Service providers must have a lawful basis and meet a separate condition for processing in order to process special category data under the UK GDPR. In some circumstances, data controllers may be required to meet additional terms and safeguards set out in Schedule 1 of the Data Protection Act 2018. Schedule 1 includes a public interest condition around equality of opportunity or treatment, which is satisfied where processing certain kinds of special category data is “necessary for the purposes of identifying or keeping under review the existence or absence of equality of opportunity or treatment between groups of people specified in relation to that category with a view to enabling such equality to be promoted or maintained”. This might provide a lawful basis for organisations to process special category data for bias detection and mitigation without requiring direct consent from individual data subjects.[footnote 11]

Despite this, navigating the existing legal framework to process demographic data for bias detection and mitigation can be complex for service providers. Uncertainty around how equality and data protection law interact in this context can lead to misperceptions about what is or is not permitted under data protection law. The CDEI’s Review into Bias in Algorithmic Decision-Making found that some service providers were concerned that collecting demographic data is not permitted at all under data protection law, or that it is difficult to justify collecting this data, and then storing and using it in an appropriate way.

The ICO recently published guidance to support service providers in navigating data protection law to address bias and discrimination in AI systems.

Data quality

When used for bias detection and mitigation, inaccurate or misrepresentative data can be ineffective in identifying bias or even exacerbate existing biases, particularly when marginalised groups are poorly represented in the data. However, collecting good quality demographic data can be challenging in practice.

Data collected directly from service users is likely to contain at least a degree of inaccuracy due to some users accidentally or intentionally misreporting their demographic traits. In addition, some users may choose to opt out of providing their data, leading to selection bias that results in a dataset that is not representative of service users. This selection bias may particularly impact individuals from groups who have experienced discrimination and marginalisation, who might be less comfortable sharing their data due to concerns about data privacy and misuse.

Data collection expertise

Collecting demographic data from service users requires establishing data collection procedures, and there is a lack of clarity around how service providers should go about doing this. Setting up effective procedures that enable the collection of good quality data may require in-house expertise, which some service providers deploying AI systems, particularly smaller organisations, may lack.

3.2 Risks

Collecting and using demographic data for bias detection can also pose risks to individual service users.

Privacy

Due to the sensitive and personal nature of demographic data, the collection and use of this data exposes individuals to risks of privacy violations. This is particularly problematic given that detecting and mitigating bias requires data on vulnerable and marginalised groups, who may be less comfortable sharing information on their demographic attributes given their disproportionate experiences of discrimination. This has been described by some as a trade-off between ‘group invisibility’ and privacy.

Misrepresentation

When collecting demographic data, service providers have to decide which categories of data to collect and how this data will be disaggregated, and this can be challenging. Demographic categories are not static and tend to evolve over time with societal and cultural change. For example, the Race Disparity Unit recently announced that the government would no longer use the demographic category ‘BAME’ (black, Asian, and minority ethnic), as it obscures meaningful differences in outcomes across ethnic groups. Ensuring that demographic categories remain up-to-date requires that service providers regularly update the data they collect to reflect such changes.

In addition, when demographic categories are imposed on individuals, they risk misrepresenting those who do not identify with them, further disempowering groups who are often already vulnerable and marginalised. There are also ‘unobserved’ demographic characteristics, such as sexual orientation and gender identity, which can be fluid and are challenging to measure.

Data theft or misuse

The collection of demographic data by service providers increases the risk that this data is either stolen or intentionally misused. Cyberattacks by malicious actors could expose individuals to risks of information theft, which could be used for financial gain. Demographic data could also be intentionally misused by ill-intentioned actors for malicious purposes, such as identity theft, discrimination, or reputational damage. Concerns about data misuse may be particularly acute for individuals from demographic groups that have been historically marginalised or discriminated against.

4. Novel approaches

Due to the challenges that organisations face when collecting demographic data themselves, there is growing interest in novel approaches that could address these challenges and enable more widespread and responsible access to demographic data for bias detection and mitigation.

The term ‘access’ is broad, and could involve:

  • Novel methods for generating demographic data, such as proxies and synthetic data.
  • Governance or technical solutions for governing or sharing demographic data, such as data intermediaries, privacy-enhancing technologies (PETs), and data linkage.
  • Combinations of the above.

We have focused particularly on examples where a service provider is offering a service to users, and wants to understand how the outcomes of that service affect different groups. Though our interest in this area is driven by cases where a service or decision-making process is driven by data or AI, similar approaches to gathering data to monitor for potential bias are also relevant in other, non-data-driven contexts (e.g. monitoring the fairness of interview processes in recruitment).

This has led us to focus on two groups of potential approaches that seem applicable: data intermediaries and proxies.

Data intermediary is a broad term that covers a wide range of different stewardship activities and governance models for organisations that facilitate greater access to or sharing of data.[footnote 12] Data intermediaries can reduce risks and practical barriers for organisations looking to access data while promoting data subjects’ rights and interests.

Proxies are pieces of data that are associated with, and could be used in place of, an actual demographic trait. For example, an individual’s postcode could be used as a proxy for their ethnicity or socio-economic status. Though the presence of proxies in algorithmic decision-making systems can be a source of bias, proxy methods could also be applied to infer the demographic characteristics of a service provider’s users to enable bias monitoring.

These two solutions offer contrasting approaches to the challenges surrounding data access, with differing opportunities and limitations. There has been significant interest in the concept of data intermediaries for some time, and there are a growing number of pilots and real-world examples of their use.[footnote 13] Despite this, data intermediaries have still not been widely adopted, nor used to enable access to demographic data for bias detection and mitigation in the UK.

By contrast, proxies offer a relatively simple and implementable alternative to collecting demographic data, but careful consideration of legal and ethical issues is needed if they are to be used. By focusing on these two contrasting approaches, we will explore the range of possibilities in this space, capturing the scope of potential benefits and challenges that novel solutions have to offer.

Some service providers, notably technology companies such as Meta and Airbnb, have started to experiment with these solutions in order to access demographic data to make their AI systems fairer. This experimentation with data intermediaries and proxies as a means of accessing demographic data demonstrates that they are perceived to be promising solutions to address the challenges surrounding data access. However, it also highlights an urgent need to better understand their potential and limitations.

4.1 Data intermediaries

What is a data intermediary?

Data intermediary is a broad term that covers a range of different activities and governance models for organisations that facilitate greater access to or sharing of data.[footnote 14] Data intermediaries can perform a range of different administrative functions, including providing legal and quality assurances, managing transfer and usage rights, negotiating sharing arrangements between parties looking to share, access or pool data, and empowering individuals to have greater control over their data. Intermediaries can also provide the technical infrastructure and expertise to support interoperability and data portability, or provide independent analytical services, potentially using privacy-enhancing technologies (PETs). This range of administrative and technical functions is explored in detail in CDEI’s 2021 report exploring the role of data intermediaries.

In simple terms, for the purposes of this report, a (demographic) data intermediary can be understood as an entity that facilitates the sharing of demographic data between those who wish to make their own demographic data available and those who are seeking to access and use demographic data they do not have.

Some data intermediaries that collect and share sensitive data for research already operate at scale in the UK. One example is Genomics England, a data custodian that collects sensitive data on the human genome, stores it in a trusted research environment, and grants researchers access to anonymised data for specific research projects. Another prominent example is the Office for National Statistics’ Secure Research Service, which provides accredited researchers with secure access to de-identified, unpublished data to work on research projects for the public good. As opposed to facilitating data sharing between service users and providers, these intermediaries provide researchers with access to population-level demographic datasets.

Outside the UK, there are also limited examples of intermediaries being used to steward demographic data specifically for bias auditing. The US National Institute of Standards and Technology (NIST) provides a trustworthy infrastructure for sharing demographic information (including photographs and personally identifying metadata on subjects’ age, sex, race, and country of birth) for the purposes of testing the performance of facial recognition algorithms. By providing researchers with access to sensitive personal and demographic data that enables them to quality-assure algorithms for fairness, NIST has significantly expanded the evidence base on algorithmic bias and helped developers improve the performance of their facial recognition algorithms.

Here we are focused on a different but related set of use cases. Can intermediaries help service providers to access demographic data about the individuals that interact with their service so that they can understand potential biases and differential impacts?

What could a demographic data intermediary look like?

Data intermediaries could play a variety of roles. Here, we primarily consider how they could support the collection, storage, and sharing of demographic data with service providers, though a third-party organisation could also take on an auditing role in certain circumstances.

Intermediary models have emerged in other domains where large numbers of service providers have a need to provide common functions, and doing so in a consistent way is beneficial to consumer trust, user experience and/or regulatory compliance. Examples include:

  • Online authentication, e.g. the use of tech or social media platforms as identity intermediaries to sign into other websites.
  • Online reviews, where companies often use review platforms such as TrustPilot or Feefo, providing a common interface and a level of consumer trust in how reviews are managed.
  • Online payment providers, where smaller vendors often use shared payment platforms that have the scale to achieve compliance with payment industry standards that support consumer trust and regulatory compliance.

Some of the challenges that these intermediaries address are similar in nature to those above, so it is natural to ask whether a similar model could emerge to address the challenges of demographic data access.

Potential benefits

Intermediaries could offer a number of potential benefits over direct collection of data by service providers, including:

  • Expertise in navigating the regulatory landscape: The current legislative framework surrounding the processing of personal data for bias detection and correction can be difficult to navigate. Data intermediaries could help service providers to navigate the regulatory landscape by ensuring that demographic data is collected, stored and shared in compliance with data protection law. This may be particularly helpful for smaller organisations with less legal resource, for whom navigating this landscape may be particularly challenging.
  • Public trust: There is some evidence to suggest that the public may be more comfortable sharing their demographic data with some types of data intermediary than with service providers directly. Public attitudes research commissioned by CDEI indicates that 65% of respondents felt comfortable sharing their demographic data with a third-party organisation to help organisations check the fairness of their systems. Respondents felt most positive about sharing their data with a consumer rights organisation acting as a data intermediary and least positive about a large technology company acting as an intermediary, indicating that the type of organisation acting as a data intermediary significantly affects public attitudes towards sharing demographic data for fairness purposes. Respondents were also more comfortable sharing their data where stronger privacy protections were in place, and had a strong preference for models that could enable them to control access to their own data (e.g. personal data stores).
  • Autonomy and control: Data intermediaries can give users greater autonomy and control over their data, particularly models that include governance mechanisms like data trusts or cooperatives, or enable individual control over access to data (e.g. personal data stores).
  • Privacy: Individual privacy is a key consideration for use of demographic data. However, properly managing privacy risks in large datasets is challenging, requiring appropriate specialist expertise and careful design. Data intermediaries have the potential to help here. For example, an intermediary managing demographic data on behalf of multiple service providers might be well-placed to deploy strong security and separation measures to help mitigate the privacy, security and data misuse risks surrounding the collection and use of demographic data. Use of novel privacy-enhancing technologies (PETs) such as differential privacy or homomorphic encryption could even potentially enable intermediaries to supply access to demographic data, or analysis derived from it, without sharing the underlying sensitive data (a minimal sketch of this idea follows this list).
  • User experience: Data intermediaries could provide a better user experience for service users. Service users could provide their data to a single intermediary, which could then make it available to multiple service providers. This could save users time and effort compared to the current landscape, in which they are asked to provide their data to multiple service providers individually.
  • Capabilities and resource: Data intermediaries are likely to have greater legal and technical capability and resource than individual service providers, enabling them to implement better data governance standards. In addition, data intermediaries that have the ability to provide auditing services themselves could help to ensure more consistent auditing standards across different service providers.
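To illustrate the privacy point above, the following is a minimal sketch of how an intermediary might release only differentially private group-level counts to a service provider, rather than any individual demographic records. The group labels, records and privacy budget are illustrative assumptions, not drawn from any real deployment.

```python
# Illustrative sketch only: an intermediary releasing differentially private
# group-level counts rather than raw demographic records. Group labels, the
# records and the privacy budget are all hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query (sensitivity = 1)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Raw records held only by the intermediary: (demographic group, positive outcome?).
records = [
    ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", True), ("group_b", False),
]

epsilon = 1.0  # privacy budget spent on each released count
released = {}
for group in {g for g, _ in records}:
    positives = sum(1 for g, outcome in records if g == group and outcome)
    total = sum(1 for g, _ in records if g == group)
    # Only noisy aggregates leave the intermediary, never individual rows.
    released[group] = {
        "noisy_positives": round(dp_count(positives, epsilon), 2),
        "noisy_total": round(dp_count(total, epsilon), 2),
    }

print(released)  # the service provider sees approximate group-level figures only
```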

Potential data intermediary models

Various different types of organisations could act as demographic data intermediaries. For example:

  • Commercial organisations could provide data intermediation services, charging service providers to access demographic data.
  • Civil society, consumer rights or data collective organisations could act as trusted third parties with the goal of ensuring fairer outcomes for service users or specific marginalised groups.
  • Existing personal data store providers could also expand their services to collect demographic data with the explicit goal of improving the fairness of AI systems.
  • Existing providers of other intermediary services such as digital ID or independent consumer reviews could potentially offer such services alongside their existing remits.

The type of organisation acting as an intermediary might have some implications for the type of demographic data intermediary service that is offered; given the sensitivity of the data concerned and the variety of different needs, an ecosystem where multiple options are available to service providers and users seems more desirable than a single intermediary service holding large amounts of data and attempting to meet all needs.

There are a variety of different models for the role that intermediaries could play in supporting access to demographic data. To describe these potential roles, we have used the following terms:

  • A service provider is an organisation that is providing a service that they want to assess for bias. Service can be interpreted broadly; examples include public and commercial services (e.g. financial services), but also broader web platforms, recruitment services etc. where a user provides data as part of using a service.
  • A service user is an individual using that service, typically not sharing their demographic data with the service provider as part of that interaction.

There are then two different roles specifically related to demographic data which often do not exist today:

  • Data collection, storage & management
  • Bias audit, i.e. using this data to assess potential bias in a service.

In the different data intermediary models described below, these roles might or might not be played by the same organisation.

Potential model 1

One model could be for a data intermediary to collect and steward demographic data on behalf of a service provider, sharing this with them so they can audit their model for bias.

This could potentially operate as follows:

  • The user interacts with the service provider to access a service.
  • As part of this, they are offered a link to (optionally) provide their demographic data to the intermediary, signing up to some terms of data use in the process.
  • The user provides their demographic data to the intermediary. If they have previously provided such data, e.g. as part of another service, they could alternatively provide permission for that same data to be reused.
  • Demographic data is shared with the service provider to facilitate their bias audit (consistent with data use terms agreed with the user). Various privacy mechanisms are possible for this data exchange (e.g. aggregating demographics across a number of users, or running analysis within a closed environment run by the intermediary and only extracting results), as sketched below.

Diagram depicting indicative relationships and data flows for a data intermediary collecting and managing data on behalf of a service provider.

There are examples of somewhat analogous models being used to enable safe research access to population-level datasets including some demographic data, for example the ONS Secure Research Service or OpenSAFELY. We are not aware of any similar examples targeted at users of individual services, but it is feasible that a suitably trusted organisation could provide a similar kind of service collecting user or customer demographic data, and sharing this with service providers.
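As an illustration of the final step of this flow, here is a minimal sketch of an intermediary joining user-provided demographic data to a service provider’s outcome data on a pseudonymous identifier and returning only group-level results. The column names, identifier and data are hypothetical assumptions for the purposes of the example.

```python
# Minimal sketch of the final step of this model, under assumed column names
# and a hypothetical pseudonymous user ID known to both parties.
import pandas as pd

# Held by the intermediary: demographic data volunteered by users.
demographics = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "ethnic_group": ["A", "B", "A", "B"],
})

# Held by the service provider: outcomes of the data-driven service.
outcomes = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "approved": [True, False, True, False],
})

# The join and aggregation happen inside the intermediary (or a closed analysis
# environment it runs), so the service provider never sees individual
# demographic records.
joined = outcomes.merge(demographics, on="user_id")
approval_rates = joined.groupby("ethnic_group")["approved"].mean()

print(approval_rates)  # only these group-level rates are returned to the service provider
```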

Potential model 2

Beyond collection and management of demographic data, there is a growing ecosystem of AI assurance service providers seeking to provide independent bias audits using such data. A potential variant of the intermediary model described above is for such a bias audit provider to also act as an intermediary collecting the data.

In this model, the intermediary acts as custodian of users’ demographic data, collecting and storing it in a secure environment (as in model 1), but then also auditing service providers’ models without ever giving them access to the data itself.

This provides an additional layer of separation between the service provider and the demographic data about their customers as compared to the previous model, i.e. they receive only results of an audit, not any demographic data itself. For this model to be feasible, a service provider will typically need to provide additional internal service-related data to the bias audit provider, and therefore is likely to require appropriate legal and technical protection for personal data and intellectual property contained within it.

Diagram depicting indicative relationships and data flows for a data intermediary that also acts as a bias audit provider.

We have identified a couple of real-world examples of similar approaches in the context of fairness:

  • Demographic testing as part of the US National Institute of Standards & Technology Face Recognition Vendor Test programme.
  • Meta has developed an approach involving multiple third-party custodians who perform de-identified computations and return these to Meta in an encrypted format, enabling the company to test for racial bias across their products and services without ever accessing the identifiable demographic data themselves.

Potential model 3

As an alternative variant of model 1, there are a variety of intermediary models that seek to give users stronger personal control over how their data is used, variously referred to as personal data stores (PDSs), personal information management systems (PIMS) or data wallets.

Each of these offers a form of decentralised store of an individual’s personal data controlled by that individual. There are several such platforms already in existence, including commercial companies such as World Data Exchange, community interest companies like MyDex, cooperatives like MiData, and open specifications such as SOLID.

In this context, such platforms could allow individual data subjects to manage and maintain their own demographic data and share it with service providers for bias audit and mitigation on their own terms.

Diagram depicting a service user collecting and managing their own demographic data in a personal data store, and sharing it with a service provider.

Common features

These contrasting models demonstrate the variety and breadth of data intermediaries that could support access to demographic data for bias detection and mitigation. They are not exhaustive or mutually exclusive, and their features could be changed or adapted. It is unlikely that one solution will suit every sector and group of stakeholders, and an ecosystem offering a combination of different demographic data intermediary types could be the most efficient and effective way to support the responsible use of demographic data for bias monitoring.

There are additional legal mechanisms and technical interventions that could be integrated into any of these models to provide additional protections for service users who share their data. Novel data governance mechanisms could provide service users with more autonomy over how their demographic data is used. These include data trusts (mechanisms for individuals to pool their data rights into a trust in which trustees make decisions about data use on their behalf) and data cooperatives, in which individuals can voluntarily pool their data and repurpose it in their interests. While such mechanisms have been proposed by academics for some time, there have recently been a number of schemes to pilot them in real-world settings. These include the Data Trusts Initiative’s pilot projects, the ODI’s data trusts pilots, and the Liverpool City Region’s Civic Data Cooperative. Pilots like these indicate a shift towards the development and use of novel data governance mechanisms in practice.

Data intermediaries could also integrate the use of technical interventions like PETs to provide stronger privacy and security protections. Large technology companies like Airbnb and Meta have experimented with the use of third parties to access demographic data using privacy-preserving techniques, including secure multi-party computation and p-sensitive k-anonymity, to better protect the privacy of their users.

Barriers and risks

Despite offering a range of potential benefits, such an ecosystem of data intermediaries has not yet emerged. To the best of our knowledge, there are currently no intermediaries providing services specifically designed to support service providers to access demographic data from their users to improve the fairness of their AI systems in the UK.

Our work suggests that the potential of data intermediaries to enable access to demographic data is constrained by a range of barriers and risks.

The absence of organisations offering this type of service suggests that there is not sufficient incentive for such data intermediaries to exist. Incentives might be commercial (i.e. confidence that offering such a service would be a viable commercial proposition), but might also be broader, for example an opportunity for a third sector organisation to support fairness.

What drives this absence? Demand among service providers and users for third-party organisations sharing demographic data is unclear. Given the relative immaturity of the market for data intermediaries, there may be a lack of awareness about their potential to enable responsible access to demographic data. In addition, the incentives driving data sharing to monitor AI systems for bias are primarily legal and ethical as opposed to commercial, meaning demand for demographic data intermediation services relies on service providers’ motivation to assess their systems for bias, and service users’ willingness to provide their data for this purpose.

More broadly, the market for many kinds of data intermediary is still relatively nascent. In the EU, the 2022 Data Governance Act introduced new regulations for the ‘providers of data intermediation services’, requiring them to demonstrate their compliance with conditions placed on their economic activities. The UK government acknowledged in the National Data Strategy Mission 1 Policy Framework that there is currently no established market framework for the operation of data intermediaries in the UK and has committed to support the development of a thriving data intermediary ecosystem that enables responsible data sharing. The lack of commercial incentives for data intermediaries sharing demographic data, combined with the challenges of operating in this complex area, has created little impetus for first movers.

In addition, service providers and users must have confidence that data intermediaries are trustworthy before they will use their services to share sensitive data. Our public attitudes research indicates that one of the most common concerns members of the public have around data intermediaries is that third parties are not sufficiently trustworthy.

Non-regulatory approaches, such as data assurance, could help to build confidence in data intermediaries and demonstrate their trustworthiness. The ODI defines data assurance as “the process, or set of processes, that increase confidence that data will meet a specific need, and that organisations collecting, accessing, using and sharing data are doing so in trustworthy ways”. Research by Frontier Economics suggests the data assurance sector in the UK is nascent but growing, with approximately 900 firms currently offering a range of different data assurance products and services in the UK.

Standards could provide one way to encourage consistent data governance and management across demographic data intermediaries. This could include adoption of mature and commonplace standards such as ISO/IEC 27001, as well as other relevant data standards.[footnote 15] In addition, a number of relevant certification and accreditation schemes already exist, such as CoreTrustSeal, FairData and the MyData Global Operator Award. These could help data intermediaries sharing demographic data to demonstrate their adherence to data protection and ethical standards.

Despite the burgeoning ecosystem for data assurance in the UK, work to understand how assurance products and services could demonstrate the trustworthiness and support the uptake of data intermediaries is in its early stages. No one standard-setting body or certification scheme can cover all the areas required to effectively assure data intermediaries and, given their diversity, a one-size-fits-all approach is unlikely to be appropriate. For this reason, a greater understanding is needed of how existing assurance products and services can demonstrate the trustworthiness of data intermediaries, how well these can meet the needs of different stakeholders, and where there may be remaining gaps in the data assurance ecosystem. This could support third parties sharing demographic data to demonstrate their trustworthiness, and encourage uptake among service providers and service users.

Of course, intermediaries gathering sensitive data of this nature must contend with many of the same challenges that a service provider would have managing the same data. Where data intermediaries are collecting and storing sensitive demographic information about service users, they still need to take steps to minimise the risk that personal data is stolen or intentionally misused; our public attitudes research found that data theft and misuse was a common concern among respondents in relation to data intermediaries. In addition, any third party collecting demographic data must ensure the data they collect is good quality. Much like service providers seeking to collect this data themselves, third parties must contend with similar challenges around data accuracy and representativeness.

Conclusions

Data intermediaries hold real promise as a means to enable responsible access to demographic data for bias detection and mitigation. They could promote the collection and use of demographic data in ways that support the regulatory compliance of service providers, protect service user privacy and autonomy, and elicit public trust, while providing a better user experience and higher standards than service providers collecting this data themselves.

Despite this potential, a market for such services is yet to emerge. We have discussed some of the barriers to this above, but for a service provider that could benefit from this approach, the absence of third parties offering such services in the UK prevents this being a straightforward option at present.

Longer term, there remains clear potential for intermediaries to play a useful role. There is a need to pilot potential solutions in this area to support the development of the market for data intermediaries, and to demonstrate the opportunities to service providers and users that might use them. This is one area that CDEI hopes to explore further in the Fairness Innovation Challenge announced in parallel to this report.

4.2 Proxies

What are proxies?

Many organisations hold a range of data about individuals that they provide services to. However, for the reasons discussed above, they often do not hold some or all of the demographic data that they need to audit their own systems and processes for bias.

In contexts where collecting this demographic data directly is hard, an alternative is to infer it from data that is already held, using proxies for the demographic traits of interest.

Proxies can be a source of discrimination in AI systems where algorithms are able to deduce protected characteristics from relevant data points. For example, most insurance pricing models include postcode as a factor for a variety of valid reasons. However, the mix of ethnic groups varies significantly between different postcode areas, and there is a risk that insurance pricing models indirectly treat individuals from certain ethnicities differently via the proxy of their postcode. This is one reason why monitoring for potential bias is increasingly important as AI systems become more complex and sophisticated.

Conversely, there is potential for proxies to be used to detect and address bias in AI systems. Using data they already hold as proxies, service providers could infer the demographic traits of their service users, and use this data to detect bias in their AI systems.

Examples of this include:

  • Forename as a proxy for gender.
  • Postcode as a proxy for socio-economic status or ethnic group.
  • Perceived ethnicity of a facial image as a proxy for the ethnic group of the individual.

The inferences made about demographic traits from such proxy data will inevitably not be fully accurate, and whether the accuracy achieved is enough to be practically useful for bias monitoring will depend on both the proxy data that is available and the use case.

Proxies raise a number of challenging ethical and legal concerns. These are discussed in more detail below.

Existing proxy methods and tools

There are a wide range of proxy methods and tools in existence, varying from relatively simple methods to more complex machine learning approaches. These can be used to infer different demographic attributes, although ethnicity and gender have been the most common target variables to date. Many proxy methods and tools involve inferring the demographic traits of identifiable individuals (see Example 2 below). However, some approaches avoid this by using personas (see Example 1 below) or by drawing group inferences in a way that ensures individuals are not identifiable.[footnote 16]

Many of the most popular proxy methods and tools have been developed in the US, although some have been trained on large datasets spanning multiple geographies. These methods and tools vary in their accessibility to service providers, with some available open source and other commercial tools requiring payment for access.

Some of these methods and tools were developed specifically to assess bias and discrimination, such as RAND’s Bayesian Improved Surname Geocoding (BISG). In recent years, a few prominent technology companies including Meta and Airbnb have begun to pilot more advanced, privacy-preserving proxy methods with the explicit aim of generating demographic data to improve the fairness of their AI systems.
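For illustration, the following is a minimal sketch of the Bayesian updating idea behind surname-geocoding methods such as BISG. All probabilities are invented for the example; real implementations rely on census-derived surname and area tables, and this is not RAND’s implementation.

```python
# Sketch of the Bayesian updating idea behind surname-geocoding methods such as
# BISG. All probabilities are invented for illustration; real implementations
# use census-derived surname and area tables.

# P(group | surname), e.g. from surname frequency tables (illustrative values).
p_group_given_surname = {"group_a": 0.70, "group_b": 0.20, "group_c": 0.10}

# P(group | postcode area), e.g. from area-level population statistics (illustrative).
p_group_given_area = {"group_a": 0.30, "group_b": 0.50, "group_c": 0.20}

# National prior P(group) (illustrative).
p_group = {"group_a": 0.50, "group_b": 0.35, "group_c": 0.15}

# Assuming surname and area are conditionally independent given the group,
# Bayes' rule gives P(group | surname, area) proportional to
# P(group | surname) * P(group | area) / P(group).
unnormalised = {
    g: p_group_given_surname[g] * p_group_given_area[g] / p_group[g]
    for g in p_group
}
total = sum(unnormalised.values())
posterior = {g: round(v / total, 3) for g, v in unnormalised.items()}

print(posterior)  # a probability distribution over groups, not a definitive label
```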

The examples below provide three contrasting approaches to using proxies, demonstrating the breadth of possibilities for their use to enable bias monitoring.

Example 1: Citizens Advice using name and postcode to infer ethnicity

In 2022, Citizens Advice conducted exploratory research to better understand whether people from ethnic minority backgrounds experience worse outcomes in the car insurance market than white consumers. To measure this, they conducted mystery shopping using 649 personas that varied by name and postcode, comparing the prices paid by shoppers with names that are common among people from different ethnic backgrounds and postcodes with different proportions of ethnic minority communities in the population.

They found no significant difference in prices charged to people with different names in the same postcode area. However, average quotes were higher in areas where black or South Asian people make up a large proportion of the population, and this could not be explained by common risk factors such as crime rates, road accidents or levels of deprivation in the area.

By using personas, Citizens Advice was able to assess a service for bias without requiring access to the personal data of service users. Although their methodology allowed them to test the outcomes of pricing mechanisms, Citizens Advice acknowledge that it cannot explain exactly why the outcomes they identified occurred.
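The sketch below illustrates the general shape of this kind of persona-based comparison: quotes gathered for personas that differ by name group and by the demographic mix of their postcode area are averaged along each dimension. The figures and groupings are invented for illustration and do not reproduce Citizens Advice’s data or methodology.

```python
# Illustrative sketch of a persona-based comparison. The quote values and
# groupings are invented and do not reproduce Citizens Advice's data or
# methodology.
import statistics
from collections import defaultdict

# Each entry: (persona name group, demographic mix of postcode area, quoted price in GBP).
quotes = [
    ("name_group_1", "low_minority_share_area", 540),
    ("name_group_2", "low_minority_share_area", 545),
    ("name_group_1", "high_minority_share_area", 610),
    ("name_group_2", "high_minority_share_area", 605),
]

by_name_group = defaultdict(list)
by_area_type = defaultdict(list)
for name_group, area_type, quote in quotes:
    by_name_group[name_group].append(quote)
    by_area_type[area_type].append(quote)

# Compare average quotes along each dimension, without using any real customer data.
print({k: statistics.mean(v) for k, v in by_name_group.items()})
print({k: statistics.mean(v) for k, v in by_area_type.items()})
```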

Example 2: Airbnb’s Project Lighthouse using first name and photos of faces to infer perceived race

Airbnb’s Anti-Discrimination product team have developed a privacy-by-design approach to infer the perceived race of their service users using their first name and an image of their face. By measuring inequities on the basis of perceived race, they aimed to account for the fact that discrimination often occurs because of people’s perceptions of one another’s race as opposed to their actual race.

The team sent a k-anonymised version of this service user data to a research partner organisation, which was under a confidentiality agreement and had its systems reviewed by Airbnb security. The research partner assigned perceived race to service users and this data was returned to Airbnb, which perturbed the data to achieve a level of p-sensitivity (i.e. ensuring that each equivalence class in the dataset had at least p distinct values for a sensitive attribute) before storing it. Finally, this p-sensitised, k-anonymised dataset was used to measure the acceptance rate gap between different perceived racial groups.

By including a research partner and making careful use of privacy techniques, Airbnb’s approach enables them to analyse whether hosts exhibit bias on the basis of perceived race while protecting the privacy of service users.
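To make the two privacy properties mentioned above concrete, here is a minimal sketch of checking k-anonymity and p-sensitivity on a toy dataset. The quasi-identifiers, sensitive values and thresholds are hypothetical, and this is not Airbnb’s implementation.

```python
# Minimal sketch of checking k-anonymity and p-sensitivity on a toy dataset.
# The quasi-identifiers, sensitive values and thresholds are hypothetical;
# this is not Airbnb's implementation.
from collections import defaultdict

# Each record: (quasi-identifiers, sensitive attribute).
records = [
    (("market_1", "tenure_0_1yr"), "perceived_group_a"),
    (("market_1", "tenure_0_1yr"), "perceived_group_b"),
    (("market_1", "tenure_0_1yr"), "perceived_group_a"),
    (("market_2", "tenure_2_5yr"), "perceived_group_c"),
    (("market_2", "tenure_2_5yr"), "perceived_group_c"),
    (("market_2", "tenure_2_5yr"), "perceived_group_a"),
]

def satisfies_k_and_p(rows, k: int, p: int) -> bool:
    """k-anonymity: every equivalence class (same quasi-identifiers) contains at
    least k records. p-sensitivity: every class contains at least p distinct
    values of the sensitive attribute."""
    classes = defaultdict(list)
    for quasi_ids, sensitive in rows:
        classes[quasi_ids].append(sensitive)
    return all(len(vals) >= k and len(set(vals)) >= p for vals in classes.values())

print(satisfies_k_and_p(records, k=3, p=2))  # True for this toy dataset
```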

Example 3: NamSor using first name and surname to infer gender and ethnicity

NamSor is a commercial product which uses machine learning to infer ethnicity and gender from first names and surnames. Namsor SAS, the company that owns the product, suggests it can be used to measure gender or ethnic biases in AI-driven processes, and they offer a range of tools to suit different customers, including API documentation, CSV and Excel file analysis, and developer tools.

NamSor has processed over 7.5 billion names and is continually maintained with new training data. The company claims it is the most accurate ethnicity and gender inference service in the world. One independent, comparative study supports this claim, suggesting the tool achieves an F1 score of 97.9%.

Potential benefits

There are a range of reasons why a service provider or technology developer might be motivated to use proxies rather than collect data directly.

  • Utility: In some cases, a proxy may be more relevant and useful for bias detection than the demographic trait itself. For example, Airbnb worked with a third-party organisation that inferred perceived race from images of the faces of service users (i.e. guests). In this case, the perceived race of service users was arguably more relevant than their actual race, as it is hosts’ perception of guests’ race that was the source of bias under investigation.
  • Convenience for service providers: Proxy methods and tools rely on the use of relevant data that service providers may already hold. This can eliminate the time and resource service providers would otherwise have to spend collecting demographic data directly from users, which could make them a relatively convenient approach. However, service providers will still be required to comply with relevant data protection obligations, such as the purpose limitation principle.
  • User experience: Using proxies enables service providers to make use of data they already hold, eliminating the need for service users to share their demographic data directly with numerous different service providers who request it. This can enable the generation of demographic data for bias detection and mitigation in a way that is relatively unburdensome for service users.
  • Data quality: If asked to share their sensitive data directly, service users may provide inaccurate information or not provide their data at all, resulting in a poor quality demographic dataset that has low utility for bias detection purposes. Where service providers already hold relevant datasets of good quality, using proxies could enable them to produce inferences that are statistically accurate and more representative than the data they are able to collect directly from their users, enabling them to conduct more accurate analysis of potential biases.

Barriers and risks

Despite some potential benefits, the use of proxies presents a number of legal and ethical risks, and practical challenges.

Legal risk

Most demographic data inferred through the use of proxies is likely to be classified as personal or special category data under the UK GDPR, and must be processed in accordance with data protection legislation.

The ICO’s guidance on the legal status of inferred data states that whether an inference counts as personal data depends on whether it relates to an identified or identifiable individual. It may also be possible to infer or guess details about someone which fall within the special categories of data.[footnote 17] Whether or not this counts as special category data will depend on the specific circumstances of how the inference is drawn.

Given that the use of proxies to generate demographic data for bias detection involves the intentional inference of relevant information about an individual, proxy methods will likely involve the processing of special category data, regardless of whether these inferences are correct or not.

Where proxies are used to infer demographic traits at a higher level of aggregation, such that inferences are drawn only about so-called ‘affinity groups’ and not specific individuals, the ICO states that these inferences may also count as personal data depending on how easy it is to identify an individual through group membership. When using proxy methods to draw group inferences, service providers should still comply with the data protection principles, including fairness.

The use of proxies may pose additional legal risks for service providers where they are unaware of their legal obligations with respect to inferences or find them difficult to interpret and apply in practice.

Accuracy

Proxies can generate inaccurate inferences which can obscure or even exacerbate bias in AI systems when used for bias detection and mitigation. Our public attitudes research suggests the accuracy of proxy methods is a key concern for members of the public. There are a number of distinct issues related to the accuracy of proxies.

  • Reliability: Our research commissioned from Frazer Nash found that reported accuracy rates for many proxy methods are unreliable, as they are often self-reported and not performed on the same dataset. Where a proxy method is applied to a specific dataset to produce inferences for bias detection, it may exhibit lower accuracy rates than those reported by developers.
  • Variability: Accuracy rates can also obscure poorer performance for minority demographic groups, who are often those who have been historically marginalised and are most at risk from algorithmic bias. For example, one independent, comparative study of gender detection tools found that all three tools it assessed inaccurately predicted the gender of individuals with Chinese given names in Pinyin format. The use of such tools by UK service providers may therefore exacerbate existing biases against ethnic minorities. Research commissioned from Frazer Nash found that for demographic traits with a greater number of categories (e.g. ethnicity), the risk that proxy methods are less accurate for minority demographics increases. Variable accuracy rates across different groups are particularly problematic for intersectional bias, as the demographic traits of individuals who fall into multiple marginalised groups are least likely to be correctly inferred and accurately represented in the data.
  • Performance metrics: Variation in model performance across demographic groups is often obscured by the choice of performance metric. Research commissioned from Frazer Nash found that classification accuracy is a commonly used performance metric, but it does not reveal the frequency with which a proxy method returns false positives or false negatives, despite this having a potentially significant impact on outcomes when the method is used in a particular context (see the sketch after this list). In some cases, a model with lower classification accuracy may perform better when used to infer demographic data for bias detection purposes than a method that scores more highly.
  • Concept and model drift: The accuracy of any proxy method is also likely to change over time as the UK’s demographic landscape shifts. Increasing multiculturalism, for example, will alter ethnicity distributions over time, meaning the ethnicity data used to develop a method can become outdated. Methods used to infer demographic traits that are more fluid and heterogeneous are likely to be particularly vulnerable to these kinds of drift. For a gender-prediction model using images of faces as inputs, for example, a growing trend towards men wearing more make-up (which the model may historically associate with women) could lead to poorer performance over time. Some tools, such as NamSor and Gender API, account for model drift through regular retraining on continuously updated datasets, but others do not. However, models which learn continuously require greater investment of time and resource, and may be more difficult to assure due to increased security risks. In practice, the difficulty of predicting and accounting for these kinds of demographic shifts may result in service providers relying on outdated and inaccurate inferences.
  • Unobservable traits: There are some demographic traits which are likely to be inaccurate and inappropriate to infer using proxy methods. Traits that are ‘unobservable’, such as sexual orientation, are particularly challenging and controversial to infer using other data sources.
  • Legal risk: Inaccuracies in the inferences drawn from proxies may also have legal implications, as improving the statistical accuracy of an AI system’s outputs is one factor to be considered within the fairness principle in data protection law. The ICO’s Guidance on AI and Data Protection provides further guidance in this area.
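To illustrate the variability and performance metrics points above, the following sketch uses entirely made-up predictions to show how a respectable headline accuracy can mask a very high false negative rate for a smaller demographic group. The groups, counts and function are assumptions for illustration only.

```python
# Illustrative only: per-group error rates versus a single headline accuracy figure.
from collections import defaultdict

def per_group_error_rates(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, truth, predicted in records:
        s = stats[group]
        if truth and predicted:
            s["tp"] += 1
        elif not truth and predicted:
            s["fp"] += 1
        elif truth and not predicted:
            s["fn"] += 1
        else:
            s["tn"] += 1
    report = {}
    for group, s in stats.items():
        total = sum(s.values())
        report[group] = {
            "accuracy": (s["tp"] + s["tn"]) / total,
            # False positive rate: the trait is wrongly inferred.
            "fpr": s["fp"] / (s["fp"] + s["tn"]) if (s["fp"] + s["tn"]) else None,
            # False negative rate: the trait is missed by the proxy.
            "fnr": s["fn"] / (s["fn"] + s["tp"]) if (s["fn"] + s["tp"]) else None,
        }
    return report

# Toy data: the majority group is predicted well, the minority group is not.
records = (
    [("group_a", True, True)] * 90 + [("group_a", False, False)] * 90
    + [("group_a", True, False)] * 10 + [("group_a", False, True)] * 10
    + [("group_b", True, True)] * 2 + [("group_b", True, False)] * 8
    + [("group_b", False, False)] * 10
)
overall_accuracy = sum(1 for _, truth, predicted in records if truth == predicted) / len(records)
print(f"overall accuracy: {overall_accuracy:.2f}")  # ~0.87 despite group_b's 0.8 false negative rate
print(per_group_error_rates(records))
```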

Privacy

The use of individual-level proxies may interfere with service users’ privacy as they reveal personal information about them. Privacy was a key concern about proxies among participants in our public attitudes study.

The inference of some demographic traits may not interfere with privacy much, if at all. However, information relating to more sensitive demographic categories, which form part of an individual’s private life, could seriously intrude on the privacy of service users. This is supported by evidence from the public attitudes study, which found that members of the public are more comfortable with organisations inferring their age than other demographic traits, such as disability status or sexuality. The sensitivity of demographic traits may also be compounded by other contextual factors, such as the individual’s attributes (e.g. if they are a child or otherwise vulnerable) or their circumstances (e.g. if they live in a homophobic environment).

Transparency and user autonomy

The use of proxies to infer demographic data is inherently less visible to service users than collecting demographic data directly from them. The low visibility of proxy use raises concerns around transparency and service users’ autonomy. When processing personal or special category data for bias monitoring, service providers have obligations related to transparency under the UK GDPR. The ICO has provided guidance on the right to be informed, which is a key transparency requirement under the UK GDPR.

Public trust

Proxies are a controversial topic, and the public appear to be less comfortable with their use than with providing their data to a third party. Our public attitudes study indicated that only 36% of respondents were fairly comfortable with the use of proxies, and 23% were uncomfortable. Levels of public comfort varied depending on the type of proxy, the target demographic trait, and the type of organisation using the proxies. Members of the public were particularly concerned about their privacy, the accuracy of the inferences, and the risks of data misuse.

Accessibility

The use of proxy methods relies on access to relevant proxy data. The type of proxy required will vary depending on the target variable but could include service user postcodes, names, social media data, or facial photographs. Some of this data may already be held by service providers but some may not. The accessibility of proxy data will place limitations on the applicability of different proxy methods.

Data quality

When used for bias detection, poor quality data can be ineffective at revealing biases, or can even introduce new ones, particularly when marginalised groups are poorly represented in the data. The ability to draw inferences that are useful for bias detection purposes therefore relies on access to good quality proxy data.

Where service providers do hold data that could be used to infer demographic traits of interest, this data may be incomplete or inaccurate. Where poor quality proxy data is used to infer demographic information about service users, it will produce poor quality inferences. This raises related concerns around compliance with the accuracy principle under data protection law, which applies to input as well as output data.

Using proxies responsibly

Proxies offer an alternative approach to accessing demographic data for bias detection and mitigation. Proxies can be a practical approach to bias detection for service providers who already hold relevant data, and can prevent the need for service users to provide their demographic data numerous times to different organisations. In some circumstances, proxies may be the best way for service providers to effectively analyse their AI systems for bias, particularly where the proxy is more helpful in identifying bias than the demographic trait itself. Methods that rely on personas or group inferences at a level of aggregation such that individuals are not identifiable may pose few privacy risks to individual service users.

Despite this, the use of proxies poses a number of legal and ethical risks, as well as practical challenges. There are some cases in which the use of proxies is likely to be entirely inappropriate and should be avoided. Other methods, although not illegal, will likely involve the processing of special category data, which may entail legal risk for service providers. In addition, proxies can give rise to damaging inaccuracies and pose challenges to the privacy and autonomy of service users, and members of the public appear to be less comfortable with their use than other data access solutions.

Proxies are therefore likely to be a viable solution to enable access to demographic data for bias detection only in certain circumstances, such as when bias can be more accurately identified using a proxy than information about an actual demographic characteristic, or where inferences are drawn at a level of aggregation that means no individual is identifiable. In addition, proxies should only be used with robust safeguards and risk mitigations in place.

Here, we set out the key ethical issues for service providers to consider when seeking to use proxies for bias detection and mitigation. Alongside these ethical considerations, service providers using proxies should consider their legal obligations by referring to the ICO’s Guidance on AI and Data Protection, including Annex A ‘Fairness in the AI Lifecycle’.

Step 1: Establish a strong use case for the use of proxies as opposed to other alternatives

This step is central to ensuring the ethical use of proxy methods, and helps service providers to rule out the use of proxies where a reasonable, less intrusive alternative exists.

There are certain demographic traits for which the use of proxies is not advisable. In particular, where service providers wish to test the system for bias relating to demographic traits that are unobservable, such as sexual orientation, they should seek an alternative approach.

However, there are a limited number of scenarios in which the use of proxies to address bias may be justifiable. These include:

  • Where the inference is more relevant or useful in detecting bias than the demographic trait itself. For example, where perceived ethnicity is the source of bias in the AI system as opposed to ethnicity itself.
  • Where the method does not involve the processing of personally-identifiable information, for example where proxies relate to personas or groups of individuals and no specific individual can be identified.
  • Where no better alternative to access demographic data exists.

The strength of these justifications should be weighed up in light of the risk that the AI system in question is biased, and the severity of the real-world impact of this bias. To make this assessment, knowledge of the context in which the AI system is being deployed is critical, and service providers should engage with civil society organisations and affected groups in determining whether using proxies is appropriate in any given use case.

Service providers should also refer to the ICO’s Guidance on AI and Data Protection at this stage to establish that their proposed use of proxies is lawful.

Step 2: Select an appropriate method and assess associated risks

If a strong case for the use of proxies as opposed to other alternatives has been established, service providers need to select an appropriate proxy method and assess the risks and trade-offs associated with its use. There are a number of commercial tools and open source methods available to service providers. A non-exhaustive list of some methods that are applicable in the UK context can be found in the technical report by Frazer Nash.

When selecting a method, service providers should consider:

  • Determining whether the method or tool involves drawing inferences about personally-identifiable individuals. ICO guidance can help to identify where inferences may qualify as personal or special category data under UK GDPR.
    • Where the method or tool involves drawing inferences about personally-identifiable individuals, service providers should refer to ICO guidance to ensure their processing of personal or special category data complies with data protection law.
    • Service providers should also think carefully as to whether risks to individuals can be mitigated. They should consider whether they have the ability to make use of privacy safeguards, such as privacy-preserving techniques (see ‘Design and develop safeguards’ below). They should also consider whether the granularity of the inferences generated by the proxy method or tool is necessary in light of the specific use case and requirement.
  • Testing the performance of proxy methods by conducting an independent review using a representative dataset to determine which may be most appropriate to use (a minimal sketch of this kind of review follows at the end of this step).

  • Looking at historic data and current social and cultural trends to make predictions about likely model drift, and consider its implications for the need for model retraining or continuous learning.

  • Assessing whether available information about the proxy method or tool enables them to use it confidently and to be meaningfully transparent about how they’re using it.

Alongside these considerations, service providers need to assess the feasibility of using the method or tool, including factors such as the availability and cost of the method or tool, the availability and quality of proxy data, and available resources and expertise within the organisation.

Service providers should also consider conducting a risk assessment to assess the risks and trade-offs associated with the use of this method in the specific context they intend to use it in. They should also carefully consider the limitations of the method or approach they have chosen, and whether there are further actions they can take to overcome these limitations.
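As a minimal sketch of the kind of independent review described in the first bullet above, under assumed data and methods, the example below evaluates two hypothetical proxy methods against the same labelled, representative dataset and reports accuracy per demographic group rather than as a single headline figure. A real review would involve commercial tools or trained models and substantially larger datasets.

```python
# Illustrative only: comparing two hypothetical proxy methods on the same
# labelled evaluation set, with accuracy broken down by demographic group.
from collections import defaultdict

def evaluate(method, labelled_records):
    """labelled_records: iterable of (proxy_input, true_trait, group) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for proxy_input, true_trait, group in labelled_records:
        total[group] += 1
        if method(proxy_input) == true_trait:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# Hypothetical candidate methods (deliberately crude, for illustration only).
def method_a(name):
    return "female" if name.endswith("a") else "male"

def method_b(name):
    return "female" if name.lower() in {"priya", "amina", "sofia"} else "male"

evaluation_set = [
    ("priya", "female", "group_1"),
    ("joshua", "male", "group_1"),
    ("amina", "female", "group_2"),
    ("nikita", "male", "group_2"),  # ends in "a" but labelled male
]

for label, method in (("method_a", method_a), ("method_b", method_b)):
    print(label, evaluate(method, evaluation_set))
```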

Step 3: Design and develop robust safeguards and risk mitigations

If an appropriate method is chosen and the risks and limitations of this method have been identified, service providers should consider the development of risk mitigations and safeguards, including:

  • Measures to ensure model accuracy, such as regular monitoring of model performance and retraining or revalidation of the model at appropriate intervals.

  • Transparency measures aimed at service users, including:
    • Active communication of privacy information.
    • Providing a time lag between notification that an inference will be made and actually making the inference.
    • Providing an opportunity for service users to ‘opt-out’.
    • Communicating the inference itself to the data subject.
    • Following existing guidance from supervisory authorities on the use of layered privacy notices, privacy dashboards, and just-in-time notices.
  • Transparency measures aimed at the broader public, including:
    • Meaningful public disclosure of impact assessments.
    • Consultation, co-design, and/or testing of approach to using proxies with civil society organisations representing affected groups.
    • Publishing blogs or papers explaining the choice to use proxies for bias detection, its benefits and risks for service users, and the safeguards in place.
  • Privacy-preserving techniques:
    • Select techniques that can be integrated with the chosen method. CDEI’s interactive PETs adoption guide and the ICO’s draft guidance on PETs could both help service providers to assess which techniques are most suitable.
    • Conduct experiments to find an appropriate level of noise or blur, to address the risk that noise or blurring undermines the usefulness of the model (a minimal sketch of such an experiment follows this list).
    • Some privacy-preserving techniques also carry a risk of re-identification if masked data is reverse engineered. This can be addressed through the use of blurring techniques, and service providers can apply the motivated intruder test to check for risks of re-identification.
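The noise experiment described in the list above could, for instance, look like the following sketch: Laplace noise (the mechanism commonly used in differential privacy) is added to aggregated acceptance counts at several noise levels, and the resulting distortion of the group acceptance rates is measured. The group counts and epsilon values are illustrative assumptions, not recommendations.

```python
# Illustrative only: measuring how different noise levels distort aggregated
# acceptance rates, to find a level that protects privacy without destroying utility.
import random

def laplace_noise(scale):
    # The difference of two exponential samples with the same scale is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_acceptance_rate(accepted, total, epsilon):
    # A counting query has sensitivity 1, so the Laplace noise scale is 1/epsilon.
    noisy_accepted = accepted + laplace_noise(1 / epsilon)
    return min(1.0, max(0.0, noisy_accepted / total))

true_counts = {"group_a": (160, 200), "group_b": (70, 100)}  # (accepted, total)
for epsilon in (0.1, 0.5, 1.0, 5.0):  # smaller epsilon = more noise, stronger privacy
    errors = []
    for accepted, total in true_counts.values():
        true_rate = accepted / total
        errors.append(abs(noisy_acceptance_rate(accepted, total, epsilon) - true_rate))
    print(f"epsilon={epsilon}: mean absolute error in rates = {sum(errors) / len(errors):.3f}")
```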

No set of safeguards will entirely eliminate the risks associated with the use of sensitive data and there will always be a degree of residual risk. Service providers should consider and document what that residual risk might look like, and whether it is proportionate compared to the established benefits of using the proxy method. This assessment would again benefit from engagement with civil society and affected groups.

If residual risks are deemed acceptable given those benefits, the last step is to implement safeguards and proceed with the use of proxies. Otherwise, service providers may need to consider whether further safeguards might be required, or whether the use of proxies is justifiable at all. Residual risk should also be reviewed on an ongoing basis to ensure new risks associated with changes in context are captured and mitigated.

5. Enabling a better landscape

The current landscape of options for accessing demographic data is not ideal, and has significant limitations. Organisations are required to navigate significant legal, ethical, and practical challenges to either collect demographic data or infer it via the use of proxies. Evidence suggests that members of the public are likely to feel more comfortable sharing their data when governance mechanisms offer them greater privacy and control over their demographic data, particularly in sectors where levels of public trust in data sharing are lower.

In this section, we reflect on what needs to happen to improve this ecosystem, and make it easier for organisations to responsibly use demographic data to address bias.

This requires the development and scaling up of ambitious data access solutions that best mitigate ethical risks, are most practical for service providers and users, and are trusted by members of the public. Data intermediaries are one promising area for further development, as are complementary governance mechanisms like data trusts and technical interventions such as privacy-enhancing technologies.

5.1 Role of government and regulators

As government, we have a key role to play in spurring responsible innovation in this area, and a variety of work is underway to support this.

Firstly, although demographic data can already be legally processed by service providers for bias detection and mitigation, some organisations may find that the existing data protection framework is complex and difficult to navigate in this area.

In September 2021, the government launched a consultation on reforms to the UK’s data protection laws, including seeking views on provisions relating to processing personal data for bias detection and mitigation. Respondents agreed there should be additional legal clarity on how sensitive data can be lawfully processed for bias detection and correction, and some felt that introducing a new processing condition under Schedule 1 of the Data Protection Act 2018 would be beneficial. As outlined in the government response, the government is introducing a statutory instrument to enable the processing of sensitive personal data for the purpose of monitoring and correcting bias in AI systems, with appropriate safeguards. This measure fits into the wider approach the government is developing around this issue, as proposed in the White Paper on AI regulation, which is currently out for consultation.

Regulators have already published relevant data protection and equalities guidance related to AI, as well as some data access solutions, including proxies and privacy-enhancing technologies. However, greater clarity around service providers’ equality obligations with respect to detecting and mitigating bias in their AI systems would be welcome, and could further incentivise service providers to take action to improve the fairness of their systems. Continued regulatory collaboration between the ICO, EHRC and relevant sectoral regulators will also be critical moving forward to ensure the responsible collection and use of demographic data to improve the fairness of AI systems, particularly where novel solutions to generate and share this data are being tested.

The government also has an important role to play in incentivising innovation and supporting the development and scaling up of promising solutions to enable responsible access to demographic data. As announced alongside this report, the CDEI plans to run a Fairness Innovation Challenge to support the development of novel solutions to address bias and discrimination across the AI lifecycle. The challenge aims to provide greater clarity about which data access solutions and AI assurance tools and techniques can be applied to address and improve fairness in AI systems, and to encourage the development of holistic approaches to bias detection and mitigation that move beyond purely technical notions of fairness.

In the National Data Strategy, the government also committed to support the development of a thriving data intermediary ecosystem by considering the role of competition, horizontal governance structures, and strategic investment in intermediary markets. The ongoing work in this area could serve to support the emergence of intermediaries that are able to play a useful role in this area.

One specific area of focus for this work relevant here is support for the development of a data assurance ecosystem to ensure that new data access solutions, particularly data intermediaries, are trustworthy. There is a burgeoning ecosystem for data assurance in the UK, but work to understand how such services could demonstrate the trustworthiness of, and support the uptake of, data intermediaries is in its early stages. The ODI has published research exploring the data assurance landscape in support of the government’s National Data Strategy. Further research could explore the extent to which the existing data assurance market can engender confidence in new data access solutions and meet the needs of different stakeholders, and identify potential gaps in the ecosystem.

5.2 Role of service providers

As discussed above, service providers should already be taking action to identify and address bias in AI systems that they deploy.

Those seeking to collect demographic data themselves should refer to guidance from the ICO around processing of personal data, including special category data, to ensure their collection and use of demographic data is legally compliant. In addition, the ONS has issued guidance around some demographic categories that service providers could use when seeking to collect data for the purposes of measuring equality. Service providers should give consideration to the ways in which they can mitigate the risks associated with demographic data collection, for example, by using more participatory and inclusive approaches to data collection.

In some cases, proxies may be a more suitable alternative to collecting demographic data themselves. Service providers should refer to the key ethical considerations in the previous section of this report, as well as the ICO’s Guidance on AI and Data Protection and other sector-specific guidance, to determine whether such approaches are appropriate and, if so, how they could be used responsibly.

Given the growing imperative on many service providers to access demographic data, they should demand solutions from the market that better meet their needs, and the needs of their users, by embedding ethical best practice, supporting them to navigate regulation, and providing more practical services. Novel data governance approaches, such as data intermediaries, alongside complementary governance and technical interventions, could help to meet service providers’ needs, and demand for these solutions could stimulate innovation in these areas.

5.3 Role of researchers and civil society

There is also an important role for the research community in providing continued research into and piloting of solutions to enable responsible demographic data access. More comparative studies of proxy methods using the same test datasets and performance criteria, particularly filtered or weighted accuracy scores, could help service providers to make better informed decisions as to whether such methods are sufficiently accurate for different demographics and acceptable for use in making assessments about fairness.

Data quality is also a persistent challenge whether service providers collect demographic data themselves or access it using a generative method or through a third party. Further research into and piloting of novel approaches to improve data quality, such as participatory approaches to data collection, would be beneficial.

Finally, civil society groups have a key role to play in informing and mobilising members of the public and ensuring that solutions and services to enable responsible access to demographic data protect their rights and interests. Demographic data is of vital importance in detecting and correcting bias in AI systems, yet the collection and use of this data poses risks to individuals, particularly those from marginalised groups. Civil society groups can help to raise awareness among individuals, including members of marginalised communities, about the importance of access to demographic data in tackling bias, whilst simultaneously calling for the development of solutions and services that give people greater autonomy, protect their privacy, and are worthy of their trust. Crucially, civil society groups can also help to amplify the voices of marginalised communities in debates around the design and development of new solutions, ensuring they are consulted and their views accounted for.

  1. For example, see Bank of England, ‘Machine Learning in UK Financial Services’, Local Government Association (LGA), ‘Using predictive analytics in local public services’, and NHS England, ‘Artificial Intelligence’. 

  2. See the EHRC’s ‘Five components of data collection and analysis’ (pg. 54) in ‘Measurement Framework for Equality and Human Rights’. 

  3. Definition from CDEI’s report, ‘Unlocking the value of data: Exploring the role of data intermediaries’. 

  4. Meta, ‘How Meta is working to assess fairness in relation to race in the U.S. across its products and systems’, Airbnb, ‘Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data’. 

  5. See Airbnb, ‘Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data’, where Airbnb assessed bias on their platform on the basis of ‘perceived race’. 

  6. Notable exceptions include the Partnership on AI’s Workstream on Demographic Data, as well as some academic scholarship, including Michael Veale and Reuben Binns, ‘Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data’ (2017). 

  7. For example, see Bank of England, ‘Machine Learning in UK Financial Services’, Local Government Association (LGA), ‘Using predictive analytics in local public services’, and NHS England, ‘Artificial Intelligence’. 

  8. See the EHRC’s ‘Five components of data collection and analysis’ (pg. 54) in ‘Measurement Framework for Equality and Human Rights’. 

  9. See CDEI, ‘Review in bias in algorithmic decision-making’ and Open Data Institute (ODI), ‘Monitoring Equality in Digital Public Services’. 

  10. Some protected characteristics, including race, ethnicity, disability, and sexual orientation, are also special category data under the UK General Data Protection Regulation (GDPR) and the Data Protection Act 2018. 

  11. The equality of opportunity condition (Schedule 1, 8.1(b) of the Data Protection Act 2018) does not cover all special category data (e.g. trade union membership is not included). 

  12. Definition from CDEI’s report, ‘Unlocking the value of data: Exploring the role of data intermediaries’. 

  13. Pilots include those by the Open Data Institute (ODI), Data Trusts Initiative, and the Liverpool City Region Civic Data Cooperative. 

  14. Definition from CDEI’s report, ‘Unlocking the value of data: Exploring the role of data intermediaries’. 

  15. Including, for example, the ISO/IEC CD 5259-1 series, which is currently under development. 

  16. The ICO provides guidance about when group inferences are personal data. 

  17. Special category data includes personal data revealing or concerning data about a data subject’s racial or ethnic origin, political opinions, religious and philosophical beliefs, trade union membership, genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, sex life and sexual orientation.