How to implement principle 3 of the Data Ethics Framework for the public sector.
Your use of data must be proportionate. Whether your proposed data collection, storage or analysis is proportionate will depend on:
- the user need and expected public benefit, which according to Principle 1, you should be able to clearly demonstrate
- the type of data (personal or non-personal)
- whether personal data can be de-identified (pseudonymised or anonymised)
- how and where you got the data
If you decide your proposed data use isn’t proportionate, then either:
- change your data source, collection mechanism or analysis technique so that it becomes proportionate
- consider if there’s some other way to meet your user need (like qualitative user research)
You must not proceed with your project if your data use is not proportionate to the user need.
When deciding if a use of data is proportionate, you should document the process used to determine this and any supporting evidence.
Personal data and proportionality
De-identifying personal data before use
When using personal data it is good practice to work with data that has been de-identified to the greatest degree possible.
This process is often called anonymisation, although it is important to note that many forms of data can never be fully and irreversibly anonymised. Pseudonymisation is a related process of removing the most re-identifying components of the data and storing them separately.
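As an illustration only, pseudonymisation as described above can be sketched in Python. The field names, the choice of a keyed hash (HMAC-SHA256) to generate the pseudonym, and the function name `pseudonymise` are assumptions made for this example and are not prescribed by the framework or by the ICO.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names: which fields count as direct identifiers
# depends on your dataset and is an assumption here.
DIRECT_IDENTIFIERS = ("name", "national_insurance_number")


def pseudonymise(records, secret_key, id_fields=DIRECT_IDENTIFIERS):
    """Split records into de-identified rows and a separate lookup table.

    Returns (working_rows, lookup). The lookup maps each pseudonym back to
    the removed identifiers and is meant to be stored separately from the
    working rows, with stricter access controls.
    """
    working_rows = []
    lookup = {}
    for record in records:
        identifiers = {f: record[f] for f in id_fields if f in record}
        # Keyed hash of the identifiers, so the pseudonym cannot be
        # recomputed by someone who does not hold the key.
        message = "|".join(str(identifiers.get(f, "")) for f in id_fields)
        pseudonym = hmac.new(
            secret_key, message.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        lookup[pseudonym] = identifiers
        # Keep every non-identifying field and attach the pseudonym.
        row = {k: v for k, v in record.items() if k not in id_fields}
        row["pseudonym"] = pseudonym
        working_rows.append(row)
    return working_rows, lookup


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # keep the key apart from the data
    sample = [
        {
            "name": "A. Example",
            "national_insurance_number": "QQ123456C",
            "region": "North East",
            "visits": 3,
        }
    ]
    rows, lookup_table = pseudonymise(sample, key)
    print(rows)
```

The fields left in the working rows (here `region` and `visits`) can still contribute to re-identification when combined with other data, so this sketch reduces identifiability but does not anonymise the data.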
If data is anonymised to the greatest degree possible, it is likely to be out of scope of data protection law as it is no longer considered personal data. Pseudonymised data, however, is subject to the same laws as fully identifiable personal data. Recital 26 of the GDPR provides further information.
If you plan to anonymise or pseudonymise personal data before linking or analysis, make sure you follow the ICO’s Anonymisation: managing data protection risk code of practice and document your methods.
You can find more technical advice in the UK Anonymisation Network’s anonymisation guidance.
You may also be able to use synthetic data, which replaces sensitive data with a set of plausible values from which individual citizens would not be identifiable. Synthetic data is becoming increasingly popular for training machine learning models while preserving individuals’ privacy. The trained model can then be used to make predictions on real data. Synthetic data is only as good as the generative method used to produce it, and weaknesses in that method can reduce the performance of the model when it makes predictions on real data. This can be a useful approach when working with very sensitive data or on specific machine learning applications such as computer vision.
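The dependence on the generative method can be seen in a deliberately naive sketch. The generator below is an assumption chosen for simplicity, not a recommended method: it samples each column independently from the values observed in the real data. The function names and example fields are hypothetical.

```python
import random


def fit_marginals(rows):
    """Collect the observed values of each column, ignoring row structure."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sample_synthetic(columns, n, seed=None):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in columns.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical example records.
    real = [
        {"age_band": "30-39", "region": "North East", "uses_service": True},
        {"age_band": "40-49", "region": "South West", "uses_service": False},
        {"age_band": "30-39", "region": "South West", "uses_service": True},
    ]
    synthetic = sample_synthetic(fit_marginals(real), n=5, seed=0)
    for row in synthetic:
        print(row)
```

Because the columns are sampled independently, any relationship between them (for example between `age_band` and `uses_service`) is lost, so a model trained on this output may perform poorly on real data. Rare values are also reproduced unchanged, so a generator this simple does not by itself guarantee that individuals cannot be identified.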
It is important to remember that pseudonymising or anonymising data does not make it automatically appropriate to use. It is possible to make incorrect inferences and develop potentially intrusive or damaging policies based on less identifiable data. Considering the Data Ethics Principles holistically should help you make a decision on whether your approach is proportionate and appropriate.
You should assess the proportionality of your proposed approach, that is, whether it is necessary and appropriate. The following questions must be answered in multidisciplinary teams made up of data practitioners and subject matter and operational experts. Evidence to support your decision should be recorded and accessible to any individual joining the project at a later date.
- is the measure suitable to achieve the aim?
- is the measure necessary to achieve the aim?
- would the proposed use of data be deemed inappropriate by those who provided the data?
- would the proposed use of data for secondary purposes make it less likely that people would want to give that data again for the primary purpose it was collected for?
Some other ways to understand if the data you intend to use is proportionate to the user need are:
- user research or speaking to the public about your plans
- getting advice from ethics committees and individual experts, like the National Statistician’s Data Ethics Advisory Committee
- reviewing public dialogue studies on how citizens feel about data use, including the Government Data Science Partnership and Ipsos Mori review of government use of data science and the Royal Society and British Academy Data governance: Public engagement review
- speaking to colleagues in different disciplines to get a broader view of the policy issue, including your organisation’s Data Protection Officer
- getting advice from the Government data science community and the Data Leaders Network
Data sources and proportionality
Data insight used to inform policy making and service design must be representative and accurate.
Government generally relies on four sources of data for analysis:
- repurposed operational data
- repurposed third party data e.g. social media
- statistics (derived from some other form of raw data)
- purposefully collected data (through new processes)
No one data source is in itself inappropriate to use for analysis, but the needs of the user must always be considered alongside any relevant data protection subject rights, to assess the overall suitability.
For any data being repurposed for analysis, without original individual consent, you must assess whether or not the new purpose is compatible with the original reason for collection (Article 6(4) GDPR).
Repurposed operational data
You must consider the proportionate use of departmental operational data, that is, data collected through the operation of the department or the delivery of one of its services. Although most organisations have well-established processes for using operational data to improve services, proportionality must be considered for each new piece of work.
Repurposed third party data
Repurposed personal data from third party sources
If you obtain personal data from a source other than the individual themselves, you must be able to make this fair and transparent. This means being able to provide data subjects with the information listed in Articles 13 and 14 of the GDPR within a reasonable period, and no later than one month, after obtaining their personal data.
You should always consider the effects that using this data will have on data subjects and what their reasonable expectations are likely to be. You also need to determine if you are obtaining any special category personal data or data relating to criminal convictions and offences, to which the GDPR and DPA 2018 give more protection. In order to lawfully process special category data, you must identify both a lawful basis under Article 6 of the GDPR and a separate condition for processing special category data under Article 9, as supplemented by Section 10 and Schedule 1 of the Data Protection Act 2018.
Social media data
Social media data must be used responsibly. Some social media data may be considered too intrusive to use without an individual’s consent. It can also be difficult to determine the representativeness of social media data when working at regional or national level.
The Government Social Research (GSR) profession has published guidelines for using social media data responsibly in research. The GSR have also published a social media ethics grid which aims to aid ethical decision making when using social media data.
Web scraped data
You need to consider if using web scraped data is appropriate for the intended data analysis. You should also ensure individuals’ privacy is respected. The fact that information is publicly accessible does not automatically provide a lawful basis for processing it.
Even where its use is legal, citizens may feel scraping of particular websites is not ethical. Scraping information citizens consider private can be controversial. If scraping social media sites and forums, decide whether your intended use is intrusive or breaches citizen trust.
When web scraping you must:
- always respect website terms and conditions and the robots exclusion protocol (robots.txt)
- make sure you do not breach intellectual property rights if you republish any data sourced from the web
- schedule web scraping activities to minimise the impact on target websites
- not scrape websites anonymously - make sure an identifiable IP address is visible
- have data protection processes in place to manage any personal data you unintentionally collect
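Some of the requirements above can be supported in code. The following is a minimal sketch using Python’s standard library, showing a robots.txt check, a delay between requests and a self-identifying request header. The user agent string, contact address, default delay and function names are hypothetical. Sending an identifying User-Agent header is an assumption that complements, but does not replace, scraping from an identifiable IP address. Website terms and conditions, intellectual property and data protection still need to be reviewed by a person.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical values: replace with your organisation's own details.
USER_AGENT = "example-dept-research-bot/0.1 (contact: data-team@example.gov.uk)"
DEFAULT_DELAY_SECONDS = 5.0  # assumed pause when robots.txt sets no Crawl-delay


def load_robots(robots_url):
    """Download and parse a site's robots.txt file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser


def plan_request(parser, url, user_agent=USER_AGENT):
    """Return the delay in seconds to wait before fetching url.

    Returns None if robots.txt disallows the URL for this user agent.
    """
    if not parser.can_fetch(user_agent, url):
        return None
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


def fetch(url, parser, user_agent=USER_AGENT):
    """Fetch url politely, or return None if robots.txt disallows it."""
    delay = plan_request(parser, url, user_agent)
    if delay is None:
        return None
    time.sleep(delay)  # spread requests out to limit load on the site
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

A typical use would be to call `load_robots` once per site with the address of its robots.txt file, then call `fetch` for each page, skipping any page for which it returns `None`.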
Purposefully, newly collected data
If you are collecting new data specifically designed for your project through a tool or application, an End User Licensing Agreement (EULA) must be presented to users. To ensure informed consent, the EULA must fully explain the terms and conditions of their data disclosure, including the objective of the data collection and when it will be destroyed.
Consent is one of six lawful bases for processing personal data, under data protection legislation. GDPR makes it clear that if relying on consent from data subjects, it must be informed, unambiguous and involve a clear affirmative opt-in. Read the UK Data Service guidance on getting consent for research. If your data use falls under a public body carrying out its tasks, it is unlikely that any consent will be considered ‘freely given’. This means consent is unlikely to provide a valid legal ground for data processing.
Statistics
The production of statistics is covered by the statutory Code of Practice for Statistics and is subject to independent regulation, ensuring that these data uphold rigorous quality standards, are well documented and are free from bias. Statistics can originate from survey, census and operational data sources, but have usually undergone extensive post-processing and quality assurance.
Statistics are already aggregated and released according to strict statistical disclosure procedures so there is a lower risk of disproportionate use.