Guidance

4. Understand the limitations of the data

How to implement principle 4 of the Data Ethics Framework for the public sector.

Even if your use of data is legal and proportionate, there may be limitations to the data that make your proposed approach inappropriate, unreliable or misleading - and therefore unethical as a basis for public sector policy making or service design.

Things to consider when deciding if a source of data is suitable include:

  • provenance (for example how and why the data was collected)
  • errors in the data
  • bias (from historical decision making, unrepresentative surveys or social media)
  • ambiguous metadata and field names

Provenance

When designing a new use of data, you must understand the impact of data provenance on accuracy, reliability and representativeness.

Specifically assess the impact of the following (a basic profiling sketch follows this list):

  • the source of the data, such as a transactional service, survey, administrative task in a public sector organisation, a government department, social media or open dataset
  • whether the data was collected by humans or an automated system
  • how well the data reflects its target population
  • any likely omissions, exclusions or systematic biases
  • patterns in the data and whether they are likely to stay static or change over time
  • quality assurance processes when the data was collected
  • the sampling strategy used to collect the data
  • any other problems surrounding data collection
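
As a starting point, a basic profiling pass can surface several of the issues above before detailed analysis begins. The sketch below uses Python with pandas; the file name and column names ('source', 'collected_at', 'region') are hypothetical placeholders, not a prescribed schema.

    # A minimal provenance-profiling sketch, assuming a pandas DataFrame
    # loaded from your source system. The file and column names are
    # hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("records.csv", parse_dates=["collected_at"])

    # Where did the records come from, and in what proportions?
    print(df["source"].value_counts(normalize=True))

    # Does the collection period cover the range you intend to describe?
    print(df["collected_at"].min(), df["collected_at"].max())

    # Are some fields or groups missing or under-recorded
    # (likely omissions, exclusions or systematic biases)?
    print(df.isna().mean().sort_values(ascending=False))
    print(df.groupby("region").size())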

Errors

Errors in data are inevitable. However, it can be difficult to understand how frequent they are, whether they are random or systematic, what caused them, and how to mitigate or remove them. Errors are not always immediately obvious, especially in large datasets. Simple data visualisations can be the best way of spotting anomalies and systematic errors.
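
For example, a couple of quick plots will often surface impossible values, spikes at default codes and shifts over time. This is a minimal sketch in Python, assuming a pandas DataFrame with hypothetical columns 'amount' and 'recorded_on'.

    # A minimal sketch of using simple visualisations to surface anomalies.
    # The file and column names ('amount', 'recorded_on') are hypothetical.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("records.csv")

    # A histogram exposes impossible values, spikes at default codes
    # (for example 0 or 999) and heavy tails.
    df["amount"].hist(bins=50)
    plt.title("Distribution of amount")
    plt.show()

    # A box plot per month can reveal systematic errors introduced at a
    # particular point in the collection process.
    df["month"] = pd.to_datetime(df["recorded_on"]).dt.to_period("M").astype(str)
    df.boxplot(column="amount", by="month")
    plt.show()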

You will need to consider and document how identified errors will impact the work.

If you find errors in the way data is collected or interpreted, report them to policy or operational staff.

The UK Statistics Authority’s Quality Assurance of Administrative Data framework provides useful resources to help you understand the data you are using, how it was collected and any likely impacts on quality.

Bias

You should be aware of the types of bias that can exist in the data you are using by reviewing how the data was collected.

Bias can be introduced into datasets in many ways, including through collection techniques, the limited representativeness of a particular cohort, and social bias from historical decision making.

Carefully considering potential bias and its impact on the outputs of data analysis is a well-established technical practice. When data is used in wider contexts, such as to inform policy or service design, it’s critical to involve policy or subject matter experts to fully consider types of bias that might not be immediately obvious to a data practitioner.

Measurement bias

Measurement bias is the selection of data or samples in a way that does not represent the true parameters (or distribution) of the target population.
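
A toy simulation can make this concrete: if the collection process systematically misses part of the population, estimates computed from the observed data will not match the true parameters. The population and selection rule below are invented purely for illustration.

    # A toy simulation of measurement (selection) bias. The population
    # and the selection rule are invented for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.normal(loc=50, scale=10, size=100_000)  # true mean = 50

    # Suppose the collection process systematically misses low values,
    # for example because people with low scores rarely use the service.
    observed = population[population > 45]

    print(population.mean())  # close to 50 (the true parameter)
    print(observed.mean())    # noticeably higher: a biased estimate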

Social bias

Any data source about citizens, whether collected from services, surveys or elsewhere, will contain some level of social bias, because the information is based on historical decisions and actions by humans, or was shaped by laws no longer in force.

Bias in training data leads to bias in algorithms. Machine learning is a data-driven technology and the characteristics of the data are reflected in the properties of the algorithms.
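
The following toy sketch illustrates this: a classifier trained on labels that encode a historical rule against one group will reproduce that rule, even where group membership should be irrelevant. The data, the groups and the ‘historical rule’ are entirely invented.

    # A toy illustration that models learn whatever is in their training
    # labels. All data here, including the 'historical rule', is invented.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 10_000
    group = rng.integers(0, 2, size=n)  # a sensitive attribute
    merit = rng.normal(size=n)          # the factor that *should* decide

    # Historical decisions penalised group 1 regardless of merit.
    historical_label = ((merit > 0) & (group == 0)).astype(int)

    X = np.column_stack([merit, group])
    model = LogisticRegression().fit(X, historical_label)

    # The trained model reproduces the historical penalty: approval
    # rates differ by group even though merit is identically distributed.
    preds = model.predict(X)
    print(preds[group == 0].mean(), preds[group == 1].mean())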

Read more about social bias in algorithms in Principle 5.

Social media

Data from social media sources may give valuable real-time or historical insight, but you should investigate the data properly to identify any representation or selection bias. Include metrics and caveats about who or what the data is representative of and, importantly, what you cannot determine from the data.
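
One simple, illustrative way to produce such metrics is to compare the profile of your sample against a trusted population benchmark. The figures below are invented placeholders; in practice you would use census or survey benchmarks.

    # A minimal representativeness check, comparing the age profile of a
    # social media sample against a population benchmark. All figures are
    # invented placeholders.
    sample_share = {"18-24": 0.34, "25-44": 0.41, "45-64": 0.19, "65+": 0.06}
    population_share = {"18-24": 0.11, "25-44": 0.34, "45-64": 0.31, "65+": 0.24}

    for band in sample_share:
        ratio = sample_share[band] / population_share[band]
        print(f"{band}: sample is {ratio:.1f}x its population share")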

Practitioner bias

Data practitioners and others involved in a project may inadvertently introduce their own confirmation bias into the design of projects, the analysis, or the interpretation of outputs. Ensuring a diverse team from a range of backgrounds is a good way to mitigate potentially damaging practitioner bias.

Survey methodology

Surveys must be carefully designed and used to ensure they cover your target population. Low response rates may mean it’s inappropriate to use survey data.

If the proportion of non-respondents, or ‘invisible data’, is too high in a survey, it may be irresponsible to describe the results of work with this data as representative. Determining the proportion of ‘invisible data’ is also crucial when using machine learning to spot correlations or network effects.

Response and selection bias affect the generalisability of findings: with high bias, you cannot infer that patterns observed among respondents exist in the wider population.
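
A simple check on non-response, such as the sketch below, can be documented alongside your findings. The counts and the 30% threshold are illustrative assumptions only; agree an acceptable threshold with your methodologists or statisticians.

    # A minimal response-rate check. The counts and the 30% threshold are
    # illustrative assumptions, not an official cut-off.
    invited = 5_000
    responded = 1_150

    response_rate = responded / invited
    invisible = 1 - response_rate

    print(f"Response rate: {response_rate:.1%}")
    print(f"'Invisible data' (non-response): {invisible:.1%}")

    if invisible > 0.30:
        print("High non-response: treat results as non-representative "
              "unless non-response analysis shows otherwise.")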

Metadata and field names

Metadata and the names of fields in datasets can be misleading or inaccurate. It’s critical that you work with the subject matter experts for the dataset to understand whether it is fit for purpose for your project. Improve the documentation of the metadata if you can.
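
A lightweight check, such as the sketch below, can flag undocumented or missing fields early. The documented fields and file name here are hypothetical.

    # A minimal sketch of checking a dataset against its documented data
    # dictionary. The expected fields and file name are hypothetical.
    import pandas as pd

    documented_fields = {"person_id", "date_of_birth", "region_code"}

    df = pd.read_csv("records.csv")
    actual_fields = set(df.columns)

    print("Undocumented fields:", actual_fields - documented_fields)
    print("Documented but missing:", documented_fields - actual_fields)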

Published 13 June 2018