Guidance

The Government Data Quality Framework

Published 3 December 2020

Foreword

We find ourselves living in a society rich with data and the opportunities presented by this. In such an age, it is essential that public bodies have confidence that the data they access and process is fit for its intended purpose. Government’s ambitions around digital transformation of public services and the UK becoming a world leader on AI are predicated on access to good quality data to inform decision-making and service delivery.

Yet concerns have been raised over the quality of data collected, created and used by government. Poor quality data in government leads to failings in services provided, poor decision-making, and an inability to understand how to improve. The 2019 Public Accounts Committee Report (PDF, 303KB) showed that data has not been treated as an asset, and that it has become normal to ‘work around’ poor-quality, disorganised data.

The extent of the data quality problem within government is poorly understood. Work on data quality is often reactive and not evidence-based. Where quality problems have been identified, the symptoms are often treated instead of the cause, leading to ineffective improvements and wasted resources.

Government needs a more structured approach to understanding, documenting and improving the quality of its data. This framework provides that, through data quality work that is proactive, evidence-based and targeted. It presents a set of principles for effective data quality management, and provides practical advice to support their implementation. While there is no such thing as ‘perfect quality’ data, we must strive for a culture of continuous improvement. All public servants should understand why data quality is important, and feel able to proactively identify and address data quality issues.

Through improved management of data, government can achieve the high quality data needed to deliver better outcomes for society. For many organisations, this is a journey that will take time and commitment. We ask that all government departments endorse and adopt this framework, and work to align their approach to data quality with these principles.

Professor Sir Ian Diamond, National Statistician and Alex Chisholm, Chief Operating Officer for the Civil Service

Acknowledgments

The Government Data Quality Hub would like to thank the Data Management Association of the UK (DAMA UK) for their input into the development of this Data Quality Framework. The framework draws heavily on the Data Management Body of Knowledge (DMBoK) and DAMA UK’s Data Quality Dimensions white paper.

We would also like to thank the Data Standards Authority for their input and the Cabinet Office, Home Office, Office for National Statistics, NHS Digital, Environment Agency and Government Digital Service for contributing case studies.

Why do we need a data quality framework?

Data is fundamental to effective, evidence-based decision-making. It underpins everything from major policy decisions to routine operational processes. Often, however, our data is of unknown or questionable quality. This presents huge challenges. Poor or unknown quality data weakens evidence, undermines trust, and ultimately leads to poor outcomes. It makes organisations less efficient, and impedes effective decision-making. To make better decisions, we need better quality data.

At a high-level, data quality can be thought of as ‘fitness for purpose’ – is this data set good enough for what I want to use it for? The level of quality required will vary depending on the purpose, but will often consider several dimensions. Data quality is more than just data cleaning.

At present, we lack a consistent approach to managing data quality across government. This framework draws on international and industry best practice and sets out a series of principles, practices and tools aimed at achieving fit for purpose data. The framework asks organisations to develop a ‘culture’ of data quality, by treating issues at source, and committing to ongoing monitoring and reporting. It advises targeting improvements where they add most value, and encourages individuals to proactively manage data quality in their roles.

Applying the framework will enable us to make better use of our data.

The strategic context

The publication of this Data Quality Framework is a commitment made in the National Data Strategy under the Data Foundations pillar. The National Data Strategy recognises that by improving the quality of data, we can drive better insights and outcomes from its use. The publication of this framework is an integral part of the strategy’s commitment to tackle the cultural and coordination barriers to good quality data, and to ensure that the true value of data is realised.

The framework focuses primarily on assessing and improving the quality of input data rather than the quality assurance of analytical outputs. The HM Treasury Aqua Book provides guidance on quality in the production of analysis, while the Code of Practice for Statistics sets out the principles to ensure that statistics are fit for their intended purpose.

The framework complements existing ambitions to improve the quality of government data and analysis, such as those in the Government Analysis Functional Standard and the UK Statistics Authority five year strategy. It draws on international best practice in quality management – such as the International Organization for Standardization’s Quality Management Principles – and translates this into the context of government data.

How to use the data quality framework

The framework is relevant for anyone working directly or indirectly with data in the public sector. This includes data practitioners, policy-makers, operational staff, analysts, and others producing data-informed insight. Senior leaders should be advocates for the framework in their departments, and should encourage staff to adopt the practices in their roles. All civil servants should familiarise themselves with the data quality principles and, where relevant, apply them in their context.

The framework is split into two parts, and accompanied by a set of case studies.

The first part provides a structure for organisations and individuals to frame their thinking around data quality: the data quality principles, the data lifecycle and the data quality dimensions.

The second provides guidance on practical tools and techniques which can be applied to assess, communicate and improve data quality.

The ask to adopt the framework is directed at central government. Many of the concepts and approaches are broadly applicable, however, and the framework serves as a useful guide for anyone wanting to improve data quality.

Data quality principles

These principles are guidelines to aid the creation of a strong data quality culture in your team or organisation. They explain the best practice, procedures and attitudes that will be most helpful to ensuring your data is fit for purpose.

These principles should lie at the heart of your approach to data quality and be supported by the application of the products within the framework. Each principle is accompanied by a set of practices which support their adoption.

The principles are:

  1. Commit to data quality

  2. Know your users and their needs

  3. Assess quality throughout the data lifecycle

  4. Communicate data quality clearly and effectively

  5. Anticipate changes affecting data quality

1. Commit to data quality

Create a sense of accountability for data quality across your team or organisation, and make a commitment to the ongoing assessment, improvement and reporting of data quality.

1.1 Embed effective data management and governance

Fit for purpose data depends on effective data management and governance practices.

Individuals and organisations should:

  • adopt formal data governance practices to ensure that data is managed properly
  • adhere to agreed data principles, such as those being developed as part of the National Data Strategy
  • apply data standards to ensure that data is reusable and interoperable
  • ensure that accountability for data quality is present at all levels:
    • Data leaders should guide an organisation or team’s strategic direction by ensuring awareness and improvement of data quality
    • Data practitioners should ensure that measuring, communicating and improving data quality is at the forefront of activities relating to data

1.2 Build data quality capability

Individuals and organisations should:

  • dedicate time and resource to building capability in assessing, improving and communicating data quality through training and sharing best practice
  • include best practice in data quality management (such as the data quality dimensions) as part of training materials

1.3 Focus on continuous improvement

Continuous improvements can help to avoid data quality problems before they occur.

Individuals and organisations should:

  • benchmark and regularly assess levels of data quality over time to track changes in quality
  • prioritise and iterate effective improvements to achieve fit for purpose data
  • use data quality action plans to identify and define where efforts should be prioritised

2. Know your users and their needs

Understanding user needs is essential to ensuring that data is fit for purpose. Research and understand your users’ needs. Prioritise efforts on the data which is most critical.

What is a user?

Users are the teams, businesses, services and people that will be making use of your data. For example, they may have business needs that rely on fit for purpose data from a trusted source, or they may be an enquiring member of the public looking to understand more about their local area.

You may have more than one type of user of your data. Different users’ needs may conflict, so it is important to balance these needs and prioritise having fit for purpose data. It is unlikely that data will be equally fit for all purposes.

More detailed information on users can be found in the GOV.UK Service Manual and, in the context of users of Official Statistics, in the forthcoming User Engagement Strategy for Statistics.

2.1 Research your users and understand their quality needs

To achieve fit for purpose data, it is essential to understand your users’ quality needs.

Individuals and organisations should:

  • proactively engage with users to understand their priorities
  • carry out user research if faced with a large, complex or poorly understood group of users
  • capture a range of user needs for data which has multiple uses
  • balance the conflicting needs of users where possible and prioritise improvements which have the greatest impact
  • regularly communicate with users to understand any changes in their requirements

3. Assess quality throughout the data lifecycle

Data should be managed across its lifecycle, paying close attention to quality measures and assurance at each stage.

What is a data lifecycle?

The data lifecycle is a way of describing the different stages that data will go through, from design and collection to dissemination and archival/destruction. The flow of data is not always sequential, so you may need to return to previous stages to fix data quality issues.

The purpose of the data and its lifecycle should be well understood by anyone who handles the data, from its collection to the eventual output.

More detailed information is available in the data lifecycle section of the framework.

3.1 Assess data quality at all stages of the lifecycle

Quality assurance should take place across the entire data lifecycle. Data quality issues can occur at any stage and can have knock-on effects for the rest of the lifecycle.

Individuals and organisations should:

  • assess data quality at every stage and take proactive measures to improve quality when issues arise
  • adopt appropriate assessment measures at each stage rather than applying a one-size-fits-all approach to quality assurance
  • focus quality improvements as early in the lifecycle as possible to maximise their effectiveness

3.2 Communicate with users and stakeholders across the lifecycle

Different stakeholders will often be involved across the data lifecycle.

Individuals and organisations should:

  • develop effective communication channels with and between stakeholders to ensure a broad understanding of data quality
  • communicate any changes in data quality to stakeholders at all stages of the lifecycle
  • proactively engage with data providers to ensure a clear understanding of data quality requirements

4. Communicate data quality clearly and effectively

Communicate quality to users regularly and clearly to ensure data is used appropriately.

4.1 Communicate data quality to users

Individuals and organisations should:

  • provide clear data quality information and describe its impact on use of the data
  • communicate trade-offs in data quality clearly to aid understanding of the data’s strengths and weaknesses
  • be transparent about the quality assurance approach taken and communicate data quality issues clearly to users
  • build strong relationships with suppliers of external data to identify data quality problems at source
  • inform users in advance about changes made to data processes which could impact on quality
  • communicate clearly and in plain language, following relevant style guides for published materials
  • provide clear definitions of terminology used and not presume a high level of user understanding of data quality

4.2 Provide effective documentation and metadata

Individuals and organisations should:

  • document and share metadata to minimise ambiguity and enhance opportunity for data access and reuse
  • document and report data quality issues and be transparent about steps being taken to address them

5. Anticipate changes affecting data quality

Not all future problems can be predicted. Where possible, anticipate and prevent future data quality issues through good communication, effective management of change and addressing quality issues at source.

5.1 Plan for the future

Individuals and organisations should:

  • use root cause analysis to solve data quality issues at source, rather than apply temporary fixes
  • regularly communicate with users to keep up with changing data and quality requirements
  • proactively consider the impact of changes in systems on data quality
  • integrate quality processes into the design of new data systems
  • ensure metadata and other supporting documentation is thorough and up-to-date

The data lifecycle

What is the data lifecycle?

The data lifecycle is a way of describing the different stages that data will go through, from collection to dissemination and archival/destruction. The purpose of the data and its lifecycle should be well understood by anyone who handles the data, from its collection to the eventual output.

This section of the framework describes the stages of the data lifecycle in more detail, and outlines quality issues that may occur at each stage.

Quality across the data lifecycle

Quality assessment and assurance should take place at each stage of the lifecycle. The measures used will change at each stage.

Throughout the data lifecycle, those involved should be aware of future users of the data and possible onward uses of the data, and should ensure that data quality at each stage is documented and communicated clearly.

Data practitioners may sometimes need to return to earlier stages in the lifecycle to correct data quality problems.

The stages of the data lifecycle

The data lifecycle illustrated here is not intended to be prescriptive. It shows the journey that data typically takes through an organisation and identifies points at which data quality problems could occur. The actual data lifecycle for an organisation will be specific to that organisation and its processes.

Data leaders may find it helpful to use the data lifecycle here to design one for their own organisation.

Plan

At this stage, an organisation or team intending to collect, store and use data plans its processes and data storage. Planning involves determining business needs, identifying what data already exists and what needs to be collected or acquired, and designing how this data will be collected and managed.

Planning is one of the most important stages in the data lifecycle. Good planning can prevent problems in data quality before they occur.

Potential data quality problems

The following data quality problems could happen at this stage:

  • poor design of data collection
  • lack of data validation rules
  • failure to specify use of master or reference data
  • lack of data standards
  • design does not consider ‘upstream’ data use

Collect or acquire, and ingest

During the collection and ingestion stage, an organisation or team will acquire data based on user needs. Quality can be improved at source by applying validation rules and capturing appropriate metadata.
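
As an illustration only, the sketch below shows one way validation rules and basic metadata capture could be applied at the point of ingestion, so that problems are flagged when the data arrives rather than discovered later. The field names, file layout and validation rules are hypothetical, and pandas is assumed to be available.

```python
import pandas as pd
from datetime import datetime, timezone

def ingest(csv_path: str) -> tuple[pd.DataFrame, dict]:
    """Load a hypothetical contact-details extract, apply simple validation
    rules at source, and record basic metadata alongside the data."""
    raw = pd.read_csv(csv_path, dtype=str)

    # Validation rules applied on arrival
    has_id = raw["student_id"].notna()
    valid_contact = raw["emergency_contact"].str.match(r"^\d{5}\s?\d{6}$", na=False)
    accepted = raw[has_id & valid_contact]
    rejected = raw[~(has_id & valid_contact)]

    # Basic metadata recorded with the ingested data
    metadata = {
        "source_file": csv_path,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "rows_received": len(raw),
        "rows_accepted": len(accepted),
        "rows_rejected": len(rejected),
    }
    return accepted, metadata
```

Rejected rows and the recorded metadata can then be fed back to the data provider, supporting the kind of feedback loop described in the list below.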

Potential data quality problems

The following data quality problems could happen at this stage:

  • errors in manual data entry
  • incomplete data
  • duplicated data
  • inconsistent formats
  • insufficient information about the data that has been received (for example from producers of administrative data)
  • insufficient or poor-quality metadata
  • a failure to ensure effective feedback loops between users of data and the collection/acquisition of data (particularly where it is outsourced to third parties)

Prepare, store and maintain

At this stage, data is prepared for storage, formatted for use at further stages in the data lifecycle, and maintained for use within the organisation. Consistent standards should be applied to the data and, where necessary, the data should be anonymised. Where possible, data should also be cleaned and linked with other records in organisational data stores. This can help to reduce quality problems such as duplication and inconsistency.

Data may then be integrated into the organisational data stores. Practitioners ensure the data is stored appropriately and provide business users with the access they need. Any data that is subject to change should have its quality monitored regularly to ensure it continues to be fit for purpose.

Potential data quality problems

The following data quality problems could happen at this stage:

  • lack of adequate data preparation
  • inaccuracies and corruption resulting from integration of data sources and feeds
  • lack of informative metadata
  • lack of documentation
  • incorrect data linking
  • inconsistent standards applied to the data, which can lead to problems when linking
  • data accuracy decaying over time
  • system changes causing inconsistencies
  • data not being actively managed

Use and process

At this stage of the data lifecycle, data is processed and used for the specified business needs. This may involve exploration and analysis of the data, as well as production of outputs.

Potential data quality problems

The following data quality problems could happen at this stage:

  • failure to adhere to organisational data management practices and principles
  • failure to identify and log errors
  • failure to understand and address known quality issues
  • failure to carry out risk-based assessment on whether to use data because of poor understanding of data quality
  • human error in manual production of analysis and outputs

Share and publish

Data is shared where it is appropriate for it to be processed for secondary purposes. Where the data is suitable for publication, it should be quality assured, anonymised and made available with appropriate documentation, including details of its quality. Open data published by public authorities should be released in consistent and accessible formats to improve its utility.

Potential data quality problems

The following data quality problems could happen at this stage:

  • unidentified errors in the shared or published data due to poor quality assurance
  • publication of low-quality data due to poor understanding of its timeliness and relevance
  • lack of documentation and informative metadata to allow risk-based decisions on whether to use data

Archive or destroy

Once data is no longer in active use, the data owner should determine whether it should be archived (kept available and secure) or destroyed. Information about its quality should be stored with the data.

Potential data quality problems

The following data quality problems could happen at this stage:

  • integrity of data compromised by changes made to it after it is archived
  • loss of organisational knowledge about the data and its quality

Case study

The following case study provides an example of how an organisation has developed and implemented its own data lifecycle:

Office for National Statistics: The ONS Data Service Lifecycle

Data quality dimensions – how to measure your data quality

According to the Data Management Association (DAMA), data quality dimensions are “measurable features or characteristics of data”. They can be used to make assessments of data quality and identify data quality issues. They should be used alongside data quality action plans to assess and improve the quality of your data.

There are six core data quality dimensions, as defined by DAMA UK. This is not a prescriptive list and may vary depending upon your data and your users’ needs. For example, a seventh dimension may be added to measure the quality of any specialist data, or you may not consider certain dimensions relevant in your context. Other organisations define quality dimensions slightly differently. The European Statistical System, for example, defines a set of quality dimensions for statistical outputs in its Quality Assurance Framework (PDF, 915KB).

Core data quality dimensions

This section describes the six data quality dimensions as defined by DAMA UK, and provides examples of their application. These examples are taken (and sometimes adapted) from the DAMA UK Working Group “Defining Data Quality Dimensions” paper.

Completeness

Completeness describes the degree to which records are present.

For a data set to be complete, all records must be included and the most important data must be present in those records. This means that the data set contains all the records it should, and all essential values in each record are populated.

It is important not to confuse the completeness of data with its accuracy. A complete data set may have incorrect values in fields, making it less accurate.

Example of application

A school collects forms from parents on emergency contact telephone numbers.

There are 300 students, but 294 responses are collected and recorded.

294/300 x 100 = 98%.

The emergency contact telephone number field is therefore 98% complete. However, these phone numbers may not all be correct, so the telephone number field is not necessarily accurate.
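
As a minimal sketch of this calculation, assuming pandas is available and using a hypothetical emergency_contact field, completeness could be measured as the proportion of expected records with a populated value:

```python
import pandas as pd

# Hypothetical student records: None marks a missing emergency contact number
students = pd.DataFrame({
    "student_id": range(1, 301),
    "emergency_contact": ["07700 900123"] * 294 + [None] * 6,
})

# Completeness of a field = populated values / expected records
populated = students["emergency_contact"].notna().sum()
expected = len(students)
completeness = populated / expected * 100

print(f"Emergency contact completeness: {completeness:.0f}%")  # 98%
```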

Uniqueness

Uniqueness describes the degree to which there is no duplication in records. This means that the data contains only one record for each entity it represents, and each value is stored once.

Some fields, such as National Insurance number, should be unique. Some data is less likely to be unique, for example geographical data such as town of birth.

Example of application

A school has 120 current students and 380 former students (i.e. 500 in total).

The student database shows 501 different student records.

This includes Fred Smith and Freddy Smith as separate records, despite there being only one student at the school named Fred Smith.

This shows that the data set has a uniqueness across all records of 500/501 x 100 = 99.8%.
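
The sketch below, with hypothetical field names and assuming pandas, expresses the same calculation: uniqueness is the number of real-world entities represented divided by the number of records held. Note that an exact-duplicate check alone would not catch near-duplicates such as ‘Fred Smith’ and ‘Freddy Smith’; in practice, establishing the true number of entities usually requires data matching or deduplication rules.

```python
import pandas as pd

# Hypothetical student register: 501 records, but Fred Smith appears twice
# under slightly different names, so only 500 real students are represented.
records = pd.DataFrame({
    "student_id": list(range(1, 502)),
    "name": [f"Student {i}" for i in range(1, 500)] + ["Fred Smith", "Freddy Smith"],
})

true_entities = 500           # known number of real students (current and former)
total_records = len(records)  # 501 records held in the database

uniqueness = true_entities / total_records * 100
print(f"Uniqueness: {uniqueness:.1f}%")  # 99.8%

# An exact comparison finds no duplicate names, so it misses the Fred/Freddy pair
print(records.duplicated(subset="name").sum())  # 0
```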

Consistency

Consistency describes the degree to which values in a data set do not contradict other values representing the same entity. For example, a mother’s date of birth should be before her child’s.

Data is also consistent across sources if it does not contradict data in another data set: for example, if the date of birth recorded for the same person in two different data sets is the same.

Example of application

In a school, a student’s date of birth has the same value and format in the school register as that stored within the student database.
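
A simple consistency check compares values representing the same entity across sources. The sketch below is illustrative only, with hypothetical identifiers and dates, and assumes pandas:

```python
import pandas as pd

# Hypothetical extracts describing the same students in two systems
register = pd.DataFrame({
    "student_id": [1, 2, 3],
    "date_of_birth": ["2012-09-08", "2013-01-15", "2012-11-30"],
})
database = pd.DataFrame({
    "student_id": [1, 2, 3],
    "date_of_birth": ["2012-09-08", "2013-01-15", "2012-03-11"],  # student 3 differs
})

# Join on the shared identifier and flag contradictory values
merged = register.merge(database, on="student_id", suffixes=("_register", "_database"))
inconsistent = merged[merged["date_of_birth_register"] != merged["date_of_birth_database"]]

print(f"{len(inconsistent)} of {len(merged)} records are inconsistent")
print(inconsistent)
```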

Timeliness

Timeliness describes the degree to which the data is an accurate reflection of the period it represents, and the extent to which the data and its values are up to date.

Some data, such as date of birth, may stay the same whereas some, such as income, may not.

Data is timely if the time lag between collection and availability is appropriate for the intended use.

Example of application

A school has a service level agreement that a change to an emergency contact will be recorded within 2 days.

A parent gives an updated emergency contact number on 1 June.

It is entered into the student database on 4 June.

It has taken 3 days to update the system, which breaches the agreed data quality rule.
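
A timeliness rule like this can be checked by comparing the date a change was received with the date it was recorded. The dates below are illustrative:

```python
from datetime import date

SLA_DAYS = 2                  # agreed data quality rule: update within 2 days

received = date(2020, 6, 1)   # parent provides the new contact number
recorded = date(2020, 6, 4)   # change is entered into the student database

lag = (recorded - received).days
print(f"Update took {lag} days")                          # 3 days
print("SLA breached" if lag > SLA_DAYS else "Within SLA")
```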

Validity

Validity describes the degree to which the data is in the range and format expected. For example, date of birth does not exceed the present day and is within a reasonable range.

Valid data is stored in a data set in the appropriate format for that type of data. For example, a date of birth is stored in a date format rather than in plain text.

Example of application

Primary and junior school applications capture the age of a child. This age is entered into the database and checked to ensure it is between 4 and 11. Any values outside this range are rejected as invalid.
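
A minimal sketch of such a validation rule, with the age range taken from the example above and everything else hypothetical:

```python
def is_valid_age(value) -> bool:
    """Return True if the value is a whole number between 4 and 11."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return False  # non-numeric or missing values are invalid
    return 4 <= age <= 11

# Illustrative application values
for value in [7, 10, "four", 15, None]:
    status = "accepted" if is_valid_age(value) else "rejected as invalid"
    print(f"Age {value!r}: {status}")
```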

Accuracy

Accuracy describes the degree to which data matches reality.

Bias in data may affect accuracy. When data is biased, it is not representative of the entire population. Account for bias in your measurements where possible, and make sure that any bias in the data is communicated to your users.

In a data set, individual records can be measured for accuracy, or the whole data set can be measured. Which you choose to do should depend on the purpose of the data and your business needs.

Example of application

A school receives applications for its annual September intake and requires students to be aged 5 before 31 August of the intake year.

A parent from the USA completes the date of birth (D.O.B.) on the application in the US date format (MM/DD/YYYY) rather than the UK format (DD/MM/YYYY), so the day and month are transposed.

The student is accepted in error as the date of birth given is 09/08/YYYY rather than 08/09/YYYY.

The representation of the student’s D.O.B. – whilst valid in its US context – means that in the UK the age was not derived correctly, and the value recorded was consequently not accurate.
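
The sketch below illustrates the ambiguity: the same string parses to two different, equally valid dates depending on the format assumed, so a validity check alone will not catch the problem. The year shown is hypothetical.

```python
from datetime import datetime

raw = "09/08/2015"  # value entered on the application form (illustrative year)

as_uk = datetime.strptime(raw, "%d/%m/%Y")  # day first: 9 August 2015
as_us = datetime.strptime(raw, "%m/%d/%Y")  # month first: 8 September 2015

print(as_uk.date())  # 2015-08-09
print(as_us.date())  # 2015-09-08

# Both parse successfully, so the value is valid in either format; the recorded
# value is inaccurate because it does not reflect the student's real date of birth.
```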

User needs and trade-offs

Understanding user needs is important when measuring the quality of your data. Perfect data quality may not always be achievable, so focus on making the data as fit for purpose as it can be.

This may result in trade-offs between different dimensions of data quality, depending on the needs and priorities of your users. You should prioritise the data quality dimensions that align with your user and business needs.

For example, if the timeliness of a data set is the most important dimension for the user, this may come at the expense of the data set’s completeness, and vice versa.

It is important to communicate these trade-offs to the users of your data to avoid ambiguity and misuse of the data.

Trade-offs example

In 2018 the Office for National Statistics (ONS) introduced a new model for publishing Gross Domestic Product (GDP). This enabled monthly estimates of GDP to be published. However, there was a trade-off between timeliness and accuracy of the data.

This framework has supporting guidance. It provides a set of practical tools and techniques which can be used to assess, communicate and improve data quality.