Guidance

Approval standards and guidelines: data specification

Updated 15 September 2023

Approval standard: data specification

When must this standard be met

This standard must be met for all applications to access UKHSA data classified as ‘Protected’.

Standard

1. All applications must include a data specification which describes in a precise and understandable manner the data necessary from UKHSA for the conduct of the project:

  • where a UKHSA data dictionary is available for the data, you must use it to capture the required data specification
  • the data specification must identify provenance of the data of interest (the UKHSA system or systems which the data would be obtained from) – where multiple tables or data sources are required to form a relational database, the provenance of each data set must be identified and the table labelled accordingly
  • the data specification must list all data elements (the data items used within data sets) from the data set or data sets of interest using the clinical terminology and classification systems of the data set (herein ‘controlled vocabulary’ or ‘controlled vocabularies’)
  • the data specification must identify the format of the data (record-level or aggregated)
  • the data specification must identify the population or sampling frame of interest, and where available, use controlled vocabularies for classification and indexing of data subjects. The use of vague terms or references must be avoided
  • the data specification must, at a table-level, include any parameters that should be applied to restrict the data to the absolute minimum
  • the data specification must include any reference tables that will be used to apply data minimisation – all reference tables must be presented using the controlled vocabulary of the system and must be supplied in a comma-separated values (CSV) file
  • the data specification must include spatial, temporal or spatiotemporal parameters, for example the specific dates of interest for the sample of interest. It is recommended that any dates conform with the ISO standard: yyyy-mm-dd
  • where derived data elements are requested, the data specification must include relevant metadata that is descriptive of the technical processes that should be used to produce the derived field – for example, where a time interval variable is requested (measure of time between 2 date fields), the fields needed to derive it must be specified
  • where record-level data is requested, the data specification must include relevant person or object identifiers which contain values that uniquely identify each row in the data – where multiple tables are requested and such tables are relational, the data specification must provide the primary and foreign keys
  • the data specification must be consistent with other evidence supplied as part of the application

2. All applications must clearly justify the scope, scale and complexity of the data being requested by including a brief justification alongside each data element which specifies why the item is necessary for the project. This justification of each data element must also be consistent with the overall justification for processing UKHSA data provided in the protocol.

3. The data specification must document only the data which are adequate, relevant and necessary for the purposes for which they are to be processed (data minimisation). Where options are available to limit the identifiability or granularity of a data element, data minimisation approaches must be applied.

4. Where data linkage is required to merge person-level data processed by the applicant to UKHSA person-level data, the data specification must describe the data elements that will be used for linkage and must demonstrate to UKHSA the linkage will be appropriately sensitive and specific:

  • the data specification must identify the provenance of the data elements to be used for the data linkage
  • each data element must be defined using the controlled vocabularies of the system and justified
  • each data element must be unique and persistently identifying
  • the linkage methodology must be described in the project methods and/or data management plan in the protocol
  • the linkage must use the minimum data required
  • the linkage must have a clear and extant legal basis
  • where data linkage is required to access patient confidential data, NHS numbers will be required – if you do not have access to NHS numbers, you must contact UKHSA to discuss

5. Where multiple linkage files will be sent to UKHSA, the data specification must set out how the linkage files will be standardised before disclosure to UKHSA.

6. Where the data is requested on a periodic basis, the frequency of data transfers must be justified.

Guidelines

Article 5(1) of the UK General Data Protection Regulation (UK GDPR) requires that personal data shall be:

(c) adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed (data minimisation)

It is therefore a requirement that any application for protected data demonstrates the data requested will be:

  • adequate – sufficient to properly fulfil your stated purpose (the aims, objectives and methodologies presented in your scientific protocol; see Approval standards and guidelines: scientific protocol for further information)
  • relevant – has a rational link to that purpose
  • limited to what is necessary – no more than you need for that purpose

You must evidence this through producing a detailed data specification in accordance with this approval standard.

Where UKHSA has published a data dictionary specification for the data of interest, its template will guide you through the information you must supply to comply with the standard. Where one is not available, it is recommended that you provide a standalone document using the practices recommended in these guidelines.

UKHSA does not recommend including your data specification in the protocol, but it should be clear why the data you have requested is needed for your project by virtue of the explanation and justification of it given in the protocol.

Should the application be favourably reviewed, unless otherwise agreed the data will be shared in a CSV file that can be exported into MS Excel, SAS, SPSS, or an ASCII file.

Understanding the availability of UKHSA data

UKHSA publishes data dictionaries to enable you to identify and document your data requirements. Each data dictionary contains a detailed technical description of the data set, such as its format, structure and data elements. The main features include:

  • the name of each data element using the controlled vocabulary of the system
  • the definition of the data element
  • a description of the valid values for that data element, or where relevant, the location of the controlled vocabulary or reference tables
  • primary and foreign keys that could be used for linkage between records

Changes to the data dictionaries

Where changes are made to an existing data dictionary, these will be identified in a version log. The log will record if the change is an addition, refinement, discovered redundancy, or minor name change.

Selecting data elements

Should a UKHSA data dictionary not be available for the data set or data sets of interest, it is recommended that your data specification is formatted to provide a clear and structured narrative of the data requested. As a minimum, it is recommended that your data specification includes:

  • the title of each data set, explaining the provenance of the data
  • the data element name (in accordance with the controlled vocabularies of the data set it is to be obtained from)
  • the description of the data element
  • the justification for requesting the data element
  • a description of the data as either record-level or aggregate
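The minimum content listed above can be sketched as a simple record structure with a completeness check. This is only an illustration: the class name `DataElement`, its field names and the `find_gaps` helper are assumptions made for this sketch and are not part of any UKHSA template.

```python
from dataclasses import dataclass

@dataclass
class DataElement:
    dataset: str        # title of the data set (provenance)
    name: str           # element name in the data set's controlled vocabulary
    description: str    # description of the data element
    justification: str  # why the element is necessary for the project
    data_format: str    # "record-level" or "aggregate"

REQUIRED_FIELDS = ("dataset", "name", "description", "justification", "data_format")
ALLOWED_FORMATS = {"record-level", "aggregate"}

def find_gaps(elements):
    """Return (element name, problem) pairs for incomplete entries."""
    gaps = []
    for element in elements:
        label = element.name.strip() or "<unnamed>"
        for field_name in REQUIRED_FIELDS:
            if not getattr(element, field_name).strip():
                gaps.append((label, f"missing {field_name}"))
        fmt = element.data_format.strip()
        if fmt and fmt not in ALLOWED_FORMATS:
            gaps.append((label, f"unknown format {fmt!r}"))
    return gaps

# Hypothetical draft specification with one justification left blank.
draft = [
    DataElement("CTAD Chlamydia Surveillance System", "Testing_Service_Type",
                "The type of service or setting providing chlamydia testing.",
                "Needed to compare testing routes.", "record-level"),
    DataElement("CTAD Chlamydia Surveillance System", "Specimen_Date",
                "Date specimen taken.", "", "record-level"),
]
print(find_gaps(draft))
```

A check of this kind only confirms that each entry is filled in; it cannot judge whether a justification is adequate.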

Example 1. An illustrative example of data requested from 2 UKHSA data sets: CTAD Chlamydia Surveillance System (Table 1) and GUMCAD (Table 2)

In this example, the planned project will evaluate the demographics of people accessing self-sampling tests for Chlamydia trachomatis infections in England. The protocol for this project describes the following aims:

  1. To compare the proportion of asymptomatic screens and symptomatic tests conducted through self-sampling to community-based services (such as GPs or pharmacies).
  2. To compare the proportion of asymptomatic screens and symptomatic tests conducted through self-sampling to community-based services (such as GPs or pharmacies) by different demographic characteristics (age, ethnicity).
  3. To look at temporal trends in uptake of self-sampling at MSOA unit level.

Table 1. CTAD Chlamydia Surveillance System. Record-level data is requested for all chlamydia tests undertaken in England from NHS laboratories, local authorities and NHS commissioned laboratories between 1 January 2020 and 31 December 2020

Data element | Description of the data element | Justification for requesting
PatientID (project specific pseudonymised) | A unique number used to identify a patient. By default, this ID will be pseudonymised on a project specific basis by UKHSA. | Will be used for linkage between Table 1 and Table 2, to associate demographic data in GUMCAD with the type of testing services. This will allow the study to look at testing by different demographic characteristics (age, sex, ethnicity).
Test_Identifier (project specific pseudonymised) | A unique identifier of the chlamydia test performed. By default, this ID will be pseudonymised on a project specific basis by UKHSA. | Will be used to identify each unique test per individual and whether the availability of self-sampling has changed behaviours over time.
Testing_Service_Type | The type of service or setting providing chlamydia testing. | UKHSA published literature has identified that tests from internet services have increased by 34.4% in 2020. We will expand this analysis to evaluate the proportion of Chlamydia testing done via the internet compared to other services for different demographic groups.
Specimen_Date | Date specimen taken. | Will be used for temporal analysis of tests by Testing_Service_Type.

Table 2. GUMCAD STI surveillance. Record-level data is requested on all people who are identified in Table 1

Data element | Description of the data element | Justification for requesting
PatientID (project specific pseudonymised) | A unique number used to identify a patient. By default, this ID will be pseudonymised on a project specific basis by UKHSA. | Will be used for linkage between Table 1 and Table 2.
Age | Age at attendance – derived as the number of completed years between the patient’s date of birth and consultation date. | Aim 2 sets out to evaluate testing uptake by sociodemographic characteristics. Our analysis plan describes that we will look at number and rates of tests by age in single years and by ethnicity.
Ethnicity | The patient’s ethnicity as stated by the patient. | Aim 2 sets out to evaluate uptake by sociodemographic characteristics. The analysis plan describes that we will look at number and rates of tests by age and by ethnicity.
MSOA (derived) | Middle Layer Super Output Areas. To be derived by UKHSA from LSOA using ONS output area lookup. | CTAD data does not include patient residence data, therefore, to conduct spatial analysis of uptake of Chlamydia testing, MSOA is sought from GUMCAD. This should be derived from LSOA using ONS groupings. We will use MSOA to look at geographical differences in uptake of different test services.
Consultation_Symptomatic | Does the patient have symptoms of an STI (Yes or No)? | Aims 1 and 2 evaluate differences in symptomatic versus asymptomatic service users. This element will be used to understand differences in uptake by whether the patient presents with symptoms.

As this example utilises relational tables, you will see that there is a ‘PatientID’ in both tables which allows linkage between them. Where row-level data is requested from multiple tables that can be linked together, ensure that each table includes the appropriate primary and foreign keys.
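The role of the keys in Example 1 can be sketched as two simple checks: that a key intended to identify each row is unique, and that every linking key in one table has a match in the other. The helper names and the sample identifier values below are hypothetical and chosen only for illustration; only the column names `PatientID` and `Test_Identifier` come from Example 1.

```python
from collections import Counter

def duplicate_values(rows, column):
    """Values of `column` that occur more than once (so cannot identify a row)."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def unmatched_values(rows, other_rows, column):
    """Values of `column` present in `rows` but absent from `other_rows`."""
    return sorted({r[column] for r in rows} - {r[column] for r in other_rows})

# Hypothetical extracts; the identifiers are made up for this sketch.
table_1 = [  # one row per test
    {"PatientID": "P001", "Test_Identifier": "T001"},
    {"PatientID": "P001", "Test_Identifier": "T002"},
    {"PatientID": "P002", "Test_Identifier": "T003"},
]
table_2 = [  # demographic rows to be linked on PatientID
    {"PatientID": "P001"},
    {"PatientID": "P003"},
]

print(duplicate_values(table_1, "Test_Identifier"))    # tests that share an identifier
print(unmatched_values(table_2, table_1, "PatientID")) # Table 2 rows with no test
print(unmatched_values(table_1, table_2, "PatientID")) # tests with no demographics
```

Here `PatientID` repeats in Table 1 because a patient may have several tests, so the sketch treats `Test_Identifier` as the row identifier for that table and `PatientID` as the linking key.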

Inclusion and exclusion criteria

You must provide a clear definition of the inclusion and exclusion criteria to identify the population or sample frame of interest, using the controlled vocabulary of the system and avoiding vague terms or references.

It is important that the inclusion and exclusion criteria are precisely defined so that UKHSA can fully understand the population or persons you are interested in and their relevance to your purpose or purposes.

It is recommended that you consider the baseline characteristics of the data subjects, such as age, sex, location, ethnicity, study-specific measures (for example, systolic blood pressure, prior antibiotic treatment) and the presence or absence of other medical, psychosocial, or emotional conditions.

An example of how to draft your inclusion and exclusion criteria is provided in Example 2. In this example, a research team wishes to enrol new participants into a randomised controlled trial to test a new drug. They want to contact adult patients who have recently been diagnosed with coronavirus (COVID-19) infection and were diagnosed with a heart attack up to 12 months before the onset of their COVID-19 infection. They do not want to cause distress to bereaved families by attempting to contact people who are deceased.

In this example, the inclusion and exclusion criteria provided in column 1 would not provide sufficient detail to meet the approval standard. This is because the language is not specific enough to understand the exact requirements of the study and may lead to different interpretations.

In column 2, the specificity of the inclusion and exclusion criteria has been improved to provide sufficient information to identify the correct data subjects. This includes a clear definition of the age range the study is interested in and a clinical definition of ‘heart attack’ using the controlled vocabulary of the data set.

Example 2. An illustrative example of inclusion and exclusion criteria to be applied to identify participants for a randomised controlled trial

Column 1: Insufficient detail

Inclusion criteria:
• adult patients with previous diagnosis of a heart attack and recently infected with COVID-19, confirmed by a lab test

Exclusion criteria:
• deceased patients

Column 2: Sufficient detail

Inclusion criteria:
• adult patients aged 45 years or over and 75 years or under at symptom_onset_date
• diagnosed with PCR-confirmed COVID-19 infection less than 12 months after a diagnosis of myocardial infarction (MI) in HES Admitted Care, where MI is defined as I21.3, I21.4, I21.9, I22, I22.0, I22.1, I22.8, I22.9, I23, I23.0, I23.1, I23.2, I23.3, I23.4, I23.5, I23.6 or I23.8
• where COVID-19 infection confirmed between 2020-06-01 and 2021-06-01 using SPECIMEN_DATE
• COUNTRY_CODE = E (resident in England at diagnosis)

Exclusion criteria:
• deceased patients
• patients aged 44 years or under, or 76 years or over
• MI greater than 365.25 days prior to symptom_onset_date
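One benefit of criteria written at the level of detail in column 2 is that they can be expressed as an unambiguous filter. The sketch below shows this under stated assumptions: `symptom_onset_date`, `SPECIMEN_DATE`, `COUNTRY_CODE`, the ICD-10 codes and the date window come from Example 2, while the field names `age_at_onset`, `mi_date`, `mi_code` and `deceased`, and the sample records, are hypothetical.

```python
from datetime import date

# ICD-10 codes defining MI, as listed in Example 2.
MI_CODES = {
    "I21.3", "I21.4", "I21.9", "I22", "I22.0", "I22.1", "I22.8", "I22.9",
    "I23", "I23.0", "I23.1", "I23.2", "I23.3", "I23.4", "I23.5", "I23.6", "I23.8",
}
WINDOW_START = date(2020, 6, 1)
WINDOW_END = date(2021, 6, 1)
MAX_MI_GAP_DAYS = 365.25

def is_eligible(person):
    """Apply the column 2 inclusion and exclusion criteria to one record."""
    # Days from MI diagnosis to symptom onset; assumed here to be non-negative.
    gap_days = (person["symptom_onset_date"] - person["mi_date"]).days
    return (
        45 <= person["age_at_onset"] <= 75
        and person["mi_code"] in MI_CODES
        and 0 <= gap_days <= MAX_MI_GAP_DAYS
        and WINDOW_START <= person["SPECIMEN_DATE"] <= WINDOW_END
        and person["COUNTRY_CODE"] == "E"
        and not person["deceased"]
    )

# Hypothetical records for illustration only.
included = {
    "age_at_onset": 62, "mi_code": "I21.9", "mi_date": date(2020, 1, 15),
    "symptom_onset_date": date(2020, 9, 1), "SPECIMEN_DATE": date(2020, 9, 3),
    "COUNTRY_CODE": "E", "deceased": False,
}
excluded = dict(included, mi_date=date(2019, 1, 1))  # MI more than 365.25 days earlier

print(is_eligible(included), is_eligible(excluded))
```

The criteria in column 1 could not be coded this way without guessing what 'adult', 'recently' and 'heart attack' mean, which is why they would not meet the approval standard.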

Reference tables

Where the data is to be limited to specific events, reference tables must be included in your application detailing these events in accordance with the relevant clinical coding system. All reference tables should be supplied using CSV files.

Example 3 is an illustrative example of OPCS-4.8 clinical classification codes to restrict the episodes within Hospital Episode Statistics (HES) data provided to episodes containing certain types of hip joint replacements. In this example, the file contains the code in column 1 and a description of the code in column 2.

It is recommended that the code and a meaningful description are the minimum information provided for a reference table.

Example 3. Reference table of OPCS-4.8 classification codes provided to restrict episodes to those containing the relevant procedures

OPCS-4.8_code | Procedure_desc
W37.1 | Primary total prosthetic replacement of hip joint using cement
W37.8 | Other specified total prosthetic replacement of hip joint using cement
W37.9 | Unspecified total prosthetic replacement of hip joint using cement
W38.1 | Primary total prosthetic replacement of hip joint not using cement
W38.8 | Other specified total prosthetic replacement of hip joint not using cement
W38.9 | Unspecified total prosthetic replacement of hip joint not using cement
W39.1 | Primary total prosthetic replacement of hip joint NEC
W39.8 | Other specified other total prosthetic replacement of hip joint
W39.9 | Unspecified other total prosthetic replacement of hip joint
W46.1 | Primary prosthetic replacement of head of femur using cement
W46.8 | Other specified prosthetic replacement of head of femur using cement
W46.9 | Unspecified prosthetic replacement of head of femur using cement
W47.1 | Primary prosthetic replacement of head of femur not using cement
W47.8 | Other specified prosthetic replacement of head of femur not using cement
W47.9 | Unspecified prosthetic replacement of head of femur not using cement
W48.1 | Primary prosthetic replacement of head of femur NEC
W48.8 | Other specified other prosthetic replacement of head of femur
W48.9 | Unspecified other prosthetic replacement of head of femur
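As a rough illustration of how a reference table like Example 3 might be supplied as CSV and then used to restrict episodes, consider the sketch below. It uses only three of the codes from Example 3 for brevity. The episode records, the field name `procedure_code` and the helper names are assumptions for this sketch and do not describe how UKHSA processes reference tables.

```python
import csv
import io

# A short extract of the Example 3 reference table in CSV form:
# code in column 1, description in column 2.
REFERENCE_CSV = """OPCS-4.8_code,Procedure_desc
W37.1,Primary total prosthetic replacement of hip joint using cement
W38.1,Primary total prosthetic replacement of hip joint not using cement
W39.1,Primary total prosthetic replacement of hip joint NEC
"""

def load_reference_codes(csv_text):
    """Read the reference table into a {code: description} mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["OPCS-4.8_code"]: row["Procedure_desc"] for row in reader}

def restrict_episodes(episodes, codes):
    """Keep only episodes whose procedure code appears in the reference table."""
    return [episode for episode in episodes if episode["procedure_code"] in codes]

# Hypothetical episodes for illustration only.
episodes = [
    {"episode_id": 1, "procedure_code": "W37.1"},
    {"episode_id": 2, "procedure_code": "A01.1"},
    {"episode_id": 3, "procedure_code": "W39.1"},
]

codes = load_reference_codes(REFERENCE_CSV)
print([e["episode_id"] for e in restrict_episodes(episodes, codes)])
```

Supplying the codes as data rather than describing them in prose means the restriction can be applied exactly as written.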

UKHSA understands there will be some circumstances where medical terminology and clinical coding classifications evolve over time. It is recommended that applicants seek advice to ensure that their requirements can be interpreted against the current version of the relevant terminology or classification.

Derived fields

Derived fields are data elements or values which do not exist directly in the data source but are calculated from one or more data elements which do. Some fields are derived internally to offer data minimisation approaches such as:

  • age bands supplied instead of precise age – these bandings may be in the standard 5 year banding or broader groups (aggregation)
  • dates (such as date of birth) supplied as month and year only or offset by plus or minus a random number of days (partial data removal or offsetting)
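The minimisation approaches listed above can be sketched in a few lines. This is an illustration only: the function names, the band labels and the 14-day offset limit are assumptions for this sketch, not UKHSA's internal derivation rules.

```python
import random
from datetime import date, timedelta

def five_year_band(age):
    """Aggregate a precise age into a 5 year band, for example 47 -> '45-49'."""
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"

def month_and_year(d):
    """Partial data removal: keep only the year and month of a date."""
    return d.strftime("%Y-%m")

def offset_date(d, max_days, rng):
    """Offsetting: shift a date by a random number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))

date_of_birth = date(1975, 3, 14)  # hypothetical value
rng = random.Random(0)             # seeded only to make the sketch repeatable

print(five_year_band(47))
print(month_and_year(date_of_birth))
print(offset_date(date_of_birth, 14, rng))
```

Each function reduces identifiability in a different way, so the data specification should state which approach, and which parameters, the project needs.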

Where the data specification for your project includes derived fields, you must provide sufficient information of the required technical processes to produce the derived field.

For example, when documenting details about a time interval you must provide the 2 date fields needed to calculate the interval and the unit measure the calculated field should be presented in. Example 4 provides an illustrative example of a calculated survival interval.

Example 4. Illustrative example of how to present information on the technical processes to be followed to generate a derived field, in this case, a survival interval

Data element | Description of the data element | Justification for requesting
Derived_Survival_Interval | Calculated interval using symptom_onset_date and death_date. Output to be expressed as a number of days. | To produce a survival curve to look at survival in the cohort.
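The derivation described in Example 4 can be sketched as follows, assuming both dates are supplied in the ISO format yyyy-mm-dd. The function name and the handling of missing or out-of-order dates are assumptions for this sketch; a real specification would need to state how such cases should be treated.

```python
from datetime import date

def survival_interval_days(symptom_onset_date, death_date):
    """Number of days from symptom onset to death, or None if no death date."""
    if death_date is None:
        return None  # assumption: no death recorded, so no interval is derived
    if death_date < symptom_onset_date:
        raise ValueError("death_date is earlier than symptom_onset_date")
    return (death_date - symptom_onset_date).days

# Hypothetical dates for illustration only.
onset = date.fromisoformat("2021-01-10")
death = date.fromisoformat("2021-03-01")
print(survival_interval_days(onset, death))  # 50
```

Stating the two source fields and the unit (days) in the data specification removes any ambiguity about how the interval is calculated.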