Guidance

Record information about data sets you share with others

Using metadata to make it easier to catalogue, validate, reuse and share your data.

When you create a spreadsheet, CSV file or other data in tabular form, you should create a record with information about your data and store it with your data. This information is called metadata. By doing this, you will:

  • make your data searchable
  • find it easier to catalogue and validate your data
  • make sure your data is accessible and reusable - often your data is reused even when you do not expect it to be

If you’re making your data open, refer to the guide on publishing your tabular data. All CSV files should comply with the Tabular data standard.

Who should use this guidance

Use this guidance if you are creating any data in tabular form that you intend to share. Data, in this instance, refers to data sets collected, used and maintained for analytics or for providing government services. It does not refer to finished documents.

You should use this guidance if your government organisation does not currently have metadata guidance for you to use. This guidance will become part of a collection to assist those already working with metadata.

Do not follow this guidance if you are creating, maintaining or managing metadata for geospatial data (data that references a location on the surface of the Earth). You should use GEMINI metadata for spatial data sets, including those covered by the INSPIRE regulations. You can also refer to the open standards profiles on ‘Exchange of location point’ and ‘Identifying property and street information’ for more details.

Using metadata in government

By following this guidance, you will be using a consistent metadata vocabulary which will improve interoperability across government. The metadata vocabulary in this guidance uses the Open Standards of schema.org and Dublin Core that are both recommended for government use.

If you are intending to publish your data, you should also read ‘Publishing tabular data’.

Where to record and store your metadata

When recording metadata, it’s useful to store this close to, or with, the data it’s describing.

You can do this by storing metadata:

  • within a data spreadsheet, in a separate tab
  • in a separate file, such as a readme file, and keep a record showing the link between data and metadata
  • in a Metadata Catalogue if your government organisation has one

When publishing your data, you will need to consider where you store your metadata depending on the types of data you are publishing and how findable you want your metadata to be. Read our guidance on ‘Publishing tabular data’ to understand more about how you publish metadata.
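As a minimal sketch of the 'separate file' option, the Python below writes a metadata record to a JSON file stored next to the data file. The function names and the '-metadata.json' suffix are illustrative assumptions, not requirements of this guidance. Your organisation may prefer a readme file or a catalogue entry instead.

```python
import json
from pathlib import Path


def metadata_path_for(data_path: Path) -> Path:
    """Return a sidecar path, for example data.csv -> data.csv-metadata.json.

    The suffix is an assumed naming choice for this sketch.
    """
    return data_path.with_name(data_path.name + "-metadata.json")


def write_metadata(data_path: Path, metadata: dict) -> Path:
    """Write the metadata next to the data file and return the new path."""
    target = metadata_path_for(data_path)
    target.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return target
```

Keeping the data file name inside the metadata file name is one simple way to record the link between the two.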

Making metadata machine readable and accessible

To make metadata machine readable and accessible, you must format your metadata in a specific way. For example, use camelCase, which is the practice of writing phrases so that there are no spaces between words and each word after the first begins with a capital letter, as in ‘dateCreated’.
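The camelCase rule can be sketched in Python as follows. The function name 'to_camel_case' is hypothetical, and the sketch assumes the phrase contains only plain letters and digits separated by spaces or punctuation.

```python
import re


def to_camel_case(phrase: str) -> str:
    """Join the words of a phrase with no spaces, capitalising every word except the first."""
    words = re.findall(r"[A-Za-z0-9]+", phrase)
    if not words:
        return ""
    first, rest = words[0], words[1:]
    return first.lower() + "".join(w[:1].upper() + w[1:].lower() for w in rest)


print(to_camel_case("date created"))   # dateCreated
print(to_camel_case("superseded by"))  # supersededBy
```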

When recording your metadata, make sure you use plain English and follow the writing for GOV.UK guide. For example, do not use jargon, and make sure you define technical terms and expand acronyms. Try to avoid using symbols that users might misinterpret.

When you do not have the information you need to record, you can still add the metadata, but add “unknown” when relevant.
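One possible way to apply this is to fill any missing or empty term with “unknown” before saving the record. This is a sketch only: the function name and the list of terms checked are assumptions chosen for illustration, using terms described later in this guidance.

```python
# Terms to check: an illustrative selection, not a required list.
TERMS_TO_CHECK = [
    "creator",
    "dateCreated",
    "name",
    "description",
    "identifier",
    "encodingFormat",
    "license",
]


def fill_unknown(metadata: dict, terms=TERMS_TO_CHECK) -> dict:
    """Return a copy of the record with missing or blank terms set to 'unknown'."""
    completed = dict(metadata)
    for term in terms:
        value = completed.get(term)
        if value is None or str(value).strip() == "":
            completed[term] = "unknown"
    return completed
```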

Metadata you should record

You should record information that will help others:

  • be informed on where and when your data was collected - use ‘creator’ and ‘dateCreated’ to record who created the data set and the date they created it

  • find the data you’ve saved on a shared network, and identify whether it’s the data set they need - use ‘name’, ‘description’ and ‘identifier’ to describe your data

  • validate the data you’ve collected - use ‘expires’ and ‘supersededBy’ so users know which version of your data to use, ‘temporalCoverage’ to indicate the time period to which your data applies, and ‘conformsTo’ to tell users whether your file applies to a specific standard or schema

  • use the data you’ve collected appropriately - use ‘hasDigitalDocumentPermission’ to make sure users do not share sensitive data in ways it should not be shared and ‘license’ to help users understand their rights to using the data you’ve collected

  • understand the structure and format of your CSV tabular data - use the CSV on the Web open standard and read our guidance on ‘Publishing tabular data’ to get started

Try to avoid recording any metadata that includes personal data. If you include personal data, you will need to comply with the principles, rights and obligations contained in GDPR. You can read the Guide to the General Data Protection Regulation for more information.

Recording dates in your metadata

You must record any dates using the ISO 8601 standard, which is an Open Standard selected for use by the government.

This means listing the date and time elements in descending order of size (years, months, days, hours, minutes, seconds, milliseconds and microseconds). You should provide the right level of accuracy for your data set. If you publish your data set once a year, it might be enough to provide a date down to the day, such as 2020-07-14. If you publish multiple times a day, it is better to include information down to the second, such as 2020-07-14T12:57:03Z.
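Both levels of precision can be produced with Python's standard 'datetime' module. This is a sketch: the helper name 'to_iso_second' is hypothetical, and it assumes you want times expressed in UTC with the 'Z' suffix.

```python
from datetime import date, datetime, timezone


def to_iso_second(moment: datetime) -> str:
    """Format a timezone-aware datetime as ISO 8601 in UTC, to the second."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Day precision
print(date(2020, 7, 14).isoformat())  # 2020-07-14

# Second precision
print(to_iso_second(datetime(2020, 7, 14, 12, 57, 3, tzinfo=timezone.utc)))  # 2020-07-14T12:57:03Z
```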

Record the provenance of your data

Using ‘creator’ or ‘contributor’

You should record who created a data set so users can communicate with the creator and understand if the data is relevant to them. For example, a data analyst may want to find out how reliable a data set is before undertaking any analysis.

Record a name for future reference, and an email address if possible. This name and email address should refer to:

  • the name of a team or organisation
  • a role within a team
  • an individual name in some cases - if you can do this while remaining GDPR compliant

For example, creator:“Data Standards Authority team data-standards-authority@digital.cabinet-office.gov.uk”

You can use ‘contributor’ instead if multiple organisations or teams are contributing to the data set. You can also use ‘creator’ and ‘contributor’ together for full clarity around where data has come from.

Using ‘dateCreated’

You should record the date when you create a data set to help users of the data set know whether it is valid and relevant to them. You must record the date using the Open Standard ISO 8601.

For example, dateCreated:“2002-10-02”

You must capture the exact time a data set is collected when you’re collecting more than one version of a data set a day.

Help users find, use and identify your data set

Using ‘name’

You must include the name of your data set so users can find and identify the right data set.

You should try to make sure the name captures information that will help users determine whether the data set meets their needs. For example, by capturing the topic and specific information about place and geography.

For example, name:“GDS London Office Employees office commuting tendencies”

Using ‘description’

You can add a description to your data set, in addition to the name, so that users of your data can find out if it’s relevant to them.

The descriptions of your data should only describe the type of data collected and should not include warnings about how to use the data - any warnings should be explained with the term ‘accessRights’.

For example, description:“The amount GDS employees commute to the office and their busiest times to travel. This data also shows the tendencies of GDS employees to work from home”

Using ‘identifier’

You should uniquely identify your data set so that users of your data know exactly which source they’re using.

You should identify your data set by:

  • using the identification system your organisation is using (in cases where organisations have a system in place)

  • using a meaningless identifier you’ve created - this should be random numbers rather than sequential or semi-sequential numbers to avoid meaning being implied

Using a meaningless identifier avoids the misunderstanding that comes with applying meaning to identifiers. For example, meaning can change over time. Meaningless identifiers can remain genuinely constant.

For example, identifier:“362857580”

You can ensure this meaningless identifier stays unique by keeping a catalogue of all data sets with their identifiers.
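A random identifier that is checked against a catalogue could be generated like this. It is a sketch under stated assumptions: the catalogue is modelled as an in-memory set of identifiers, and the nine-digit length and the function name 'new_identifier' are illustrative choices.

```python
import secrets


def new_identifier(catalogue: set[str], digits: int = 9) -> str:
    """Create a random numeric identifier that is not already in the catalogue."""
    while True:
        candidate = "".join(secrets.choice("0123456789") for _ in range(digits))
        if candidate not in catalogue:
            # Record the identifier so it cannot be issued twice.
            catalogue.add(candidate)
            return candidate
```

In practice the catalogue would be a shared, persistent record rather than a set held in memory.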

Using ‘encodingFormat’

You should record the file format in which you store your data so users know how to use and import it.

Operating systems commonly use file extensions to decide which program to open a file with. Common file extensions include XLS for Excel spreadsheets and CSV for comma-separated values files.

For example, encodingFormat:“xls”

If you think you may publish your data set, you can also record the media type. Media types are used by browsers to decide how to present some data. For more information, read ‘Publishing tabular data’.

A media type is also known as a Multipurpose Internet Mail Extensions (MIME) type. Mozilla keeps a list of common MIME types.

For example, encodingFormat:“text/csv”
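To illustrate the difference between a file extension and a media type, the sketch below maps a few common tabular file extensions to their registered media types. The lookup table covers only these four formats, and the function name 'media_type_for' is hypothetical.

```python
from pathlib import Path

# A small, deliberately incomplete lookup of tabular formats.
MEDIA_TYPES = {
    ".csv": "text/csv",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
}


def media_type_for(filename: str) -> str:
    """Return the media type for a file name, or 'unknown' if it is not in the table."""
    return MEDIA_TYPES.get(Path(filename).suffix.lower(), "unknown")
```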

Help others validate your data

Using ‘supersededBy’

When the data you’re collecting replaces an older version, you should record this change to make sure users use the most up-to-date version.

You must only use ‘supersededBy’ when the data you’re collecting has:

  • the same period of time and location as the older version of the spreadsheet or file

  • different content to the older version of the spreadsheet or file

The new version of the data will need its own unique URL or identifier.

For example, supersededBy:“https://www.gov.uk/government/organisations/government-digital-service/about/v1”

You may also choose ‘isRelatedTo’ as a more generic term that can account for any kind of relationship between resources.

Using ‘supersedes’

You can use ‘supersedes’ as an additional property as this will allow users to understand the history or timeline of a document.

For example, supersedes:“https://www.gov.uk/government/organisations/government-digital-service/about/v1”
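The two properties mirror each other, so it can help to set them together. The sketch below assumes each metadata record is a dictionary with an ‘identifier’, and it refuses to link records whose ‘temporalCoverage’ values differ, as one rough check of the ‘same period of time’ condition. The function name and the identifiers in the usage example are hypothetical.

```python
def link_versions(old: dict, new: dict) -> None:
    """Record that `new` replaces `old` by setting both properties."""
    old_period = old.get("temporalCoverage")
    new_period = new.get("temporalCoverage")
    if old_period and new_period and old_period != new_period:
        raise ValueError("records cover different periods, so one does not supersede the other")
    old["supersededBy"] = new["identifier"]
    new["supersedes"] = old["identifier"]


older = {"identifier": "362857580", "temporalCoverage": "2002-10-02/2013-01-01"}
newer = {"identifier": "918273645", "temporalCoverage": "2002-10-02/2013-01-01"}
link_versions(older, newer)
```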

Using ‘expires’

If you’re no longer using a particular data set, or it has been superseded by other data, you should record it as expired. You can do this by adding the date for when your data set is no longer valid. You should give any replacement data a new title and identifier.

When you become aware that a data set should no longer be used, remember to revisit it and update its metadata.

For example, expires:“2003-12-04”

Do not use this to record the period of time that applies to your data. In these cases, you should use ‘temporalCoverage’ instead.
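A user of your data could check the ‘expires’ value before relying on a data set. The sketch below assumes the value is an ISO 8601 date, or starts with one, and treats the data set as expired from that date onwards. That cut-off, and the function name 'is_expired', are assumptions made for illustration.

```python
from datetime import date


def is_expired(metadata: dict, today: date) -> bool:
    """Return True if the record has an 'expires' date on or before `today`."""
    expires = metadata.get("expires")
    if not expires or expires == "unknown":
        return False
    # Keep only the YYYY-MM-DD part in case a time is also recorded.
    return date.fromisoformat(expires[:10]) <= today
```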

Using ‘temporalCoverage’

If you’re collecting data over a range of dates, you should record this so users know the period that the content applies to. You should add this using the Open Standard ISO 8601.

For example, temporalCoverage:“2002-10-02/2013-01-01”

If your data does not have a specified end date, you can use “..” in place of the end date. This follows schema.org date standards.

For example, temporalCoverage:“2020-10-02/..”
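The interval format, including the open-ended form, can be written and read back with a pair of small helpers. This is a sketch that assumes day-level precision only. The function names are hypothetical.

```python
from datetime import date
from typing import Optional


def temporal_coverage(start: date, end: Optional[date] = None) -> str:
    """Format a start and optional end date as 'start/end', using '..' for an open end."""
    end_text = end.isoformat() if end else ".."
    return f"{start.isoformat()}/{end_text}"


def parse_temporal_coverage(value: str) -> tuple[date, Optional[date]]:
    """Split 'start/end' back into dates; an open end is returned as None."""
    start_text, end_text = value.split("/")
    end = None if end_text == ".." else date.fromisoformat(end_text)
    return date.fromisoformat(start_text), end


print(temporal_coverage(date(2002, 10, 2), date(2013, 1, 1)))  # 2002-10-02/2013-01-01
print(temporal_coverage(date(2020, 10, 2)))                    # 2020-10-02/..
```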

Using ‘conformsTo’

You should tell users whether your file conforms to a specific standard or schema so they can easily validate it. This could be the CSV on the Web schema or the RFC 4180 standard.

For example, conformsTo:“https://tools.ietf.org/html/rfc4180”

Your department, agency or local authority might also use a particular schema for specific types of data collection, and you may want to record this. For example, the data standards for publishing brownfield land registers.

Make sure your data is used appropriately

Using ‘license’

For protected data such as personal, sensitive or commercial data, you should record information that will help users of the data understand its terms and conditions.

You may want to include the relevant data sharing agreement, legal regulation or certification. This could be a memorandum of understanding (MOU) or Data Protection Impact Assessment.

The open standards vocabularies schema.org and Dublin Core both spell the noun ‘licence’ using the American spelling ‘license’. You should use ‘license’ for consistency.

For example, license:“Memorandum of Understanding between the Charity Commission for England and Wales and the Office for Students”

When publishing open data, you should label the data you’ve collected with its licence for use. In many cases within government, this will be the Open Government Licence (OGL). You should also link to the licence file to explain what the licence means and how others can use your code and content.

For example, license:“https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/”

Using ‘hasDigitalDocumentPermission’

You should record the sensitivity of your data so it’s not shared or published in ways it should not be.

You should provide information about who should be able to access the data you’ve collected, and any restrictions including:

  • whether it’s open or restricted/protected

  • the handling caveat for the data

  • the security classification of data

For example, hasDigitalDocumentPermission:“restricted access”

Published 7 August 2020