© Crown copyright 2020
This publication is licensed under the terms of the Open Government Licence v3.0 except where otherwise stated. To view this licence, visit nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: email@example.com.
Where we have identified any third party copyright information you will need to obtain permission from the copyright holders concerned.
This publication is available at https://www.gov.uk/government/publications/recommended-open-standards-for-government/tabular-data-standard
The Open Standards Board recommends the RFC 4180 definition of CSV (Comma Separated Values) for publishing tabular data in government.
Use this standard, together with the guidance in this profile about character encoding, line breaks, header rows, and Byte Order Mark (BOM) characters.
1. Summary of the standard’s use for government
You can use this standard for publishing data, where your data has a variety of uses and is for many users. For example, open data published on data.gov.uk, or a reference table, such as a list of country names.
Users of the tabular data standard are:
- people who need to use a spreadsheet application to do basic analysis
- analysts using the data in statistical or business intelligence applications to perform interactive analysis
- data scientists writing data analysis software that downloads and processes the data, such as a reproducible analytical pipeline
- developers who process data in a range of software
- people who need to do quick searches for relevant data, before analysing the data with specialist tools
You should use more specific open standards for data if available. For example, you should publish grant data according to the 360Giving schema already selected by the Open Standards Board.
This standard does not cover non-tabular models of data. For example:
- multiple tables with relations, such as SQL
- hierarchical/flexible records, such as XML or JSON
- geographic, such as GeoJSON
- graph/triples, such as RDF
This standard does not cover extra information about the tabular data, such as the column types, or table contents validation. You should keep this type of information separate from a CSV file to avoid it conflicting with the data structure as described in RFC 4180.
Read more about the process of selecting this standard.
2. How this standard benefits users
The standard aims to reduce the common problems users face when using tabular data. For example, collapsing multiple header rows into a single header, or when tabular data is published in CSV but with a non-standard CSV dialect. Detecting the dialect is prone to errors, and often the parser needs to be manually configured to parse the data correctly.
The main benefits of the standard are:
- CSV files will open correctly with a wide selection of software tools
- consistent formatting for more efficient use of data analysts time
- government open data is easier for people to use
- that it is easier for government to work transparently
3. How to use this standard
There are various ways to implement the CSV format, but no formal specification. This standard uses a format that most implementations can follow.
The RFC 4180 format definition states that:
- there are 0 or 1 header rows (not multiple header rows, nor other metadata)
- after the header rows, each row is a record (not, for example, a “totals” of preceding rows; no blank rows)
- fields are separated by commas (not tabs or pipes)
- quotation with double quotes, always when needed (such as when a field contains a comma) and optionally when not
- all rows have the same number of fields (do not miss off trailing commas)
- line breaks are: “\r\n” (CRLF - Windows-style)
When using this standard you should follow these recommendations that change how to use RFC 4180.
RFC 4180 says you can use any encoding for non-ASCII characters, but the UK Government specifies UTF-8 for cross-platform character encoding.
RFC 4180 says that line endings (between records) must be CRLF. You can also use LF (“\n” - Unix-style), which follows with W3C Best Practice.
RFC 4180 accepts data with no header rows, but you should use one header row. Column headings are helpful for users to understand the meaning of the data. In a series of releases the header row is useful to show changes, such as columns changing order.
RFC 4180 does not include a Byte Order Mark (BOM). Excel for Windows usually inserts a BOM when saving a CSV file that contains non-ASCII characters, so you should remove the BOM before publishing. Also, when you’re using Excel for Windows to load a file that is UTF-8 without a BOM, you can resolve the display of non-ASCII characters by using the ‘import text’ option.