How to choose data tools and infrastructure that are flexible, scalable, sustainable and secure.
There are 4 main areas to consider:
- Choose analytical tools that keep pace with user needs.
- Use the cloud for the whole development cycle.
- Use appropriate security when using data in the cloud.
- Choose open data standards for better interoperability.
1. Choose analytical tools that keep pace with user needs
As data analysis and data science evolve, you should choose tools and techniques that can adapt and support best practices. The UK Statistics Authority’s Code of Practice for Statistics provides information on best practices including:
- keeping up to date with innovations that can improve statistics and data
- improving data presentation for their users
- testing and releasing new official statistics
Government workers responsible for providing or procuring software for data analysis should choose a loosely coupled modular system. These systems are flexible enough to use with a variety of tools and connect to a range of data sources and architecture.
You should build data architecture in an agile way, to iterate and add value with each change. If you make up-front long-term decisions on one type of software you risk being unable to meet evolving user needs.
Choosing open source languages
Data scientists and analysts often use common open source languages such as Python and R. The benefits of using these languages include:
- good support and open training - which means reduced training costs
- new data analytics methods
- the ability to create your own methods using open source languages
The R and Python communities develop large collections of packages for data analysis. These packages, which include many data science packages, provide extensive analytical functionality.
Choosing tools that work with open technology
Choosing tools which work with open technology supports robust and appropriate data analysis, as set out in Principle 5 of the Data Ethics Framework.
Tools which work with open technology, such as Docker or Apache Spark, give your team the flexibility to meet your users’ needs. Open tools are usually designed to work together and across vendors. Benefits include the ability to:
- script a data pipeline using the best software for each task
- run your code anywhere - using commodity container platforms, platform as a service or a Hadoop cluster
Other benefits include better:
- support for software engineering practices
- capabilities in big data and machine learning
You can achieve better quality assurance in your software development with continuous integration and unit tests. The Reproducible Analytical Pipeline community has guidance about doing quality assurance in analytical pipelines.
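As a minimal sketch of what unit tests for an analytical pipeline might look like, the Python example below tests two small pipeline steps. The function names (`clean_ages`, `mean_age`) and the validation rule are hypothetical and chosen only for illustration. A continuous integration service could run tests like these on every change, for example through a test runner such as pytest.

```python
def clean_ages(records):
    """Keep records whose 'age' is a whole number between 0 and 120.

    Hypothetical cleaning step, used here only to have something to test.
    """
    cleaned = []
    for record in records:
        age = record.get("age")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_whole_number = isinstance(age, int) and not isinstance(age, bool)
        if is_whole_number and 0 <= age <= 120:
            cleaned.append(record)
    return cleaned


def mean_age(records):
    """Return the mean age, or None when there are no records."""
    if not records:
        return None
    return sum(record["age"] for record in records) / len(records)


# Test functions named test_* are collected automatically by common runners.
def test_clean_ages_drops_invalid_rows():
    raw = [{"age": 30}, {"age": -1}, {"age": None}, {"age": 50}]
    assert clean_ages(raw) == [{"age": 30}, {"age": 50}]


def test_mean_age_of_empty_input_is_none():
    assert mean_age([]) is None


def test_mean_age_of_valid_rows():
    assert mean_age([{"age": 30}, {"age": 50}]) == 40
```

Keeping each step as a small function with no hidden state makes it straightforward to test in isolation and to reuse in an automated pipeline.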
If you use spreadsheets and business intelligence software you should be aware they:
- often do not scale well to large datasets or intensive computation
- often do not integrate well into automated pipelines, or support best practices in quality assurance
- often need paid-for licences, making them expensive to deploy in the cloud
Case study - using data science with the Ministry of Justice analytical platform
The Ministry of Justice Analytical Platform supports the latest data analysis tools. This platform allows easy integration of new open source software and leading cloud services into a platform for 300 staff in the data analysis professions.
The platform is a flexible and secure environment, where:
- analysts use a web browser to sign in once and then develop code in tools such as R and Python
- you can access data and create live charts and dashboards that are accessible to end users in a web browser with no special software or licences
- it runs software using standardised containers on an auto-scaled Kubernetes cluster which allows the platform to run any of the latest open source tools, data stores and custom-built data analysis components
- you can add innovative services such as a new graph database or machine learning framework
- you can process datasets of almost unlimited size at low cost
The platform has helped the Ministry of Justice produce several national statistics more reliably and efficiently by using reproducible automated pipelines.
2. Use the cloud for the whole development cycle
In most circumstances, you should store data in the cloud and code in cloud-based version control repositories. You should run live data products in the cloud, as set out in the government’s Cloud First policy, and you should use the cloud throughout the whole development cycle.
Keeping your data in the cloud
You can use cloud services for data analysis work. It’s usually more efficient to use software-as-a-service and only pay for what you use, rather than setting up and running your own cluster for data. With cloud services it’s important to be alert to supplier ‘lock-in’ and always consider the cost of switching to another supplier.
The benefits of storing your data in the cloud are that:
- it scales well to large quantities of data that would not comfortably fit on a user’s machine
- you can take advantage of cloud-scale databases to process complicated queries in a reasonable time-frame
- you can use it for all stages from exploration through to production systems
- it’s simpler to combine different datasets from your organisation
- it’s usually the cheapest option, due to commoditisation and pay-as-you-go pricing, but evaluate this against your own needs
Rather than sharing data files by email, you can use the cloud to share data by sending a link. This is better practice because it helps you:
- control and monitor access to the data
- maintain connection to the original source data so you can avoid duplication and poor version control
- get reports with live updates
When using data in the cloud make sure that your data is accessible through a stable URL or other endpoint, as this will help you to make reproducible analysis code.
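One way to make use of a stable URL in analysis code is to record the URL together with a checksum of the data, so that a later run can confirm it is working from the same input. The sketch below assumes a hypothetical endpoint (`DATA_URL`) and hypothetical helper names. It uses only the Python standard library and is an illustration, not a prescribed approach.

```python
import hashlib
import urllib.request

# Hypothetical stable endpoint; replace with the URL of your own dataset.
DATA_URL = "https://data.example.org/datasets/example-dataset/v1.csv"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def fetch(url: str = DATA_URL, timeout: float = 30.0) -> bytes:
    """Download the raw bytes behind a URL (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def verify(data: bytes, expected_sha256: str) -> bool:
    """Check that the data matches the checksum recorded with the analysis."""
    return sha256_hex(data) == expected_sha256
```

In this sketch, an analyst would store the checksum alongside the code the first time they run the analysis, then call `verify(fetch(), recorded_checksum)` at the start of later runs to detect any change in the source data.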
Maintaining cloud-based version control
To maintain cloud-based version control and support collaboration, you should:
- use a cloud-hosted repository, such as GitHub, to create pull requests
- peer review code on a regular basis, to make sure you maintain the appropriate quality and keep all stakeholders up to date with any changes
- share code outside your team and organisation
- manage a list of issues
- encourage reviews and invite comment
Reproducibility with the cloud
Cloud-based version control allows you to run automatic tests, which help you to make data analysis ‘reproducible’.
You should aim to make your data analysis reproducible so it’s easy for someone else to work with it. For example, you can share your code and data so that someone else can run your data model on another computer at a different time and get the same results. This is important because someone can:
- check how your data analysis works
- test your data analysis with different queries
- run the analysis on a different dataset, or build on the analysis
Data analysts can make their data analysis reproducible by:
- writing code that runs their analysis, rather than doing analysis through a series of manual steps, such as manual clicks in a graphical user interface
- using the cloud for storing their data
- setting up continuous integration and automated testing on all users’ platforms
- specifying the library dependencies as well as their version numbers
It’s standard practice to specify dependencies and automate testing throughout software development. Unless you’re doing quick, throw-away experiments you should aim to make all your code reproducible.
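As an illustration of specifying dependencies with version numbers, the Python sketch below reads a list of pinned dependencies in the common `name==version` form and reports any that differ from what is installed. The function names are hypothetical. It relies on `importlib.metadata` from the standard library (Python 3.8 or later) and assumes exact pins only.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, separator, version = line.partition("==")
        if not separator or not version.strip():
            raise ValueError(f"dependency is not pinned to a version: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed)} for packages that differ or are missing.

    installed_version is a lookup function, replaceable in tests.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

A check like this could run at the start of a pipeline or in continuous integration, so that a difference between environments is reported before it affects results.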
Using a cloud development environment
A team that uses a cloud development environment:
- does not need to install something on their machine that would demand maintenance and updates (a benefit of software as a service)
- can install code libraries across all users’ environments, which makes the code easier to share and reproduce
- is not tied to a particular corporate network, enabling users and collaboration from outside the organisation
Using a cloud environment for data development also means that you:
- often have easy access to other cloud data services and cloud-hosted data due to the platform’s built-in credentials
- can decide which software to install
- can decide who is best placed to install software, such as using a platform team who understand analysts’ needs
- are less likely to see users keeping local copies of the data on their laptop for development, especially if the data is also in the cloud
Teams using a development environment on local machines might risk:
- not having administrator access for security reasons, which will prevent installation of development software
- having longer installation times for Python or R and their libraries
- having less access to cloud-hosted data which might cause users to create workarounds, such as using email to circulate data
Sometimes, your data analyst may prefer to use their own custom environment. Where practical, you should aim to be flexible and try to replicate the essential elements of the cloud environment on their local machine.
The cloud environment offers a baseline of libraries, but as soon as you need more libraries you should specify all the dependencies and their version numbers, to make sure work is still reproducible.
3. Use appropriate security when using data in the cloud
The government’s approach to security in the cloud is set out in the Cloud Security Principles from the National Cyber Security Centre (NCSC). Also, in the Risk Management Principles, NCSC states that the commercial public cloud is an acceptable place to put OFFICIAL data.
NCSC considers the cloud to have acceptable security because:
- there is less information on end user devices
- the supplier applies regular upgrades and security patches
- the supplier often has rigorous methods to audit data, and control access and monitoring
Whether you’re procuring SaaS or developing your own solution for a platform of tools and services, you should put in place mitigations such as:
- data encryption
- single sign-on
- two-factor authentication (2FA)
- fine-grained access control
- usage monitoring and alerts
- timely patching
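To illustrate two of the mitigations listed above, fine-grained access control and usage monitoring, the sketch below shows a deny-by-default permission check that logs every decision. The roles, dataset name and logger name are hypothetical. A real platform would normally rely on its cloud provider’s identity and access management service rather than application code like this.

```python
import logging

logger = logging.getLogger("access_audit")

# Hypothetical roles: each maps to the (dataset, action) pairs it may perform.
ROLE_PERMISSIONS = {
    "analyst": {("survey_responses", "read")},
    "data_engineer": {
        ("survey_responses", "read"),
        ("survey_responses", "write"),
    },
}


def is_allowed(role, dataset, action):
    """Deny by default: allow only pairs explicitly granted to the role."""
    allowed = (dataset, action) in ROLE_PERMISSIONS.get(role, set())
    # Record every decision so that usage can be monitored and alerted on.
    logger.info(
        "role=%s dataset=%s action=%s allowed=%s", role, dataset, action, allowed
    )
    return allowed
```

Granting permissions per dataset and per action, rather than per platform, is what makes the control fine-grained: an unknown role or an unlisted action is refused without any extra rule.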
Other security challenges for data analysts include developing code on a platform with:
- real data
- internet access
When platforms have internet access and hold real data, threat actors or attackers may try to steal or alter the data. Also, there is a greater risk of an accidental real data leak.
You should integrate security controls and monitoring with the data and network flows. This should be proportionate to the risks faced in experimental, collaborative and production environments.
Balance security choices with user needs
Security should protect data, but not stop users from accessing the data they need for their work. The Service Manual has guidance on securing information for government services.
You should build security into a system so it’s as invisible to the user as possible. Adding complicated login procedures, and restricting access to the tools users need, does not make your security better. Restrictive security makes shadow IT more likely, with users avoiding security measures and finding workarounds.
Case study - using Ministry of Justice data in the public cloud
There is a government policy supporting the use of the cloud for personal and sensitive data. Most UK departments have assessed the risks, put in appropriate safeguards and moved sensitive data into the public cloud.
One example is the Ministry of Justice, which moved its prisoner data into the public cloud. This data has an OFFICIAL classification and often the ‘SENSITIVE’ handling caveat. It includes information such as health records and the security arrangements for prisoners.
The project team makes sure the appropriate security is used, such as:
- careful isolation between elements using cloud sub-accounts, Virtual Private Clouds (VPCs) and firewall rules
- finely grained user and role permissions
- users logging in with two-factor authentication (2FA)
- being able to quickly revoke or rotate secrets, encryption keys and certificates
- frequent and reliable updates using peer-review and continuous deployment
- extensive audit trails
Hosting the data in the cloud has enabled the Ministry of Justice to perform additional analysis using modern open source tools and scalable computing resources through its Analytical Platform.
It’s possible to achieve this level of security and functionality with a private data centre, but it would need a huge investment in hardware, software and expert staff to design and maintain it. You can reduce these costs by using the public cloud and taking advantage of the continuous investment and developments made by the suppliers.
4. Choose open data standards for better interoperability
An open data standard specifies a way of formatting and storing data. This can make data compatible with a wide range of tools in a predictable fashion, and prevents lock-in to proprietary tools. Open standards allow organisations to:
- share information even when they do not have access to the same tools
- replace their tools and still have access to their data
- make a strategic decision to provide an agile environment that changes with the needs and capabilities of the users
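As a small illustration of the first two points, the sketch below writes records to CSV, a widely supported open tabular format, and reads them back using only the Python standard library. The helper names and sample fields are hypothetical. CSV is used here purely as an example of an open format that many tools can read.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text into a list of dicts; all values come back as strings."""
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

Because the format is open, the same text could equally be read by R, a spreadsheet or a database loader. Note that CSV does not carry data types, so numeric values return as strings and need converting by the reader.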
The Open Standards Board selects open standards for use by government.
Examples of open standards include the: