You must build and run your digital service in a way that means it’ll work for users when they need it.
This could mean you need it to be available 24 hours a day and 365 days a year.
Achieving the level of uptime you need involves good planning and design.
Design to maximise uptime
Your service should have more possible states than simply ‘on’ or ‘off’. For example you can:
- design your components so that they fall back to minimal functions if something goes wrong
- introduce a read-only mode where users can look at information but not change it
- build in redundancy and avoid single points of failure (eg having only one vendor could be a single point of failure)
- use more than one web server and load-balance between them, so the service keeps running if one server fails
- use database systems which spread data and queries across a cluster, to minimise the impact of a single database node failing
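As a rough illustration of a service having more states than ‘on’ or ‘off’, the sketch below picks the most capable mode that a set of health checks allows. The mode names, the `choose_mode` function and its three health flags are all hypothetical, chosen only for this example. Your own service will have different components and checks.

```python
from enum import Enum


class ServiceMode(Enum):
    """Hypothetical service states, from most to least capable."""
    FULL = "full"            # everything works
    MINIMAL = "minimal"      # only essential functions are offered
    READ_ONLY = "read_only"  # users can look at information but not change it
    OFF = "off"              # nothing can be served


def choose_mode(database_readable: bool,
                database_writable: bool,
                optional_components_healthy: bool) -> ServiceMode:
    """Pick the most capable mode the current health checks allow."""
    if not database_readable:
        return ServiceMode.OFF
    if not database_writable:
        return ServiceMode.READ_ONLY
    if not optional_components_healthy:
        return ServiceMode.MINIMAL
    return ServiceMode.FULL


def can_submit_changes(mode: ServiceMode) -> bool:
    """Users can only change information while writes are possible."""
    return mode in (ServiceMode.FULL, ServiceMode.MINIMAL)
```

With this approach, a failed database write path would move the service into read-only mode rather than taking it offline.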
If your service relies on a third party service and this goes down, you can queue information and process it later.
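A minimal sketch of queueing information for later processing is shown below. It assumes the third party call is a function you pass in, and that this function raises `ConnectionError` when the third party is down. The `DeferredSender` class is hypothetical, and it keeps the queue in memory only. A real service would probably need a durable queue so that information survives a restart.

```python
from collections import deque
from typing import Callable, Deque, Dict


class DeferredSender:
    """Send items to a third party, queueing them while it is unavailable."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        self._send = send                     # hypothetical call to the third party
        self._pending: Deque[Dict] = deque()  # items waiting to be processed

    def submit(self, item: Dict) -> bool:
        """Try to send now. Return False if the item was queued instead."""
        # Preserve ordering: if older items are waiting, queue behind them.
        if self._pending:
            self._pending.append(item)
            return False
        try:
            self._send(item)
            return True
        except ConnectionError:
            self._pending.append(item)
            return False

    def process_pending(self) -> int:
        """Retry queued items in order and stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still down, so leave the rest queued
            self._pending.popleft()
            sent += 1
        return sent

    def pending_count(self) -> int:
        """Number of items still waiting to be sent."""
        return len(self._pending)
```

You could call `process_pending` on a timer, or when a health check shows the third party service is available again.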
Issues that can affect uptime
Underlying infrastructure availability
Your service’s availability is dependent on the availability of many systems, potentially with multiple suppliers. You may not have a relationship with all of these suppliers. This can become complicated.
For example:
- your application is maintained by your development team
- the application relies on an application server or database provided by another team in your department
- the server or database runs on an infrastructure as a service platform provided via a contract with a commercial supplier
- the infrastructure as a service platform relies on network connectivity and power from utilities that you have no direct contract with
You should understand your dependencies, the things they depend on in turn, and the uptime each of them is expected to deliver.
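To see why a chain of dependencies matters, you can estimate the combined availability of a service that only works when every layer is up. The sketch below multiplies the individual figures together, which assumes the layers fail independently of each other. The layer names and percentages are made up for illustration and do not come from any real supplier.

```python
from math import prod
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every dependency must be up.

    Assumes failures are independent, so the figures can be multiplied.
    """
    return prod(availabilities)


# Illustrative figures only (assumptions, not real supplier guarantees).
chain = {
    "application": 0.999,
    "application server or database": 0.999,
    "infrastructure as a service platform": 0.9995,
    "network connectivity and power": 0.9999,
}

overall = combined_availability(chain.values())
print(f"Estimated overall availability: {overall:.2%}")
```

With these example figures the overall estimate is roughly 99.74%, which is lower than any single layer’s figure.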
Some services don’t count pre-arranged maintenance periods as downtime.
For example, a service could claim 100% uptime even though it shuts down every Monday evening for maintenance.
You shouldn’t hide uptime problems behind multiple maintenance periods. You can classify downtime as planned (scheduled maintenance) or unplanned (other problems), if your service genuinely needs scheduled maintenance.
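The difference between the two ways of counting can be worked through with a small calculation. The sketch below assumes a 2 hour maintenance window every Monday evening and no other problems during the week. Both the window length and the `uptime_percent` function are assumptions made for this example.

```python
def uptime_percent(period_hours: float, downtime_hours: float) -> float:
    """Percentage of the period during which the service was available."""
    return 100.0 * (period_hours - downtime_hours) / period_hours


WEEK_HOURS = 7 * 24      # 168 hours in a week

planned_hours = 2.0      # assumed Monday evening maintenance window
unplanned_hours = 0.0    # assumed no other problems this week

# Counting only unplanned downtime, as some services do.
claimed = uptime_percent(WEEK_HOURS, unplanned_hours)

# Counting everything users actually experienced.
experienced = uptime_percent(WEEK_HOURS, planned_hours + unplanned_hours)

print(f"Claimed uptime: {claimed:.2f}%")
print(f"Uptime users experienced: {experienced:.2f}%")
```

In this example the claimed figure is 100%, while users experienced about 98.81%.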
Suppliers and contracts
You shouldn’t underestimate the impact that contracts you agree with suppliers (of products or services) can have on your service’s availability.
You need to fully understand the terms in any contracts you agree, for example:
- service level agreements
- uptime guarantees
Suppliers may miss uptime guarantees or service level agreement response times.
Although they may offer you money or service credits as compensation, you should consider whether this really offsets the effect of the downtime on your users.
If you’re regularly getting credits for uptime problems, consider whether you’re really getting the offered uptime or service level agreement response from your supplier.
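It can help to turn an uptime guarantee into the amount of downtime it allows, then compare that with what you measured. The sketch below assumes the guarantee is expressed as a percentage over a 30 day period. The function names and the example figures are hypothetical, so check your own contract for how the period and downtime are defined.

```python
def allowed_downtime_minutes(guarantee_percent: float,
                             period_days: int = 30) -> float:
    """Downtime a percentage guarantee permits over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - guarantee_percent / 100)


def met_guarantee(guarantee_percent: float,
                  measured_downtime_minutes: float,
                  period_days: int = 30) -> bool:
    """True if measured downtime stayed within the guaranteed limit."""
    limit = allowed_downtime_minutes(guarantee_percent, period_days)
    return measured_downtime_minutes <= limit


# Example: an assumed 99.9% guarantee over 30 days.
limit = allowed_downtime_minutes(99.9)
print(f"Allowed downtime: {limit:.1f} minutes")

# Example: 60 minutes of measured downtime in the same period.
print("Guarantee met:", met_guarantee(99.9, 60))
```

A 99.9% guarantee over 30 days allows about 43.2 minutes of downtime, so 60 minutes of measured downtime would miss it.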
Decide on out-of-hours support
If your service fails outside normal office hours, for example in the evening or at the weekend, it could be down for a long time unless someone is responsible for out-of-hours support.
Carry out user research to find out whether your users are likely to use your service outside normal office hours. If they are, you should:
- put someone in your team on call to deal with any problems
- have dedicated 24/7 support
Tell users about downtime
You may decide that you can’t afford to guarantee your service will be up at all times. If you do this, tell your users when it’ll be down and explain why.
You should have a status page that you can update when there’s downtime you didn’t anticipate.
Case studies and examples
Find out what happens when something goes wrong on GOV.UK, from the Inside GOV.UK blog.
You may also find the Monitoring the status of your service guide useful.