Monitoring the status of your service
You must make sure your service has monitoring tools that tell you about problems affecting it, for example:
- any issues with the infrastructure that supports the service
- any sudden increase in the number of user errors
- users not completing the task
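As a rough illustration of the last 2 signals, the sketch below flags a sudden increase in user errors and works out a task completion rate. The function names, the per-interval counts and the factor of 3 are assumptions made for this example, not recommended values.

```python
from statistics import mean


def is_sudden_increase(previous_counts, current_count, factor=3.0):
    """Return True if current_count is well above the recent average.

    previous_counts: user-error counts for earlier intervals (for example, per minute).
    factor: hypothetical multiplier; tune it for your own service.
    """
    if not previous_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(previous_counts)
    # Treat a baseline below 1 as 1 so that a single stray error does not alert.
    return current_count > factor * max(baseline, 1.0)


def completion_rate(started, completed):
    """Share of users who started the task and went on to complete it."""
    if started == 0:
        return None  # no traffic in this interval
    return completed / started
```

You could run checks like these on a schedule and alert when the error count jumps or the completion rate falls below a level your team has agreed.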
Meeting the Digital Service Standard
You must monitor the status of your service to meet these points:
You’ll have to explain how you’ve done this in your service assessments.
Why monitoring matters
Monitoring the status of your service and its infrastructure allows you to:
- identify problems before they happen or become more serious
- find out about problems that you need to solve urgently
- get alerts when a problem affects your service’s availability, so you can fix it
- plan capacity using metrics collected over time
- find ways to improve your service, its efficiency or the performance of your systems
- identify the root cause of an outage using data you collected during the outage
Set up monitoring early
Don’t leave monitoring to the end or tack it on as part of running the final production service.
Talk about monitoring early and agree on an approach, so you can build useful checks as you go along.
Writing tests at the same time as writing code is common. Treat your monitoring checks as tests for the running system.
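One way to picture a monitoring check as a test is the minimal sketch below, which uses only the Python standard library. The function names, the 2 second limit and the example URL are assumptions for illustration. The check returns a pass or fail result with a message, much as a test would.

```python
import time
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url (raises on network or HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_page_responds(url, fetch=fetch_status, max_seconds=2.0):
    """A monitoring check written like a test: returns (passed, message)."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except Exception as error:  # any failure to fetch counts as a failed check
        return False, f"{url} could not be fetched: {error}"
    elapsed = time.monotonic() - started
    if status != 200:
        return False, f"{url} returned status {status}"
    if elapsed > max_seconds:
        return False, f"{url} took {elapsed:.2f}s (limit {max_seconds}s)"
    return True, f"{url} responded in {elapsed:.2f}s"
```

Passing the fetch function in as a parameter means you can exercise the check itself with a fake response, for example `check_page_responds("https://service.example/start", fetch=lambda url: 500)`, before you point it at the running system.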
Include high-level checks
Often monitoring is seen through a very technical lens, so teams may only look at web application performance, available disk space or memory usage.
Although these are important, you must track them alongside more business-related metrics.
For example, comparing page-loading tests with failed transactions and application errors allows you to:
- find out about problems
- help identify the cause of problems
- relate low-level problems (such as disk space or slow performance) to the performance of the service
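The comparison described above could be sketched as follows. The `Interval` record, its field names and the 2 second threshold are hypothetical. The point is only that a low-level measurement sits next to business-level ones so you can look at them together.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    label: str                # for example "10:05"
    page_load_seconds: float  # result of a page-loading test
    failed_transactions: int
    application_errors: int


def intervals_to_investigate(intervals, slow_seconds=2.0):
    """Return labels of intervals where slow pages coincide with failures."""
    flagged = []
    for interval in intervals:
        slow = interval.page_load_seconds > slow_seconds
        failing = interval.failed_transactions > 0 or interval.application_errors > 0
        if slow and failing:
            flagged.append(interval.label)
    return flagged
```

An interval that is slow but has no failures may matter less to users than one where slow pages and failed transactions appear together, which is why the sketch only flags the overlap.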
Record and track errors
When you find an error, record it and track it over time. Errors always contain interesting information - they can tell you about:
- a user problem
- attacks in progress
- failing systems
- problems with capacity
You need to be able to see errors that are:
- part of the overall system
- specifically related to a particular application or machine
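A minimal sketch of recording errors and viewing them both across the whole system and by application or machine is shown below. The `ErrorLog` class and its field names are assumptions for illustration. A real service would more likely send errors to a dedicated logging or error-tracking tool.

```python
from collections import Counter
from datetime import datetime


class ErrorLog:
    """Keeps every recorded error so it can be counted in different ways."""

    def __init__(self):
        self.records = []

    def record(self, when, application, machine, message):
        """Store one error with the time and place it occurred."""
        self.records.append({
            "when": when,
            "application": application,
            "machine": machine,
            "message": message,
        })

    def count_per_hour(self):
        """Errors across the whole system, grouped by hour, to track over time."""
        return Counter(r["when"].strftime("%Y-%m-%d %H:00") for r in self.records)

    def count_by(self, field):
        """Errors grouped by one field, such as 'application' or 'machine'."""
        return Counter(r[field] for r in self.records)
```

For example, after recording errors with `log.record(datetime(2024, 1, 1, 9, 15), "frontend", "web-1", "timeout")`, calling `log.count_per_hour()` gives the system-wide trend and `log.count_by("machine")` shows whether one machine accounts for most of them.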
Make data widely available
Make data from the following available to as many people as possible:
- your monitoring system
- interactive tools
You may also find the Uptime and availability guide useful.
Published by: Technology community (web operations)