Monitoring the status of your service
By the time you reach public beta, you must have monitoring in place for your service to identify any problems that might affect it.
Monitoring with the right tools and processes allows you to:
- discover any problems that users have
- get alerts when technical problems occur so you can fix them
- anticipate problems before they happen or become more serious
- improve your service, for example by using performance data to help with capacity planning
Meeting the Digital Service Standard
You must monitor the status of your service to meet the Digital Service Standard.
You’ll have to explain how you’ve done this in your service assessments.
Plan your monitoring
You should start planning how to monitor your service during alpha.
During alpha, your team should agree:
- what to monitor in your service
- how to monitor your service
- how to process and record issues
Metrics to monitor
You must track user-related metrics as well as technical metrics. For example, track the percentage of users who can complete a task, alongside available disk space, application programming interface (API) performance and memory usage.
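As an illustration only, the sketch below collects one user-related metric and one technical metric side by side. The function names and the way started and completed tasks are counted are assumptions made for this example, not part of any particular monitoring tool.

```python
import shutil


def completion_rate(started: int, completed: int) -> float:
    """Percentage of users who finished a task they started."""
    if started == 0:
        return 0.0
    return 100.0 * completed / started


def disk_free_percent(path: str = "/") -> float:
    """Percentage of disk space still free on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def snapshot(started: int, completed: int, path: str = "/") -> dict:
    """Combine a user metric and a technical metric in one report.

    `started` and `completed` are hypothetical counts that your service
    would need to supply from its own analytics or logs.
    """
    return {
        "task_completion_percent": round(completion_rate(started, completed), 1),
        "disk_free_percent": round(disk_free_percent(path), 1),
    }
```

Reporting both kinds of metric together makes it easier to see whether a technical problem, such as a nearly full disk, coincides with users failing to complete a task.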
How to monitor
Once you’ve agreed what to monitor, your team should:
- set up internal and external monitoring checks
- write monitoring checks
- write alerts
Setting up internal and external monitoring checks
You should set up internal and external monitoring checks.
Internal monitoring runs inside your infrastructure and gives you real-time updates about metrics like memory usage, page load times and network traffic.
External monitoring runs outside your infrastructure, so it keeps checking your systems even if your infrastructure goes down.
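A minimal sketch of an external check might look like the following. It assumes your service exposes a URL that can be requested from outside your infrastructure. The `external_check` and `fetch_status` names, and the rule that any status from 200 to 399 counts as "up", are assumptions for this example.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`; raise OSError if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code


def external_check(url: str, fetch=fetch_status) -> dict:
    """Report whether the service looks available from the outside."""
    try:
        status = fetch(url)
    except OSError as err:
        # URLError, timeouts and refused connections are all OSError subclasses.
        return {"url": url, "up": False, "detail": f"unreachable: {err}"}
    return {"url": url, "up": 200 <= status < 400, "detail": f"HTTP {status}"}
```

To be useful as external monitoring, a check like this would need to run somewhere that does not depend on the infrastructure it is checking.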
Writing monitoring checks
You need to decide the type of monitoring checks that are most useful to your service.
A monitoring check is a series of tests that you can run against your systems or overall service to assess their status and tell you if something is wrong.
For example, you might decide you need to see an alert if 1% of users in an hour have problems finishing a transaction.
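The 1% example above could be sketched as a check over a rolling one-hour window. This version counts transaction attempts rather than distinct users, which is a simplifying assumption, and the class and method names are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class TransactionFailureCheck:
    """Flag when too many transactions fail within a rolling window."""

    def __init__(self, threshold_percent: float = 1.0,
                 window: timedelta = timedelta(hours=1)):
        self.threshold_percent = threshold_percent
        self.window = window
        # (timestamp, succeeded) pairs, assumed to arrive in time order.
        self.events = deque()

    def record(self, timestamp: datetime, succeeded: bool) -> None:
        self.events.append((timestamp, succeeded))

    def failure_percent(self, now: datetime) -> float:
        # Discard events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, succeeded in self.events if not succeeded)
        return 100.0 * failures / len(self.events)

    def should_alert(self, now: datetime) -> bool:
        return self.failure_percent(now) >= self.threshold_percent
```

Keeping the threshold and window as parameters means the team can adjust them as it learns what level of failure is normal for the service.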
You should write monitoring checks at the same time as writing code and treat your checks as tests for your live system.
Make your alert messages clear and concise. They need to be easy to understand for team members who might be woken up in the night to fix a problem.
Consider creating an operations manual or documentation to help your team deal with problems quickly. Make sure every member of your team has a local copy of the documentation in case your cloud-based documentation storage is unavailable.
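One way to keep alert messages short and consistent is to build them from a fixed set of fields. The sketch below is an assumption about what those fields might be: the service name, the problem, the impact on users and where to look in your operations manual.

```python
def format_alert(service: str, problem: str, impact: str, manual_section: str) -> str:
    """Build a three-line alert that can be understood at a glance.

    All four fields are hypothetical; choose the ones your team needs.
    """
    return (
        f"[{service}] {problem}\n"
        f"Impact: {impact}\n"
        f"What to do: see '{manual_section}' in the operations manual"
    )
```

For example, `format_alert("Apply for a licence", "Payment failures at 3% in the last hour", "Some users cannot finish paying", "Payment failures")` gives a message that states what is wrong, who is affected and where the fix is documented.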
Processing and recording issues
You should manage and track errors using a ticketing system that allows you to delegate them to members of your team.
Errors contain useful information. They can tell you about:
- a user problem
- attacks on your service
- failing systems
- problems with capacity
Tracking errors helps you to see which ones are recurring and whether they’re part of the overall service or related to a particular application or machine.
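As a rough sketch, recurring errors can be found by grouping recorded errors by type, application and machine. The record fields used here (`type`, `application`, `machine`) are assumptions; your ticketing or logging system will have its own.

```python
from collections import Counter


def recurring_errors(errors: list, min_count: int = 2) -> list:
    """Return (type, application, machine) groups seen at least `min_count` times."""
    counts = Counter(
        (e["type"], e["application"], e["machine"]) for e in errors
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


def machines_affected(errors: list, error_type: str) -> list:
    """List the machines on which one type of error has occurred."""
    return sorted({e["machine"] for e in errors if e["type"] == error_type})
```

An error that recurs on a single machine suggests a problem with that machine, while the same error across many machines points to the application or the overall service.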
You can combine monitoring test results to better understand what to fix in your service. For example, comparing page-loading tests with failed transactions and application errors allows you to:
- find the parts of your service where the most users are having problems
- identify the cause of problems
- discuss how to fix the cause of problems, for example, disk space or slow performance
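The comparison described above could be sketched as a simple per-page report. The three input dictionaries, keyed by page path, and the 2-second threshold for a slow page are assumptions made for this example.

```python
def problem_pages(load_times: dict, failed: dict, errors: dict,
                  slow_seconds: float = 2.0) -> list:
    """Combine three sets of monitoring results into one report per page.

    load_times: page -> average load time in seconds
    failed:     page -> number of failed transactions
    errors:     page -> number of application errors
    """
    report = []
    for page in sorted(set(load_times) | set(failed) | set(errors)):
        seconds = load_times.get(page)
        report.append({
            "page": page,
            "load_seconds": seconds,
            "slow": seconds is not None and seconds > slow_seconds,
            "failed_transactions": failed.get(page, 0),
            "application_errors": errors.get(page, 0),
        })
    # Put the pages where the most transactions fail first.
    report.sort(key=lambda row: row["failed_transactions"], reverse=True)
    return report
```

A page that appears near the top of the report and is also slow or producing application errors is a good starting point for discussing the cause.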
Make data widely available
You should make monitoring information and data widely available, unless it is not safe to do so.
For example, you can share performance reports with other service teams in your department or use a status dashboard, like the operations status page used by GOV.UK Notify, to tell users about any issues.
Reviewing your monitoring processes regularly
You should review your monitoring processes every time you get an alert.
If someone is called out of hours, you should make sure the issue needed that level of response.
For example, if the issue didn’t affect users and could have waited until the morning, consider changing your alert strategy so that type of error doesn’t prompt an alert in future.
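If you keep a record of each callout, a review like this can be partly automated. The sketch below assumes each record notes whether the callout was out of hours, whether users were affected and whether the issue could have waited. These field names are hypothetical.

```python
def alerts_to_downgrade(callouts: list) -> list:
    """Return the names of alerts that called someone out without need.

    An alert is listed if it fired out of hours for an issue that did not
    affect users and could have waited until the morning.
    """
    names = set()
    for callout in callouts:
        if (callout["out_of_hours"]
                and not callout["affected_users"]
                and callout["could_wait"]):
            names.add(callout["alert"])
    return sorted(names)
```

The result is only a list of candidates. The team still needs to decide whether each alert should be changed.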
You may also find the Uptime and availability guide useful.