Guidance

A/B testing: comparative studies

How to use A/B testing to evaluate your digital health product.

This page is part of a collection of guidance on evaluating digital health products.

A/B testing involves comparing 2 different versions of a design to see which performs better. It helps you understand how the differences between the 2 versions affect users’ behaviour and outcomes.

A/B testing is useful for any digital health service that uses an app, website or newsletter campaign.

What to A/B test

You can A/B test almost anything that affects visitor behaviour, for example:

  • headlines and text – including length, structure and position on the page
  • content – including tone and language
  • calls to action – including wording, size, colour and placement
  • forms – including length, fields and descriptions
  • images – including the choice between cartoon or realistic pictures

Pros

Benefits include:

  • A/B tests let you explore different ideas and then make changes based on quantitative data
  • they can produce definitive answers, because randomisation makes sure the participants in each group are similar

Cons

Drawbacks include:

  • A/B tests can be technically complex to set up
  • you need a large number of users for the results to reach statistical significance

How to carry out A/B testing

An A/B test is like a randomised controlled trial for design choices.

Identify problem areas in your intervention. Construct a hypothesis to test or a goal you want to reach.

Create a control and a variation. Create a new variation of your intervention that could help prove or disprove your hypothesis, then test it against the original. If you’re testing an existing web page or app, your control is the original version (version A) and the new version is version B.

Decide your sample size, if necessary, and how long to run the test for. Base this on the number of monthly visitors, the expected change in user behaviour and other factors.
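
A minimal sketch of how you might estimate the sample size, using Python and the statsmodels library. The baseline rate and the smallest improvement worth detecting are illustrative assumptions, not figures from this guidance.

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.12   # current rate of the behaviour you measure (assumed)
target_rate = 0.14     # smallest improvement you want to detect (assumed)

effect_size = proportion_effectsize(baseline_rate, target_rate)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,          # 5% significance level
    power=0.8,           # 80% chance of detecting a real effect of this size
    alternative='two-sided',
)

print(f"Users needed in each group: {round(n_per_group)}")
# Dividing this by your monthly visitor numbers gives a rough test duration.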

Split your sample into equal groups, assigning users to each version at random.
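
For a website or app, one common way to do this is to assign each user based on a hash of a stable user identifier, so the split is roughly 50/50 and each user always sees the same version. A minimal sketch in Python; the user IDs are hypothetical.

import hashlib

def assign_version(user_id: str) -> str:
    """Return 'A' or 'B' deterministically, splitting users roughly 50/50."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

for user_id in ["user-001", "user-002", "user-003"]:
    print(user_id, assign_version(user_id))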

Analyse the results of your A/B test, comparing the outcomes of the original version against the variation. If one version clearly performs better, you can implement it. If the results are inconclusive, review your hypothesis or goal, come up with new variations and continue A/B testing.
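
A minimal sketch of how you might analyse a binary outcome (for example, whether users clicked a call to action), using a two-proportion z-test from the statsmodels library. The counts are made up for illustration.

from statsmodels.stats.proportion import proportions_ztest

clicks = [430, 495]      # users who clicked: version A, version B (assumed)
visitors = [5000, 5000]  # users shown each version (assumed)

z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)

print(f"Version A: {clicks[0] / visitors[0]:.1%}, "
      f"version B: {clicks[1] / visitors[1]:.1%}, p = {p_value:.3f}")
# A p value below 0.05 is conventionally taken as statistically significant;
# a larger p value means the test is inconclusive.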

Example: A/B testing of digital messaging to cut hospital non-attendance rates

See Senderey and others (2020): It’s how you say it: Systematic A/B testing of digital messaging to cut hospital no-show rates.

A group in Israel wanted to reduce the proportion of people who did not attend their hospital appointments. Patients with appointments were already sent SMS text reminders 5 days before their appointment. The team developed 8 new wordings and compared all 8 against the previously used generic message. This was a 9-way comparison, not just testing A versus B.

Non-attendance at appointments costs health services money and there have been numerous initiatives to try to reduce the non-attendance rate. Text reminders have been shown to be effective and cheap in various trials, although their effects are often small. The team wanted to see if the wording of the text could be improved to have more effect.

The original text message (translated from Hebrew) was:

Hello, this is a reminder for a hospital appointment you have scheduled. Click the link to confirm or cancel attendance to the appointment.

The team drew on behavioural economic theory to produce different motivational narratives for the text messages. For example, a message based on evoking a social identity read:

Hello, this is a reminder for a hospital appointment you have scheduled. Most of the patients in our clinic make sure to confirm their appointment in advance. Click the link to confirm or cancel attendance to the appointment.

The evaluation took place over 4 months across 14 hospitals. Patients receiving reminders were randomly allocated to receive one of the 9 different messages. In total, 161,587 reminders were sent within the evaluation. The proportion of non-attendances in each group was recorded, along with various background variables. With 9 options and over 150,000 reminders, this was a large A/B test. A/B testing is often carried out on a much smaller scale.

Non-attendance rates varied from 21% in the control group (which received the original text) down to 14% in the group that received the emotional guilt message, which read:

Hello, this is a reminder for a hospital appointment you have scheduled. Not showing up to your appointment without cancelling in advance delays hospital treatment for those who need medical aid. Click the link to confirm or cancel attendance to the appointment.

In all, 5 of the alternative messages showed statistically significantly lower non-attendance rates. The hospital group involved in the evaluation have now switched to using the emotional guilt message. They are monitoring what happens to their non-attendance rate. However, the evaluation team also note that their study focused on the non-attendance rate and could not assess other outcomes. For example, could this phrasing reduce patient satisfaction?
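
As a rough illustration of what a change of that size could mean in practice (not a figure reported in the paper), a short Python calculation applies the two rates to the number of reminders sent:

reminders_sent = 161_587  # reminders sent during the evaluation
control_rate = 0.21       # non-attendance with the original message
guilt_rate = 0.14         # non-attendance with the emotional guilt message

missed_control = reminders_sent * control_rate
missed_guilt = reminders_sent * guilt_rate

print(f"Missed appointments at 21%: {missed_control:,.0f}")
print(f"Missed appointments at 14%: {missed_guilt:,.0f}")
print(f"Difference: {missed_control - missed_guilt:,.0f}")
# Roughly 11,000 fewer missed appointments if every reminder used the
# better-performing message - an illustration, not a study result.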

More information

Kohavi and Longbotham (2017): Online Controlled Experiments and A/B Testing.

Groth and Haslwanter (2016): Efficiency, effectiveness, and satisfaction of responsive mobile tourism websites: a mobile usability study. The authors conducted A/B testing to compare 2 versions of a website.

Published 30 January 2020
Last updated 3 September 2020
  1. Added a new example to illustrate A/B testing of digital health products.
  2. First published.