Use of confidence intervals
Updated 27 September 2024
Applies to England
All NHS Breast Screening Programme (NHS BSP) staff with a quality assurance (QA) role must be able to assess performance and interpret performance data accurately. Understanding confidence intervals (or ‘confidence limits’) is a crucial part of that process. This information explains what a ‘confidence interval’ means and how it is calculated. Section 1 provides a general introduction adapted in part from material on the NHS website. Section 2 (written by Dr Roger Blanks) provides a more detailed explanation with worked examples.
Introduction to confidence intervals
All estimates involve a measure of uncertainty, because studies are conducted on samples and not on entire populations. A confidence interval is a way of expressing the precision of an estimate (or the uncertainty surrounding it). It is often presented alongside the results of a study.
In the Swedish two-county trial, for example, screening significantly reduced the rate of deaths resulting from breast cancer. Women in the screening group had a 38% reduced risk of dying from breast cancer compared with those in the non-screened group. In the trial this was expressed as: ‘relative risk 0.62, 95% confidence interval [CI] 0.51 to 0.75’.
‘Relative risk’ compares a risk in 2 different groups of people. All sorts of groups are compared with others in medical research to see if belonging to a particular group increases or decreases the risk of developing certain diseases. This measure of risk is often expressed as a percentage increase or decrease, for example ‘a 20% increase in risk’ of treatment A compared with treatment B. If the relative risk is 300%, it may also be expressed as ‘a threefold increase’.
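The link between a relative risk and the percentage change quoted alongside it is simple arithmetic. A minimal Python sketch follows; the function name `percent_change_from_relative_risk` is illustrative and does not come from the source.

```python
def percent_change_from_relative_risk(relative_risk: float) -> float:
    """Return the percentage change in risk implied by a relative risk.

    Negative values are reductions, positive values are increases.
    """
    return (relative_risk - 1.0) * 100.0


# Relative risk reported for the Swedish two-county trial
print(round(percent_change_from_relative_risk(0.62)))  # -38, a 38% reduction
# A relative risk of 1.2 corresponds to 'a 20% increase in risk'
print(round(percent_change_from_relative_risk(1.2)))   # 20
```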
The most common interval (the 95% confidence interval) shows where we confidently expect the true result from a population to lie 95% of the time: in the Swedish two-county trial, the relative risk is expected to lie between 0.51 and 0.75. The narrower the interval or range, the more precise the estimate. A confidence interval of 95% certainty is usually considered high enough for researchers to draw conclusions that are sufficiently robust to be extended from the sample to the population as a whole.
In QA we assume that a sample (or observed) cancer detection rate based on a year’s data is an estimate of a radiologist’s true annual cancer detection rate. But the accuracy of that estimate depends on its denominator, for example on the number of women screened. If only a few hundred women are screened and the denominator is small, the cancer detection rate would owe a lot to chance: if many tens of thousands are screened then the denominator would be much larger and the role of chance would shrink.
We can use confidence intervals to examine the role of chance in a particular estimate. For example, a radiologist obtains a cancer detection rate of 2 per 1000, based on 20 cancers detected from 10,000 women. This is a reasonably large denominator and enables us to say with 80% certainty that the detection rate will lie between 1.4 and 2.6 per 1000. (For details of how this confidence interval is calculated, see below.) If the target rate is 4 per 1000 then we have sound evidence that this radiologist has a low cancer detection rate. On this basis we could infer that future detection rates might also be low, which would justify looking in more detail at the radiologist’s performance. But if the rate of 2 per 1000 were based on a much smaller denominator, say 1 cancer detected from 500 women screened, then its interpretation would be very different. The 80% confidence interval would in this case be 0 to 4.6 per 1000; in other words, the true cancer detection rate could exceed the programme’s target of 4 per 1000, so the data alone would not justify detailed scrutiny of the radiologist’s performance. In both cases, the confidence interval indicates the range in which we are 80% sure that the true cancer detection rate of the radiologist lies.
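The two scenarios above can be reproduced with a short Python sketch. It assumes the normal-approximation formula described under ‘Confidence limits for proportions’, a multiplier of 1.28 for the 80% interval, and that a negative lower limit is reported as zero; the function and variable names are illustrative only.

```python
import math

Z_80 = 1.28  # assumed normal multiplier for an 80% interval


def detection_rate_ci_per_1000(cancers: int, screened: int, z: float = Z_80):
    """80% confidence interval for a cancer detection rate, per 1000 women.

    Uses the normal approximation to the binomial; a negative lower
    limit is reported as zero.
    """
    p = cancers / screened
    half_width = z * math.sqrt(p * (1 - p) / screened)
    lower = max(0.0, p - half_width)
    upper = p + half_width
    return lower * 1000, upper * 1000


TARGET_PER_1000 = 4.0  # target rate used in the example above

for cancers, screened in [(20, 10_000), (1, 500)]:
    low, high = detection_rate_ci_per_1000(cancers, screened)
    below_target = high < TARGET_PER_1000
    print(f"{cancers}/{screened}: {low:.1f} to {high:.1f} per 1000, "
          f"whole interval below target: {below_target}")
```

With 20 cancers from 10,000 women the whole interval lies below the target; with 1 cancer from 500 women the upper limit exceeds it, so the same observed rate supports a much weaker conclusion.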
Confidence limits for proportions
Cancer detection rates, recall rates and the positive predictive value (PPV) can all be thought of as proportions, though they are not described in that way. By using the normal approximation to the binomial distribution we can calculate a simple confidence limit that will enable us to determine if these rates and PPV are based on sufficient numbers to provide a reasonable estimation of performance. We can calculate either the 95% confidence limit routinely used for trials and other studies or a less stringent 80% confidence limit, which is arguably more useful for proactive QA. The formulae are as follows.
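95% confidence limit:
$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
80% confidence limit:
$$p \pm 1.28\sqrt{\frac{p(1-p)}{n}}$$
Here p is the observed proportion and n is the denominator on which it is based (for example, the number of women screened).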
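The multipliers used in normal-approximation limits of this kind are quantiles of the standard normal distribution. As a sketch (not part of the original guidance), Python's standard library `statistics.NormalDist` can recover them, and the interval p ± z·√(p(1 − p)/n) can be wrapped in a reusable function. The names `z_multiplier` and `proportion_ci` are illustrative.

```python
import math
from statistics import NormalDist


def z_multiplier(confidence: float) -> float:
    """Two-sided standard normal multiplier for a confidence level (e.g. 0.95)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def proportion_ci(p: float, n: int, confidence: float = 0.95):
    """Normal-approximation confidence limits for a proportion p based on n cases."""
    half_width = z_multiplier(confidence) * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


print(round(z_multiplier(0.95), 2))  # 1.96
print(round(z_multiplier(0.80), 2))  # 1.28
```

A lower confidence level gives a smaller multiplier and therefore a narrower interval, which is why the 80% limits are less stringent than the 95% limits.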
Example: Consider a reader who reads images from 3,000 women, of whom 180 are recalled and 30 have cancers detected. The cancer detection rate is 30/3,000, which is 10 per 1000 or 0.01 as a proportion. The recall rate is 180/3,000, which is 6% or 0.06 as a proportion, and the PPV is 30/180, which is 16.7% or 0.167 as a proportion.
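The 95% confidence limits around the cancer detection rate as a proportion are:
$$0.01 \pm 1.96\sqrt{\frac{0.01 \times 0.99}{3000}} = 0.01 \pm 0.00357 = 0.00643 \text{ to } 0.01357$$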
Or, multiplying by 1000 to report this as a rate per 1000, the cancer detection rate is 10 per 1000 (95% CI 6.43 to 13.57 per 1000). Similarly, using the above equation, the 80% confidence limit is 0.01 ± 0.00233 = 0.00767 to 0.01233, or, multiplying by 1000, 10 per 1000 (80% CI 7.67 to 12.33 per 1000).
We can interpret the 95% confidence interval as suggesting a 19 in 20 chance that the true value is between 6.43 per 1000 and 13.57 per 1000, while the 80% confidence interval suggests a 4 in 5 chance that the true value is between 7.67 per 1000 and 12.33 per 1000. In both cases the best estimate is 10 per 1000 and we can argue that this is a reasonable estimate.
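The same calculation can be applied to all three measures for this reader. The sketch below is an illustration rather than part of the source: it assumes the normal-approximation formula with a multiplier of 1.96, and it assumes that the denominator for the PPV is the number of women recalled (180), not the number screened.

```python
import math


def ci_95(events: int, denominator: int):
    """95% normal-approximation confidence limits for events/denominator."""
    p = events / denominator
    half_width = 1.96 * math.sqrt(p * (1 - p) / denominator)
    return p, p - half_width, p + half_width


screened, recalled, cancers = 3000, 180, 30

measures = {
    "cancer detection rate": (cancers, screened),
    "recall rate": (recalled, screened),
    # Assumption: PPV is cancers detected among women recalled
    "PPV": (cancers, recalled),
}

for name, (events, denominator) in measures.items():
    p, low, high = ci_95(events, denominator)
    print(f"{name}: {p:.4f} (95% CI {low:.4f} to {high:.4f})")
```

Because the code does not round intermediate values, the detection-rate limits come out at about 6.44 to 13.56 per 1000, marginally narrower than the rounded figures quoted in the text. The PPV interval is comparatively wide because its assumed denominator is only 180.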
What if the reader read images from only 300 women, referred 27 and detected 3 cancers? The cancer detection rate is still 10 per 1000, but the 80% confidence limits are 2.65 per 1000 to 17.35 per 1000. There is thus a 4 in 5 chance that the true cancer detection rate, after allowing for chance, is between 2.65 per 1000 and 17.35 per 1000. The 95% confidence limits are wider still: there is a 19 in 20 chance that the true value is between –1.3 per 1000 and 21.3 per 1000. The negative value occurs because the normal approximation becomes unreliable when the number of women in the sample (n) is very small. A negative cancer detection rate is impossible, however, so any negative limit is interpreted as a zero detection rate. This means that we are 95% sure that the true value is between 0 and 21 per 1000, a range so wide that it tells us effectively nothing about the reader’s cancer detection rate: with so few women screened, the observed rate reflects chance more than performance. We can conclude that when the sample is small (in this case only 300 women screened), the cancer detection rate is not a useful measure.
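The small-sample case, including the treatment of a negative lower limit, can be sketched as follows. This is a minimal illustration that assumes the normal-approximation formula with multipliers of 1.28 (80%) and 1.96 (95%); the function name is hypothetical.

```python
import math


def clipped_ci_per_1000(events: int, n: int, z: float):
    """Normal-approximation limits per 1000, with a negative lower limit set to 0.

    Returns (raw lower limit, reported lower limit, upper limit).
    """
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    raw_lower = p - half_width
    lower = max(0.0, raw_lower)  # a negative rate is impossible
    return raw_lower * 1000, lower * 1000, (p + half_width) * 1000


for label, z in [("80%", 1.28), ("95%", 1.96)]:
    raw, low, high = clipped_ci_per_1000(3, 300, z)
    print(f"{label}: raw lower {raw:.2f}, reported {low:.2f} to {high:.2f} per 1000")
```

Setting the lower limit to zero only tidies the reported range. It does not make the approximation any more reliable when n is this small.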
Recall rate
Based on 27 women referred out of 300 screened, the recall rate is 9%. Converting this to a proportion and using the above formulae again, we can calculate the 80% confidence limits as approximately 7% to 11%. The recall rate is thus a more useful measure than the cancer detection rate, even when based on relatively small numbers of women screened.
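One way to see why the recall rate is more informative than the cancer detection rate at this sample size is to express each interval's half-width as a fraction of the estimate itself. The sketch below assumes the normal-approximation formula with a multiplier of 1.28 for 80% limits; `relative_half_width` is an illustrative name, and this ratio is not a measure defined in the source.

```python
import math


def relative_half_width(events: int, n: int, z: float = 1.28) -> float:
    """Half-width of the normal-approximation interval as a fraction of the estimate."""
    p = events / n
    return z * math.sqrt(p * (1 - p) / n) / p


screened = 300
print(f"cancer detection rate: ±{relative_half_width(3, screened):.0%} of the estimate")
print(f"recall rate: ±{relative_half_width(27, screened):.0%} of the estimate")
```

For the same 300 women, the 80% limits span roughly three-quarters of the estimate either side for the cancer detection rate but only about a quarter for the recall rate, because 27 recalls carry far more information than 3 cancers.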
The formulae given above are most accurate with larger numbers and least accurate when the numbers are very small. When small numbers of women are screened and wide confidence limits are encountered, we can conclude that the measurement is not useful. This demonstrates the clear advantage of reporting confidence intervals: they indicate not only the precision of the measure (the narrower the confidence interval, the more precise the measure) but also whether the measurement is useful. As noted, the 80% confidence limit may be more useful for proactive QA, even though the 95% limits are the most commonly used in scientific studies.