Guidance

Interim guidance on incorporating artificial intelligence into the NHS Breast Screening Programme

Published 17 May 2021

Applies to England

This interim document has been developed to facilitate discussion about the use of artificial intelligence (AI) within the NHS Breast Screening Programme.

1. General background

The UK National Screening Committee (UK NSC) is responsible for making recommendations on major modifications to screening programmes.

The UK NSC has a published process for considering programme modifications.

The process recognises that screening programmes are not static and will be subject to change over time. The purpose of the process is to enable stakeholders to submit proposals to modify programmes. The process focuses on major modifications.

To initiate the process, some basic information about the proposed modification is requested, including:

  • a statement on the element of the programme to be modified
  • published evidence addressing and supporting the outcomes expected from the modification in a relevant population – outcomes of interest may include increased test sensitivity or specificity, increased uptake of screening, or an increase or reduction in the number of screening rounds
  • published evidence addressing the wider pathway impact arising from the modification – this may include the identification of new types of abnormality as a consequence of a new test, identification of incidental findings not previously detected, or revised cost-effectiveness estimates
  • a flowchart to illustrate the flow of people through the various stages of the screening pathway including the expected outcomes
  • no more than 10 references addressing and supporting the modification

Proposals for modifications relating to the tests, or components of tests, used in screening programmes should address these information requirements. This includes AI in the NHS Breast Screening Programme.

The UK NSC is currently developing guidance focusing on modifications relating to the tests, or components of the tests, used in screening programmes. The aim is to produce an algorithm linking the type of modification to the kind of evidence required by the Committee. The purpose of this will be to assist those considering research with a view to modifying a programme and those who wish to submit proposals to modify a programme.

This document is an interim statement, based on the discussion to date, specifically addressing AI for use in mammography.

2. Test-specific background

Screening is more than just a test, but testing pathways are central to screening programmes. Modifications in this area could affect different points of the screening, diagnostic and surveillance pathway. Modifications of the testing pathway will vary in scale and significance.

Some technology providers, for example providers of AI products, are likely to develop improvements and upgrades throughout the life of the product.

Routine upgrades which do not fundamentally alter the performance of the test would represent a minor modification. As such, they would not require resubmission to the programme modification process.

Modifications of greater significance will require a reapplication to the programme modification process. Key considerations in this regard are accuracy and impact.

Some AI will be designed to improve over time. This represents a challenge to evaluation processes. Where AI learning is continuous, periodic system-wide updates following evaluation of clinical significance would be preferred to continuous updates, which may result in ‘drift’. However, these products will require ongoing performance monitoring in line with the UK Government’s Code of conduct for data-driven health and care technology.
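
To make this preference concrete, the sketch below shows an evaluate-then-promote gate, in which a candidate model version replaces the deployed version only after its performance has been assessed. The `ModelVersion` structure, the promotion rule and all figures are hypothetical illustrations, not part of any UK NSC or regulatory requirement.

```python
# Hypothetical sketch of periodic, gated model updates (evaluate, then
# promote system-wide) rather than continuous on-line learning.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVersion:
    version: str
    sensitivity: float
    specificity: float

def clinically_acceptable(candidate: ModelVersion, deployed: ModelVersion) -> bool:
    """Promote only if the candidate is at least as good on both measures."""
    return (candidate.sensitivity >= deployed.sensitivity
            and candidate.specificity >= deployed.specificity)

deployed = ModelVersion("v1.0", sensitivity=0.86, specificity=0.92)
candidate = ModelVersion("v1.1", sensitivity=0.88, specificity=0.92)

if clinically_acceptable(candidate, deployed):
    deployed = candidate  # one system-wide switch, after evaluation
print(f"deployed version: {deployed.version}")
```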

Two major categories of test-related modifications are anticipated. Those which:

  • lead to the detection of approximately the same spectrum of disease as the existing test
  • lead to the detection of a different spectrum of disease, and so require evaluation of the changes to the balance of benefits and harms of screening

The UK NSC needs to be in a position to distinguish between these basic categories of test-related modifications. The Committee also requires an evidence base sufficient in volume, type and quality to evaluate the proposed modification.

The major modifications covered by this process are likely to be a step change from the current tests. So, interaction and advice from clinical leaders (through programme and UK NSC advisory structures) as well as stakeholder engagement will be needed.

3. Evidence requirements for artificial intelligence in breast screening

Whilst it is acknowledged that AI systems contain complex algorithms that cannot be easily displayed or described in detail, some attempt should be made to describe their basic functionalities, including which personal characteristics (such as age) and imaging features are being used.

In terms of performance measures, analytic validity is a necessary prerequisite but will not be a sufficient level of evidence for the UK NSC to recommend AI for use in the screening programme. Where AI detects the same spectrum of disease in approximately the same proportions as the current screening programme, only evidence on clinical validity is necessary. Where a test detects a significantly different spectrum of disease, clinical utility will also be assessed.

In all evaluations a range of study designs may be included, and study quality, bias and confounding will be assessed as per standard UK NSC processes.

3.1 Test accuracy (clinical validity)

Test accuracy estimates should be obtained from studies using un-enriched UK datasets which were not used in any way to train the system, with pre-set algorithms that do not change during the study. These datasets should be representative of the target population for screening.

PHE Screening is aware that a set of test cases will be required for this purpose.

Where possible, direct (sometimes called paired or within-person) comparisons of accuracy should be made between different AI systems on the same women, with double reading plus arbitration as the comparator. Test sets should be sufficiently large that the lower ends of the 95% confidence intervals for both sensitivity and specificity represent acceptable performance.
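
As a minimal illustration of this sizing requirement, the sketch below computes the lower bound of a 95% Wilson score confidence interval for sensitivity and specificity and checks it against a pre-specified floor. The case counts and the floors (0.85 and 0.90) are illustrative assumptions, not UK NSC standards.

```python
# Sketch: does the lower bound of the 95% CI represent acceptable
# performance? Counts and acceptability floors are hypothetical.
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower limit of the Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return centre - half

sens_lb = wilson_lower_bound(46, 50)     # 46 of 50 cancers detected
spec_lb = wilson_lower_bound(930, 1000)  # 930 of 1000 normals not recalled

print(f"sensitivity 95% CI lower bound: {sens_lb:.3f}")
print(f"specificity 95% CI lower bound: {spec_lb:.3f}")
print("acceptable:", sens_lb >= 0.85 and spec_lb >= 0.90)
```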

Accuracy to detect cancer overall should be reported, using biopsy-proven screen-detected and interval cancers as the reference standard. Analyses at clinically relevant thresholds are more useful in evidence review than area under the ROC curve. Accuracy to detect cancer subtypes (such as by grade, prognostic indices, size or nodal involvement), and detection of interval cancers (detected symptomatically in the interval between screening rounds, and which may have been missed by human readers), should also be reported. In the evidence review the UK NSC will consider not only accuracy, but the types of cancer detected and the potential benefits and harms associated with each. The potential for harm through overdiagnosis of clinically insignificant disease will be considered in the evidence review.
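
To illustrate why operating-point reporting is preferred, the sketch below computes sensitivity and specificity at candidate recall thresholds alongside a single AUC figure, which summarises all thresholds at once and so says nothing about the threshold actually used in the clinic. The scores, labels and thresholds are synthetic and do not represent any real AI system.

```python
# Synthetic illustration: threshold-level metrics vs a single AUC figure.
import numpy as np

rng = np.random.default_rng(42)
cancer = rng.integers(0, 2, 5000)  # 1 = biopsy-proven cancer (synthetic)
scores = np.where(cancer == 1,
                  rng.normal(0.7, 0.15, 5000),
                  rng.normal(0.4, 0.15, 5000))

# AUC = P(random cancer score > random normal score): a summary over all
# thresholds, which hides performance at any single operating point.
pos, neg = scores[cancer == 1], scores[cancer == 0]
auc = (pos[:, None] > neg[None, :]).mean()

for threshold in (0.5, 0.6):  # hypothetical recall thresholds
    recall = scores >= threshold
    sens = recall[cancer == 1].mean()
    spec = (~recall)[cancer == 0].mean()
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"AUC={auc:.2f}")
```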

Consideration will be given to which digital mammography systems the evidence for each AI system is generalisable to. If there is only sufficient evidence of accuracy or impact in use with a single digital mammography system, the AI system may be recommended for use with that system only. Our preference is for AI systems which are generalisable to all systems used in the UK.

Detection of interval cancers is a key outcome, because preventing interval cancers is associated with benefit, and by definition is not associated with harm from overdiagnosis. Therefore, prediction of interval cancers as an outcome should be a priority. Improvement in specificity is also a key outcome of interest to the UK NSC.

Analysis of AI test accuracy by population subgroup (such as age or ethnicity) would be welcomed by the UK NSC.
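
A minimal sketch of such subgroup reporting follows, assuming a per-woman results table; the column names and values are hypothetical placeholders.

```python
# Hypothetical per-woman results table; sensitivity and specificity
# are reported separately for each age band.
import pandas as pd

df = pd.DataFrame({
    "age_band":  ["50-54", "50-54", "55-59", "55-59", "60-64", "60-64"],
    "cancer":    [1, 0, 1, 0, 1, 0],  # reference standard
    "ai_recall": [1, 0, 0, 0, 1, 1],  # AI recall decision
})

for band, g in df.groupby("age_band"):
    sens = (g.loc[g.cancer == 1, "ai_recall"] == 1).mean()
    spec = (g.loc[g.cancer == 0, "ai_recall"] == 0).mean()
    print(f"{band}: sensitivity={sens:.2f}, specificity={spec:.2f}, n={len(g)}")
```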

If the role of the AI system involves interaction with the mammography readers (radiologists, radiography advanced practitioners or breast clinicians) then accuracy will depend on this interaction. Therefore, a recommendation for implementation will require additional evidence of how the AI system performs in practice, interacting with the mammography reader. If this is assessed using test sets, there is a high risk of bias because reader behaviour differs between test sets and clinical practice, particularly if the test set is enriched with extra cancer cases. Such evidence is unlikely to be sufficient for a recommendation for implementation.

A less biased study design to evaluate screening test accuracy where there is human interaction is the randomised test accuracy study (sometimes called a randomised controlled trial of test accuracy), in which women are randomised to receive either the existing pathway (double reading plus arbitration) or the new proposed pathway which includes AI. Sometimes these studies have multiple intervention arms for different AI systems, which is preferable for evidence review. These studies measure test accuracy in real-world settings and provide the opportunity to measure differences in interval cancers.

Examples of roles where there is substantial interaction with the mammography reader include assisting the reader in interpreting the mammograms by providing prompts to examine certain areas, or by providing a probability of abnormality. In these cases, ergonomics (human factors) has a significant impact on test accuracy.

Examples of roles where there is some human interaction include:

  • to act as the second reader, with disagreements resolved by arbitration
  • to identify women whose mammograms indicate a very low risk of cancer and therefore do not require radiologist review; this may affect radiologists’ behaviour through changes to prevalence, case mix, or expectation
  • to replace arbitration, which may then affect radiologists’ behaviour by changing what readers 1 and 2 recall, through expectation changes regarding arbitration

3.2 Test impacts (clinical utility)

In some circumstances test accuracy alone is insufficient evidence for implementation, and additional evidence is required.

If the spectrum of disease detected using the AI system significantly differs from current clinical practice, then the benefits and harms of detecting the different spectrum of disease must be evaluated. The rate of interval cancers, or longer-term follow-up to clinical outcomes, would be the only acceptable outcomes in these circumstances. For example, increased detection of low-grade DCIS would require investigation of the potential to increase overdiagnosis. The ideal evidence would be from test-treat trials randomising women to either the standard pathway (2 readers plus arbitration) or the new proposed pathway, with follow-up to clinically significant outcomes. In some circumstances it would be acceptable to use a linked evidence approach to explore the benefits and harms of detecting a significantly different spectrum of disease.

The impact of the AI system on workflow and capacity will be considered in the evidence review. This includes measurement of the downstream consequences of AI, such as false positive results and referral rates to subsequent testing and treatment pathways. Any AI system that results in increased recall rates is unlikely to be cost effective, due to the costs and time associated with the assessment of women recalled with screen-positive results.
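
As a back-of-envelope illustration of this point, the sketch below estimates the downstream assessment cost of a small rise in recall rate. All figures (cohort size, recall rates, unit cost) are hypothetical placeholders, not NHS costings.

```python
# Hypothetical figures only: cost of additional recalls after introducing AI.
screened = 100_000              # women screened per year (placeholder)
baseline_recall = 0.040         # recall rate, current pathway (placeholder)
ai_recall = 0.045               # recall rate, AI pathway (placeholder)
cost_per_assessment = 250       # GBP per recalled woman (placeholder)

extra_recalls = screened * (ai_recall - baseline_recall)
extra_cost = extra_recalls * cost_per_assessment
print(f"{extra_recalls:.0f} extra recalls, ~£{extra_cost:,.0f} in assessment costs")
```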

In the event of test-treat trial evidence showing an AI system to be superior to, and more cost effective than, double reading plus arbitration, that system may be recommended for use.

The standard of evidence for alternative AI systems from the same or different manufacturers will be as follows. Where the alternative AI system has demonstrated, in a UK-based within-person test accuracy study, equivalent or improved test accuracy and detection of approximately the same or a more clinically significant spectrum of disease compared to the previous AI system, a test-treat trial will not always be necessary for recommendation of the new AI system. Instead, a linked evidence approach will be considered. This is contingent on considerations of human interaction (where there is substantial interaction a linked evidence approach is often inappropriate) and on whether clinician behaviour would differ when using the alternative AI system.

Training mechanisms should, where possible, prioritise learning from interval cancers, and should not learn equally from all cancer subtypes, to avoid the potential for drift towards detection of less clinically significant disease.
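
One way such prioritisation could be expressed is through per-sample weights during training, as in the sketch below. The model, features and weight values are purely illustrative assumptions, not an endorsed training method.

```python
# Illustrative only: upweight interval cancers so the model learns
# preferentially from them. Features, labels and weights are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                 # placeholder image features
y = rng.integers(0, 2, 1000)               # 1 = cancer
is_interval = rng.random(1000) < 0.1       # flag: interval cancer

weights = np.ones(1000)
weights[(y == 1) & is_interval] = 5.0      # hypothetical upweighting factor

clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```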

4. Incorporation and piloting

Where a major change is proposed and approved by the UK NSC, piloting and evaluation in the service are necessary to assess how the change will work in practice and to establish any unexpected effects.

Because automated systems may replace human staff, any new AI system may have an irrevocable impact on NHS capacity for that specific job. Ongoing effectiveness needs to be established before staff are released for other tasks.

Pilots should be undertaken as an evaluation of the product, to demonstrate that the technology works as described and to identify potential pitfalls in practical deployment. This is in contrast to generating new knowledge regarding test accuracy or clinical effectiveness measures.

Piloting should be delivered across a range of services which cover a variety of populations to ensure that the learning from the pilot is applicable to the system as a whole. Prior to piloting, a clear evaluation plan should be established which clearly states the outcomes of interest and assures that:

  • the patient information and consent processes are satisfactory (including compliance with GDPR requirements for automated processing)
  • the necessary requirements for local (or outsourced) IT systems exist, including support, security, data capacity and IG compliance
  • the system is functioning as expected
  • incident response plans are satisfactory
  • outputs are accurate and are achieving satisfactory quality standards
  • staff redeployment is successful

It is also worth considering indemnity issues for AI systems used within an NHS trust, such as whether the AI manufacturer is liable for decision-making mistakes that have negatively impacted women, where the mistake is due to coding or other direct software errors.

Evaluation plans should be established in advance and approved by the relevant bodies in the countries in which the pilot will be undertaken. For example, in England the PHE Screening Division (Research Advisory Committee) and local commissioners will be asked to approve the pilot plan. A final evaluation report should be produced by the oversight group. This will be considered by the UK NSC and should be circulated for comments to the service commissioners and PHE Screening Division.

5. Research governance

In England, all research studies which require support from screening programmes should be submitted to the relevant PHE Screening Division Research Advisory Committee (RAC).

The RAC’s role will be to:

  • confirm the validity of requests for access to relevant datasets for researchers
  • review research applications and facilitate important projects which warrant screening programme support
  • identify and highlight potential adverse effects of projects
  • assure that ethical approvals and governance are in place