GOV.UK Data Labs (Cabinet Office): Related Links

Related Links is a recommendation engine built to aid navigation of GOV.UK by providing relevant onward journeys from a content page.

Tier 1 – Overview

Name

GOV.UK Related Links

Description

Related Links is a recommendation engine built to aid navigation of GOV.UK by providing relevant onward journeys from a content page.

The tool uses an algorithm called node2vec to train a model on the last 3 weeks of user movement data (web analytics data). The model is used to predict related links for every page. These new related links are published to GOV.UK.

The tool is used to help users find useful information and content, aiding navigation.

GOV.UK has approximately 600,000 pieces of content. Previously, related links were created manually, and only approximately 2,000 pieces of content had them. The tool extended related links to nearly all GOV.UK content.

For further questions:

Ganesh Senthi ganesh.senthi@digital.cabinet-office.gov.uk

URL of the website

https://github.com/alphagov/govuk-related-links-recommender

https://apolitical.co/solution-articles/en/machine-learning-government-algorithm

https://snap.stanford.edu/node2vec/

https://dataingovernment.blog.gov.uk/2019/06/19/a-or-b-how-we-test-algorithms-on-gov-uk/

Contact email

GOV.UK Data Labs:

ganesh.senthi@digital.cabinet-office.gov.uk mohamed.abdisalam@digital.cabinet-office.gov.uk

Tier 2 – Owner and Responsibility

1.1 Organisation/ department

GOV.UK - Government Digital Service

1.2 Team

GOV.UK Data Labs (part of the Data and Insights Group, Product and Technology Office)

1.3 Senior responsible owner

n/a

1.4 Supplier or developer of the algorithmic tool

The tool was built in-house using open source tooling. No external organisation was involved.

1.5 External supplier identifier

n/a

1.6 External supplier role

n/a

1.7 Terms of access to data for external supplier

n/a

Tier 2 – Description

2.1 Scope

The tool was designed to populate most pages on GOV.UK with up to five related links. Every user sees the same links; the related links are not personalised.

2.2 Benefit

The tool predicts related links for each page. These links help users find the content they are looking for, as well as content that is tangentially related to the page they are on. It's a bit like looking for a book in a library: you might find other relevant books on adjacent shelves.

2.3 Alternatives considered

Previously, related links were chosen manually. As a result, only about 2,000 pages on GOV.UK had related links, and those links remained static unless someone updated them by hand.

98% of pages on GOV.UK did not have related links.

2.4 Type of model

We used node2vec, which is a machine learning algorithm that learns network node embeddings.

The way users move around GOV.UK is represented as a graph and is used as input by the algorithm. The nodes represent pages and the edges represent user movement (where an edge exists if at least 5 "users" moved between those nodes in the last three weeks). The hyperlinks between pages were also included as edges. We train a model using three weeks of user movement data. This model can then be used to predict related links for a page (cosine similarity is used to identify similar pages).
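As general background on node2vec (not a description of the team's code), the algorithm learns embeddings by sampling biased random walks over the graph and passing them to a skip-gram model. The sketch below shows only the walk-sampling step. The function name, the dictionary-of-sets graph representation and the default values of the return parameter p and in-out parameter q are assumptions made for illustration.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased random walk, node2vec style.

    graph: dict mapping a page to the set of pages it shares an edge with
           (a hypothetical in-memory representation).
    p:     return parameter; a lower p makes stepping back more likely.
    q:     in-out parameter; a lower q makes moving further away more likely.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(graph.get(current, ()))
        if not neighbours:
            break  # dead end: no outgoing edges from this page
        if len(walk) == 1:
            # The first step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous page
            elif candidate in graph.get(previous, ()):
                weights.append(1.0)      # stay close to the previous page
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk
```

Many such walks, started from every page, would then be treated as "sentences" of page paths when training the embedding model.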

Refer to the blogs and the GitHub repository above for extra detail.

2.5 Frequency of usage

The tool updates links every three weeks and thus tracks changes in user behaviour.

The average click-through rate for related links is about 5% of visits to a content page. For context, GOV.UK supports an average of 6 million visits per day (Jan 2022). True volumes are likely higher, because analytics data is collected only from visitors who consent to tracking.

2.6 Phase

The tool is in production.

Date first live: May 2019

2.7 Maintenance

We developed a way for publishers to add/amend or remove a link from the component. On average this happens two or three times a month.

Every three weeks, the model is retrained on recent data. The related links it outputs are then published, overwriting the existing links.

A manual change made by a publisher can be temporary, or permanent if the page is submitted for the deny list.

2.8 System architecture

https://github.com/alphagov/govuk-related-links-recommender

Tier 2 – Oversight

3.1 Process integration

The decision process is fully automated.

3.2 Provided information

n/a

3.3 Human decisions

Humans have the capability to recommend changes to related links on a page. There is a process for links to be amended manually and these changes can persist.

3.4 Required training

The tool is deployed automatically every three weeks. Humans aren't in the loop for deployment.

3.5 Appeals and review

GOV.UK has a feedback link, "report a problem with this page", on every page which allows users to flag incorrect links or links they disagree with. Publishers are also able to submit pages for a deny list or temporary removal.

Tier 2 – Information on data

4.1 Source data name

Web analytics data, exported from Google Analytics to BigQuery (a data warehouse) for querying. This data was aggregated to produce the "user movement data". The hyperlinks between pages were also used as input for training the algorithm.
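The aggregation step can be pictured with a small sketch. It is written in plain Python rather than as a BigQuery query, and it assumes the analytics export can be reduced to one ordered list of page paths per session. That input shape and the function name are hypothetical, as is the choice to count every consecutive page pair.

```python
from collections import Counter


def count_movements(sessions):
    """Count page-to-page movements across sessions.

    sessions: iterable of ordered lists of page paths, one list per
              session (an assumed, simplified form of the analytics data).
    Returns a Counter keyed by (source_page, target_page).
    """
    counts = Counter()
    for pages in sessions:
        # Pair each page with the page viewed immediately after it.
        for source, target in zip(pages, pages[1:]):
            counts[(source, target)] += 1
    return counts


movements = count_movements([
    ["/vat-rates", "/vat-returns", "/pay-vat"],
    ["/vat-rates", "/vat-returns"],
])
```

Only these aggregated counts, not individual sessions, would be needed by the later graph-building step.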

4.2 Source data

The node2vec algorithm takes user movement data as input. This is derived from our web analytics data of how users move around the site. This data is represented as a graph, where the nodes are pages and the edges are user movement between those pages. Edges are ignored if they have fewer than five movements in the last three weeks. The hyperlinks between pages are also included in this graph as edges.
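A minimal sketch of how the graph described above could be assembled follows. The threshold of five movements and the inclusion of hyperlink edges come from the description; the function name, the directed (source, target) edge tuples and the example paths are assumptions for illustration.

```python
MIN_MOVEMENTS = 5  # movement edges with fewer movements are ignored


def build_edges(movement_counts, hyperlinks, min_movements=MIN_MOVEMENTS):
    """Combine user-movement edges and hyperlink edges into one edge set.

    movement_counts: dict mapping (source_page, target_page) to the number
                     of movements seen in the three-week window.
    hyperlinks:      iterable of (source_page, target_page) pairs.
    """
    # Keep only movement edges that meet the threshold.
    edges = {
        pair for pair, count in movement_counts.items()
        if count >= min_movements
    }
    # Hyperlink edges are included regardless of movement counts.
    edges.update(hyperlinks)
    return edges


edges = build_edges(
    {("/vat-rates", "/vat-returns"): 5, ("/vat-rates", "/pay-vat"): 4},
    [("/pay-vat", "/contact-hmrc")],
)
```

In this toy input, the movement edge with four movements is dropped, while the hyperlink edge is kept.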

The data is used to train the model (there are plenty of blogs online that explain how node2vec works). The model is used to predict, for each page, which other pages are most similar to it. Using cosine similarity, we produce a sorted list of pages that could be recommended for each page, and we use the five most similar pages as the related links. Some pages cannot be output as related links. These include pages that might be deemed insensitive, such as Air Accidents Investigation Branch reports. A curated deny list is in place, which was compiled using content designer expertise and feedback.
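The ranking step can be sketched as follows, assuming each page already has an embedding vector from the trained model. The function names, the dictionary of embeddings and the toy two-dimensional vectors are hypothetical; the top-five cut-off and the deny list come from the description above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_links(page, embeddings, deny_list=frozenset(), top_n=5):
    """Return up to top_n pages most similar to `page`, skipping denied pages."""
    target = embeddings[page]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != page and other not in deny_list
    ]
    # Highest similarity first; ties are broken by path for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:top_n]]


toy_embeddings = {
    "/a": [1.0, 0.0],
    "/b": [0.9, 0.1],
    "/c": [0.0, 1.0],
    "/d": [1.0, 0.5],
}
links = related_links("/a", toy_embeddings, deny_list={"/d"})
```

Filtering the deny list before ranking means a denied page never takes up one of the five slots.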

4.3 Source data URL

n/a

4.4 Data collection

The purpose of conducting web analytics on GOV.UK is to enable GOV.UK / GDS to obtain a comprehensive view of how people interact with GOV.UK and to identify improvements that can make those interactions simpler and easier for users.

This is achieved by collecting performance analytics data from GOV.UK visitors who have provided their consent, into a GOV.UK-managed Google Analytics account.

In order to achieve this, GOV.UK deploys some Google Analytics code which places cookies on a user's device and sends the performance analytics data to the GA account. Each cookie placed in a browser is assigned a unique ClientID allowing its activities to be tracked across multiple GOV.UK visits.

GOV.UK analysts then analyse the performance data to identify potential improvements that can be made to GOV.UK.

The GOV.UK web analytics data is used for secondary purposes like related links. We automatically export this data to BigQuery on Google Cloud Platform, a data warehouse. We query this data warehouse and use aggregated data to train a model.

See 5.2 for more detail.

4.5 Data sharing agreements

n/a

4.6 Data access and storage

As described in 4.4.

Tier 2 – Risk mitigation and impact assessment

5.1 Impact assessment name

A Data Protection Impact Assessment exists for GOV.UK Web Analytics generally. Related Links uses this anonymised data in aggregated form, so it was considered to fall under this DPIA.

5.2 Impact assessment description

The impact assessment mentioned in 5.1 concluded that the purpose of conducting web analytics on GOV.UK is to enable GOV.UK / GDS to obtain a comprehensive view of how people interact with GOV.UK and to identify improvements to make those interactions simpler and easier for users.

This is achieved by collecting performance analytics data from GOV.UK visitors who have provided their consent, into a GOV.UK-managed Google Analytics account.

In order to achieve this, GOV.UK deploys some Google Analytics code, which places cookies on a user's device and sends the performance analytics data to the GA account. Each cookie placed in a browser is assigned a unique ClientID allowing its activities to be tracked across multiple GOV.UK visits.

GOV.UK analysts then analyse the performance data to identify potential improvements that can be made to GOV.UK.

The GOV.UK web analytics data is used for a secondary purpose - creating related links on most GOV.UK pages.

5.3 Impact assessment date

n/a

5.4 Impact assessment link

Internal document DPIA21-4581900.

5.5 Risk name

A recommendation engine can produce links that could be deemed wrong, useless or insensitive by users (e.g. links that point users towards pages that discuss air accidents).

5.6 Risk description

A recommendation engine can produce links that could be deemed wrong, useless or insensitive by users (e.g. links that point users towards pages that discuss air accidents).

5.7 Risk mitigation

We maintain a deny list of pages that might not be useful to a user (such as the homepage) or that might be deemed insensitive (e.g. air accident reports).

We also enabled publishers or anyone with access to the tagging system to add/amend or remove links.

Published 29 February 2024