Official Statistics

Linking police and health data on road collisions: an initial feasibility study

Published 25 September 2025

Applies to England

About this report

This report summarises initial work to link the Department for Transport’s data on road casualties reported to police (known as STATS19) with healthcare data, in what is hoped will be the first stage of a wider project. This report demonstrates the feasibility of linking police and ambulance data for a small subset of data for the South West region.

1. Main findings

The coverage of the two datasets linked is different, and in particular, STATS19 data relates only to those collisions which occur on the public highway, with some incidents excluded as set out in the guidance on completing STATS19. Therefore, it is not possible to draw definitive conclusions from this analysis as it is not always possible to determine whether incidents in the ambulance data were in scope, and the collisions studied are not representative of all those that occur across the whole country.

However, this feasibility study demonstrates that:

  • despite a lack of common identifiers, it is possible to link STATS19 data with ambulance service records based on location, time and non-identifiable patient characteristics (age and sex)
  • around half of casualties in the ambulance service dataset that appear to be in scope of the police collection are linked to STATS19, suggesting police data understates the number of road traffic casualties, even for what are likely to be relatively more serious collisions
  • in particular, pedal cyclists appear to be less well recorded in the police data, particularly when no other party is involved in the collision

Although these findings are consistent with previous related studies, including the department’s own work to link police and trauma care data, this is the first study completed in collaboration with the Pre-hospital Research and Audit Network, PRANA, and it is hoped to develop this work to link to more health datasets, for a longer time period, and for a wider geographic area.

2. Background

Currently, the main dataset for monitoring trends and patterns in road casualties in Great Britain is based on data collected by police, via a system known as STATS19. This data provides the basis for DfT’s published road casualty statistics. While overall the police reported data provides a valuable dataset for analysis:

  • it has long been known that STATS19 is not a complete record of non-fatal road collisions (even for those that are in scope)
  • STATS19 lacks detail on post-collision outcomes, including detailed clinical severity

Routinely linking police data with other datasets, notably health data, would help to address these limitations. However, while DfT and others have successfully linked STATS19 to individual datasets in the past, there is currently no sustainable, comprehensive linkage of this nature, due to the challenges set out by the RAC Foundation in their report ‘Joining the Dots’.

To address these challenges, a working group including representatives from NHS bodies, DfT and other interested stakeholders was established, with a joint statement published supporting the PRANA work. One of the outcomes of this group was to work with the PRANA project to develop data linkage for road traffic collision (RTC) casualties.

2.1 PRANA

PRANA is a national registry, based within the NHS Wessex Secure Data Environment (SDE), which collects and links care pathway information on ill and injured patients and is funded by several organisations [footnote 1] including NHS England. While the scope of PRANA is wider than those injured following an RTC, PRANA is leading work on two RTC-related projects, which are:

  • Data Sustains Life (DSL), an in-depth study being funded by the Road Safety Trust and DfT, carried out in conjunction with TRL, to prospectively link health data with the Road Accident In-Depth Studies (RAIDS) database
  • Linking Police and Hospital Data on road casualties (LPHD), work with DfT to link historic data from STATS19 with a range of health datasets

To facilitate these projects, PRANA has obtained necessary approvals, including from the Health Research Authority (HRA) and NHS England, and has created a workspace within the NHS Wessex SDE for analysis of specific anonymised data which can be accessed by approved researchers, including the DfT road safety statistics team. More detail on the HRA support that the PRANA programme is working within is available at www.prananet.org.

2.2 Aims of Linking Police and Hospital Data on road casualties (LPHD)

The LPHD project aims to retrospectively link STATS19 data to available healthcare datasets to improve the quality of published road safety data and statistics and ultimately support the development of improved road safety policies, with the specific aims including:

  • to assess completeness of police recorded data and whether this varies over time and by police force to validate trends and patterns
  • to assess accuracy of injury severity recorded by police: this could inform guidance provided to reporting officers, as well as publication of statistics based on clinically-recorded severity
  • to facilitate new insight from road casualty data (including published statistics) including on longer-term outcomes and cost of collisions

2.3 Aims of this study

This study represents the first stage of the LPHD project. It aims to establish the feasibility of data linkage within the PRANA secure environment, and to provide initial insight into the first 2 of the wider aims.

Further work is planned to build on this initial study, once more data is available within PRANA, as summarised in the next steps section below.

3. Data sources

This work links police data from STATS19 with a subset of ambulance data as provided to PRANA by the South Western Ambulance Service NHS Foundation Trust, SWAST.

3.1 Ambulance data

The data provided by SWAST relates to incidents attended by ambulance service enhanced and critical care assets, which are helicopter and rapid response car teams equipped to deliver advanced trauma care to patients at the scene to improve chances of survival, following 999 calls. This represents a subset of all incidents at which ambulances attend, and is likely to cover relatively more serious incidents.

The data covers incidents attended in the SWAST region, broadly the South West of England, from 2017 onwards. To ensure a manageable number of cases for this pilot study, the period 2019 to 2022 was analysed. 

As RTCs are not directly identified, the dataset provided was filtered by the PRANA team, based on the incident description, to include only RTC cases. This resulted in a total of 2,376 records, of which 1,274 were in scope for this study, that is, occurred between 2019 and 2022. This includes some records unlikely to be in scope of STATS19 (for example, where incidents occurred off-road).

These records had all personal identifiers removed by the PRANA team prior to making the anonymised data available to the DfT road safety statistics team within a workspace inside the NHS Wessex SDE. This data flow was achieved according to the support provided by the HRA.

The dataset was cleaned for linkage by removing records with missing data for key linkage variables (location, patient age or patient sex) and likely duplicate records were flagged. Following this data cleaning, a total of 1,150 records were used for record linkage.
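As an illustration only, the cleaning step described above could be sketched as follows. The record structure, the field names and the definition of a likely duplicate (identical values on every linkage variable) are assumptions made for this example, and are not a description of the code used in the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AmbulanceRecord:
    # Hypothetical field names; the real SWAST extract may differ.
    record_id: str
    easting: Optional[float]
    northing: Optional[float]
    call_time: str  # timestamp kept as text for simplicity
    age: Optional[int]
    sex: Optional[str]


def clean_for_linkage(records):
    """Drop records missing a key linkage variable and flag likely duplicates.

    Returns the retained records and the ids flagged as likely duplicates.
    Flagged records are kept, mirroring the report's "flagged" wording.
    """
    kept = []
    seen = set()
    duplicate_ids = set()
    for r in records:
        # Location, patient age and patient sex are the key variables.
        if any(v is None for v in (r.easting, r.northing, r.age, r.sex)):
            continue
        # Assumption: a likely duplicate agrees on every linkage variable.
        key = (r.easting, r.northing, r.call_time, r.age, r.sex)
        if key in seen:
            duplicate_ids.add(r.record_id)
        seen.add(key)
        kept.append(r)
    return kept, duplicate_ids
```

In practice a looser duplicate rule (for example, allowing small differences in call time) might be preferable, but that would need to be checked against the data.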

3.2 Police data

The police data used was a subset of the STATS19 dataset published by DfT, downloaded into PRANA via the STATS19 R package. Details of the collection and coverage of STATS19 are available from the background quality information. For this study, it is important to note that STATS19 covers incidents on the public highway, and excludes some casualty types including suicides. Full details of the coverage of STATS19 are set out in the STATS20 guidance.

4. Linkage methodology

Main finding: This initial study demonstrates that it is possible to achieve reasonable results with a relatively small number of linkage variables, all of which (for STATS19) are available in the published open dataset, though further work could be done to explore scope for improvement.

There is no unique identifier common to both datasets. However, there are common variables which should have sufficient power to discriminate between true and false record linkages, and these have been used to link the datasets for this study.

4.1 Linkage variables

There are several variables common to each dataset that can be used as the basis for linkage, including:

  • location (recorded as eastings and northings in each dataset)
  • date and time (of call in SWAST data, and of collision in STATS19)
  • casualty age and sex (or gender in SWAST, which is considered equivalent for linkage)

While collision location recorded in STATS19 is known to sometimes be imprecise, it is well-populated and the combination of time and place should be sufficient to identify the same collision in each dataset, with the assumption that serious incidents (as attended by ambulance service enhanced and critical care assets) are relatively rare – that is, it is unlikely that there will be two (different) serious collisions in a similar place, at a similar time. Casualty details can then be used to identify the most likely correct casualty within a collision where there may be more than one.
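For illustration, the straight-line distance between two recorded locations might be computed as below. This sketch assumes that eastings and northings in both datasets are expressed in metres on the same grid; the function name is hypothetical.

```python
import math

METRES_PER_MILE = 1609.344  # international mile


def straight_line_miles(easting_1, northing_1, easting_2, northing_2):
    """Euclidean distance between two grid references, converted to miles."""
    metres = math.hypot(easting_2 - easting_1, northing_2 - northing_1)
    return metres / METRES_PER_MILE
```

For example, two points 3,000 metres apart in easting and 4,000 metres apart in northing are 5,000 metres apart in a straight line, which is a little over 3 miles.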

4.2 Approach

For this linkage, the availability of a detailed location in each dataset means that a ‘rules-based’ approach has been used, rather than a fully probabilistic method. The approach was firstly to generate candidate matches, based on exact agreement on:

  • date
  • casualty sex

These candidate matches were then selected based on the difference in the remaining linkage variables, namely distance (calculated as the straight-line distance between the two locations, in miles), time (in minutes) and age (in years), with the following thresholds and precedence:

  • distance within 2 miles
  • time within 120 minutes
  • age within 5 years
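The rules above can be sketched in code as follows. This is a minimal illustration rather than the study's implementation: the `Casualty` structure and its field names are hypothetical, coordinates are assumed to be in metres, and the thresholds are treated as inclusive, which the report does not specify.

```python
import math
from dataclasses import dataclass
from datetime import datetime

METRES_PER_MILE = 1609.344
MAX_MILES, MAX_MINUTES, MAX_YEARS = 2.0, 120.0, 5


@dataclass(frozen=True)
class Casualty:
    # Hypothetical structure used for both SWAST and STATS19 records.
    record_id: str
    easting: float
    northing: float
    when: datetime
    age: int
    sex: str


def candidate_links(ambulance, police):
    """Return (ambulance_id, police_id, miles, minutes, years) tuples.

    Pairs must agree exactly on date and sex, then fall within the
    distance, time and age thresholds.
    """
    links = []
    for a in ambulance:
        for p in police:
            # Exact agreement on date and casualty sex.
            if a.when.date() != p.when.date() or a.sex != p.sex:
                continue
            miles = math.hypot(a.easting - p.easting,
                               a.northing - p.northing) / METRES_PER_MILE
            minutes = abs((a.when - p.when).total_seconds()) / 60
            years = abs(a.age - p.age)
            if miles <= MAX_MILES and minutes <= MAX_MINUTES and years <= MAX_YEARS:
                links.append((a.record_id, p.record_id, miles, minutes, years))
    return links
```

Comparing every pair of records is adequate for a dataset of this size. One consequence of requiring exact agreement on date is that incidents close to midnight could be missed if the two sources record them on different dates.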

The final step was to de-duplicate. For example, the same STATS19 casualty record may be the closest match for more than one SWAST patient record. To remove these, the record with the closest match (as defined above) was chosen. The same approach was used to remove any cases where the same SWAST record represented the best match for multiple STATS19 records (this was more common, often where there were several casualties with similar ages in a collision in STATS19, only one of which was in the SWAST data).
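A sketch of this de-duplication step is given below. It assumes each candidate link is a tuple of (SWAST id, STATS19 id, distance in miles, time difference in minutes, age difference in years), and it interprets "closest" as smallest distance, then smallest time difference, then smallest age difference. Both the tuple layout and that interpretation of the precedence are assumptions made for illustration.

```python
def best_per_key(links, key_index):
    """Keep only the closest link for each id found at position key_index."""
    best = {}
    for link in links:
        key = link[key_index]
        closeness = link[2:]  # (miles, minutes, years), compared in that order
        if key not in best or closeness < best[key][2:]:
            best[key] = link
    return list(best.values())


def deduplicate(links):
    """Reduce candidate links to at most one per STATS19 and per SWAST record."""
    # First pass: a STATS19 record keeps only its closest SWAST record.
    one_per_police = best_per_key(links, key_index=1)
    # Second pass: a SWAST record keeps only its closest STATS19 record.
    return best_per_key(one_per_police, key_index=0)
```

Under this approach, a SWAST record that loses its only candidate in the first pass remains unlinked, which is one way that missed matches can arise.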

4.3 Results

A total of 379 SWAST records were linked (33% of the 1,150 total records). For these records, the agreement on the linkage variables is shown in chart 1; in each case, there was relatively little difference. The exception was for time difference, where the times in SWAST appear to be recorded in GMT, causing a discrepancy of around an hour for incidents during the months when BST applies (which was not adjusted for prior to linkage).
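A minimal sketch of one possible adjustment for this discrepancy is shown below. It assumes that SWAST times are in GMT and that STATS19 times are UK local time, neither of which is confirmed here. It relies on the rule that British Summer Time runs from 01:00 GMT on the last Sunday of March to 01:00 GMT on the last Sunday of October; the function names are hypothetical.

```python
import calendar
from datetime import datetime, timedelta


def last_sunday(year, month):
    """Day of the month on which the last Sunday of the month falls."""
    last_day = calendar.monthrange(year, month)[1]
    weekday = datetime(year, month, last_day).weekday()  # Monday=0, Sunday=6
    return last_day - (weekday + 1) % 7


def gmt_to_uk_local(gmt):
    """Convert a naive GMT datetime to naive UK local time."""
    bst_start = datetime(gmt.year, 3, last_sunday(gmt.year, 3), 1, 0)
    bst_end = datetime(gmt.year, 10, last_sunday(gmt.year, 10), 1, 0)
    if bst_start <= gmt < bst_end:
        return gmt + timedelta(hours=1)
    return gmt
```

Applying a conversion of this kind before linkage would allow a tighter time threshold to be tested, although whether that improves the linkage would need to be checked against the data.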

Chart 1: Degree of agreement on linkage variables, for final linked records

4.4 Validation

The linkage methodology was assured by manual inspection of the linked and non-linked SWAST records, to assess both the likelihood of incorrect links (‘bad matches’), and incorrect non-links (‘missed matches’). Further details are provided in the annex.

Based on this review, the likelihood of a linked record representing the same casualty is deemed to be high, of the order of 98 to 99% of the linked records. The extent of the missed linkages is likely to be higher, which should be kept in mind when interpreting the data. These results are broadly comparable with previous work carried out by DfT, for example when linking STATS19 to trauma care data, and are considered to be sufficiently robust for this initial analysis – though further work might deliver further improvements, in particular to capture some of the missed linkages.

4.5 Representativeness within STATS19

It should be noted that, due to the nature of the SWAST data, these linked records are not representative of the full STATS19 dataset, for example covering a small proportion of incidents in a small number of police force areas as shown in table 1.

Table 1: Linked records by police force area, number and share of total

Police force         Linked records Total STATS19 killed or seriously injured (KSI) casualties Linked records as a % of KSIs
Devon and Cornwall              107                                                      2,969                          3.6%
Avon and Somerset                97                                                      1,856                          5.2%
Gloucestershire                  29                                                      1,322                          2.2%
Wiltshire                        54                                                      1,242                          4.3%
Dorset                           82                                                      1,650                          5.0%
Others                           10
Total                           379                                                      9,039                          4.2%

As a result, the incidents in the SWAST dataset cannot be considered representative of STATS19 as a whole. For example, the linked casualties are more likely to:

  • be killed or seriously injured (80% compared with 21% in STATS19 as a whole)
  • occur in rural areas (78% compared with 36%)
  • occur on roads with speed limits of 60mph or above (47% compared with 21%)
  • occur away from junctions (61% compared with 43%)

These differences, which reflect the nature of incidents attended by ambulance service enhanced and critical care assets (which are more likely to be relatively serious) as well as the fact that the dataset relates to areas in the South West (which, for example, may be relatively rural), should be kept in mind when interpreting the results in the following sections.

5. Results: linkage rate

Main finding: Based on this small subset of data, roughly half of the casualties recorded in the ambulance data which were deemed to be within scope of police reporting were linked to STATS19, with a lower rate of linkage for pedal cyclists, particularly when no other vehicle was involved in the collision.

The first aim of this work is to use the linked data to explore the completeness of the police data, by exploring how many of the casualties in the SWAST data, that are within scope of STATS19, are linked. This assumes that the linkage is entirely accurate, and that the SWAST records within scope of STATS19 can be identified, neither of which are likely to be true. However, any discrepancies in the linkage, or the identification of RTC casualties in the SWAST data are considered unlikely to invalidate the broad conclusions drawn.

5.1 Overall linkage rate

Overall, 33% of the records in the SWAST dataset were linked to STATS19. However, it is likely that many of the patients in the dataset were injured in collisions which are outside the scope of STATS19, which relates to collisions on the public highway involving at least one vehicle, and excludes some types of incident (including, for example, where a person dies as a result of a medical episode in a vehicle)[footnote 2].

It is not possible to determine with certainty which patients recorded in the SWAST dataset were injured in collisions which were in scope of STATS19, as sometimes details were limited to brief phrases such as ‘RTC’. However, a best guess was made based on the details available (in particular, free text descriptions), and on this basis, around half of the patients likely to be in scope were linked to STATS19.

At face value this would suggest that STATS19 contains half of the records it should do, for these relatively more seriously injured casualties. However, this is a small (and unrepresentative) subset of the overall number of collisions, and this conclusion assumes the record linkage is accurate. Therefore this finding is best considered as broadly indicative and a basis for further work.

Table 2: Number and proportion of linked records, by whether collision is STATS19 reportable

Road traffic collision deemed within scope of STATS19? Linked to STATS19 Not linked Total SWAST records % linked
Likely in scope                                                      367        395                 762      48%
Not in scope                                                           2        291                 293       1%
Uncertain                                                             10         85                  95      11%
Total                                                                379        771               1,150      33%
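The percentages in table 2 follow directly from the counts. As a worked check, the short sketch below recomputes them, with the counts copied from the table; the variable and function names are illustrative only.

```python
# (linked, not linked) counts for each scope category, taken from table 2.
TABLE_2 = {
    "Likely in scope": (367, 395),
    "Not in scope": (2, 291),
    "Uncertain": (10, 85),
}


def percent_linked(linked, not_linked):
    """Linked records as a percentage of all records in the category."""
    return 100 * linked / (linked + not_linked)


rates = {scope: round(percent_linked(*counts)) for scope, counts in TABLE_2.items()}
total_linked = sum(linked for linked, _ in TABLE_2.values())
total_not_linked = sum(not_linked for _, not_linked in TABLE_2.values())
overall_rate = round(percent_linked(total_linked, total_not_linked))

print(rates)         # {'Likely in scope': 48, 'Not in scope': 1, 'Uncertain': 11}
print(overall_rate)  # 33
```

The gap between the overall rate and the rate for records likely to be in scope shows how much the headline figure depends on which records are judged to be within scope of STATS19.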

The focus of this analysis is on those records which are potentially in scope of STATS19, so that the following sections are based on the 762 such records. The remaining records – around a third of the SWAST data – may be of interest in understanding the relative share of non-STATS19 reportable collisions involving vehicles, but are not further analysed here.

5.2 Factors affecting linkage rate

It is possible to explore which factors in the SWAST data are associated with relatively lower rates of linkage. Assuming that the linkage is sufficiently robust, this would indicate the types of collisions – within this subset – which may be more likely to be under recorded in the police data.

Road user type: chart 2 shows the variation in linkage rate by the road user type of the casualty, as coded from the incident details within SWAST. This shows a notably lower proportion of records linked when the patient was a pedal cyclist, consistent with previous studies.

Chart 2: Proportion of records linked by road user type

Figures in brackets denote the total number of records in the dataset for each road user type.

Road user type [number of records] Proportion linked
Motorcycle rider [212]                              55%
HGV occupant [11]                                   55%
Car occupant [295]                                  51%
Bus occupant [2]                                    50%
LGV occupant [11]                                   45%
Other [16]                                          44%
Pedestrian [96]                                     44%
Pedal cyclist [96]                                  36%
Unknown [23]                                        17%

Number of vehicles in collision: The nature of the collision also appears to be associated with different rates of linkage, in particular a lower proportion of single-party collisions (those not involving collision with another vehicle, or a pedestrian) are linked – overall, 26% compared with 56% of collisions involving more than one vehicle, or a vehicle hitting a pedestrian.

In particular, there is a notably low rate of linkage for single vehicle pedal cycle collisions; again, this is consistent with the findings of previous studies.

Chart 3: Proportion of records linked by selected road user type and type of collision

Road user type Single vehicle collision Multi-vehicle collision
Car occupant                        39%                     57%
Motorcyclist                        21%                     63%
Pedal cyclist                        5%                     59%

Time period: The linkage rate was explored by year (as shown in table 3), month and hour of day. There were no clear patterns seen, and the number of records becomes relatively small when broken down in this way, so that no clear conclusions can be drawn. The number of records is lower for 2020, a period which includes COVID-19 related lockdowns which are known to be associated with a lower number of road casualties during that period.

Table 3: Number and proportion of linked records, by year

Year Linked to STATS19 Not linked Total SWAST records % linked
2019                 94          90                  184       51%
2020                 76          78                  154       49%
2021                 94          98                  192       49%
2022                103         129                  232       44%

Patient demographics: The majority of patients in the SWAST dataset were male (555, compared to 207 female), reflecting the fact that more RTC casualties are men. However, there was little difference in the rate of linkage by gender. Similarly, there was not an obvious pattern in linkage rates by age.

Severity (based on STATS19 definitions): The nature of the incidents in the SWAST dataset – those where advanced clinical care teams are sent – means that they are more likely to fall within those considered serious (or fatal) within STATS19. However, from the descriptions, there were a number where, for example, there was severe damage to vehicles, but the injuries sustained were relatively minor – and thus coded in STATS19 as slight.

For the unlinked records, an attempt was made to assign severity based on the STATS19 definitions[footnote 3] as far as possible, though this would benefit from further work. However, based on this initial crude assessment, there is a notably higher linkage rate for those who are killed. Indeed, it is believed that the majority of fatalities are recorded within STATS19, and the majority of those that were not linked appear to be related to death from medical episodes which would not be within scope.

Table 4: Number and proportion of linked records, by estimated severity

Estimated severity (based on STATS19 definitions) Linked to STATS19 Not linked Total % linked
Fatal                                                            34         12    46      74%
Serious                                                         227        171   398      57%
Slight                                                           45         39    84      54%
Uncertain                                                        57        167   224      25%

5.3 Conclusion

This is an initial feasibility study, based on a small dataset which is not representative of the country as a whole, so that the results should be interpreted with caution. However, the findings above are broadly consistent with related studies and suggest:

  • a non-trivial proportion of non-fatal road traffic collision casualties are not recorded in police data
  • levels of reporting are lower for pedal cyclists, in particular when in single vehicle collisions
  • there is not a clear relationship between levels of reporting and time, or patient demographics

This work further confirms the value in exploring other datasets to develop a fuller understanding of the number and nature of road casualties, and it is hoped that this can be done through the PRANA programme.

6. Results: severity coding

Main finding: Ambulance data includes some information which can be used as a basis to assess how well police code severity of injury, though further work is needed to explore this fully.

A second aim of the LPHD project is to assess accuracy of injury severity recorded by police. Within this study, it has not been possible to fully explore this area but the dataset provides a basis for further analysis and a brief initial look is presented below.

6.1 Clinical information within SWAST and STATS19 datasets

The SWAST data contains anonymised information captured when ambulance service enhanced and critical care assets attend incidents; by their nature, these are likely to be initial impressions rather than final diagnoses. Relevant anonymised data fields include:

  • working impression
  • working impression detail (free text)
  • clinical care detail (free text)

The level of detail captured varies greatly between records; some do not contain anything, while others have detailed notes covering the circumstances, clinical assessment and procedures carried out (which have been suitably anonymised). In general, further work is required to extract data from within these anonymised free text descriptions, but there is potential to derive much about the pre-hospital condition and treatment of casualties.

In STATS19, injury severity has historically been assessed by reporting officers as killed, seriously injured or slightly injured based on their judgement (which may in turn be based on liaison with others at the scene of a collision). In recent years, many forces have adopted an injury-based approach to severity recording, whereby the reporting officer selects from a list of 20 injuries and overall severity is coded based on the most serious injury recorded (within the SWAST area, Devon and Cornwall and Gloucestershire are areas that were using the injury-based approach for the period studied). While this approach is believed to have improved the consistency of severity recording, it still requires judgement of officers not medically trained, and is relatively high level. It is hoped that the linked data might allow an assessment of how the police coding compares with the clinical detail.

6.2 Working impression recorded in SWAST data

As an initial illustration, table 5 shows the working impression detail, to illustrate the level of detail that is captured and coded within the SWAST dataset.

This could be compared with the STATS19 severity and type of injury, where recorded, to assess how the police coding compares with the initial clinical impression. Further work could then explore the detailed clinical severity, based on metrics such as injury severity score, if this data can be linked to other healthcare datasets.

Table 5: 10 most commonly coded SWAST working impression categories for linked records

Working impression            Count
Major Trauma Criteria Met        34
Head Injury (Closed)             31
Trauma - Multisystem             29
Chest Injury - Blunt             21
Trauma - Other                   19
Leg Injury                       17
Cardiac arrest                   17
Abdo or Pelvic Injury - Blunt    13
Leg Fracture                     12
Head Injury (Open)                9

6.3 Conclusions and further work

As noted above, while the linked dataset should facilitate comparison of clinical and police assessment of severity, this requires further work to be undertaken to draw meaningful conclusions. This work should be possible in collaboration with the PRANA programme hosted by the NHS Wessex SDE.

SWAST data contains much more detailed clinical information than is present in STATS19, and should it be possible to link this to other health datasets – such as hospital admissions – that would provide further data and insight into the clinical outcomes of road collisions.

7. Overall conclusions and next steps

This study demonstrates the feasibility of linking STATS19 data on RTCs with a subset of ambulance data for one region, establishing that such a linkage is possible, even when based on non-patient identifiable information. 

The initial findings, while not novel, provide further confirmation that police data on road collisions is incomplete for non-fatal casualties, and illustrate the value of exploration of other datasets to improve the evidence base.

This approach is necessary while there is no common identifier recorded across police and healthcare datasets; were this to be possible, for example through sharing of incident numbers, then a more reliable linkage could be established. However, this remains a longer-term ambition. In the meantime, this work represents what is hoped to be the first stage within a more ambitious project within the PRANA environment to develop linkage between police and hospital data for RTC casualties across England. As such, further work is planned, which includes, subject to the availability of data:

Further development of the SWAST data linkage. It is hoped to further explore the linkage between SWAST and STATS19 data, including:

  • further review of the matching methodology to assess scope for improvement
  • further analysis and development of the linked dataset, for example extending to add more years or a wider range of incidents attended by the ambulance service
  • more detailed review of the clinical severity information, in conjunction with clinicians
  • engagement with police forces and local authorities based in the SWAST region to explore the findings, including reasons why records in SWAST may not appear in STATS19

Extension to other datasets. It is hoped to extend the linkage to cover additional healthcare datasets, in particular:

  • linkage of data for the SWAST region to hospital admissions and trauma patient data
  • linkage of national level data where this is available, to check that the findings hold for a wider area

Detailed analysis to provide new insight into collision risk and costs. The third aim of the LPHD project is to use the linked data to provide new insight into collision risks and costs. This is something that can be considered further now that the feasibility of data linkage has been established.

It is hoped that, in time, wider access to the linked data can be provided to allow accredited researchers to analyse the data in support of this aim.

We welcome any feedback on this initial work, and suggestions for further analysis.

Acknowledgements: DfT are grateful to Professor Chris Kipps and Professor James Batchelor and the NHS Wessex SDE team and Dr Phil Hyde and the PRANA team for their support in facilitating this data linkage and anonymised data access within all appropriate national governance. We are also grateful to Dr Sarah Black and Dr Matt Thomas and the SWAST Research and Development group for providing the PRANA programme with the ambulance data used in this initial study.

8. Annex: Background information

8.1 Linkage methods – validation

This section provides brief details of the validation carried out on the record linkage. 

Validation of linked records: The 379 linked records were manually reviewed, using other information (such as incident description) to assess whether the records were likely to represent the same incident. In the majority of cases – 345 – it was deemed likely that the link was correct, based on agreement on road user type (for example cyclist, motorcyclist), position in vehicles (for example driver or passenger) and nature of incident (for example, whether single vehicle or multi-vehicle). There were a further 31 records where there was insufficient information to assess (3 of which contained some conflicting information), and 3 which were likely to represent an incorrect linkage. Of these 3, 2 were linked records related to different collisions (where the SWAST collision was not in STATS19, but another sufficiently similar incident was) and 1 where there were multiple casualties in the same collision in STATS19 and the wrong one linked.
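One way to read these counts is sketched below. It is an interpretation for illustration, and not necessarily how the figure of 98 to 99% quoted in the validation section was derived: the upper bound assumes only the 3 likely incorrect links are wrong, and the lower bound additionally treats the 3 records with conflicting information as wrong.

```python
LINKED = 379
LIKELY_CORRECT = 345
INSUFFICIENT_INFO = 31  # includes the 3 with conflicting information
CONFLICTING = 3
LIKELY_INCORRECT = 3

# The three review outcomes should account for every linked record.
assert LIKELY_CORRECT + INSUFFICIENT_INFO + LIKELY_INCORRECT == LINKED

# Upper bound: only the likely incorrect links are wrong.
upper = 100 * (LINKED - LIKELY_INCORRECT) / LINKED
# Lower bound (assumption): the conflicting records are also wrong.
lower = 100 * (LINKED - LIKELY_INCORRECT - CONFLICTING) / LINKED

print(f"{lower:.1f}% to {upper:.1f}%")  # 98.4% to 99.2%
```

A stricter reading, counting only the 345 records positively judged correct, would give 91%; the difference reflects the records for which there was too little information to make an assessment.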

Validation of unlinked records: To assess the extent of possible missed matches among the 771 unlinked records, a manual review of a small sample of 30 records was carried out, listing all STATS19 records on the same date and in the same police force area. This revealed several cases where a likely match had been missed due to an age difference or time difference outside the allowed thresholds. Further work would be required to quantify this, but as a broad guide it could be in the range of 5 to 10% of the unlinked records.

Of course, this manual review involves a degree of subjectivity, in particular in the search for missed matches, where even a manual review may miss true linkages in cases where, for example, key linkage variables are inaccurate. Additional work would help to further assure the linkage, and identify areas for improvement in the overall linkage rate.

  1. Besides DfT and the Road Safety Trust for the DSL project, PRANA is funded by: NHS England Data for Research and Development, Southampton Biomedical Research Centre, Wessex Secure Data Environment, Wessex Health Partners, Wessex Experimental Medicine Network, Wessex Applied Research Collaboration and the University of Southampton.

  2. Full details of the coverage of STATS19 can be found in the guidance document known as STATS20.

  3. Details of the classification of injury severity within STATS19 can be found in the published guidance.