Research and analysis

Summary of the final report

Published 16 May 2013

Applies to England, Northern Ireland and Wales

0.1 Introduction

Students in England take many public examinations: tests at age 11 in English, science and mathematics, perhaps 8 or more GCSEs at age 16, 3 or more A levels at age 18, as well as a wide range of vocational qualifications taken by candidates in schools, colleges and at work.

Public examinations have to be fair - it is Ofqual’s job to make sure that candidates get the results they deserve, and that their qualifications are valued and understood in society. Ensuring examination reliability is a key part of this - making sure that candidates obtain a fair result, irrespective of who marks their paper, what types of question are used (for example, multiple-choice or essay questions), which topics are set or chosen to be answered on a particular year’s paper, and when the examination is taken. This consistency of examination results is referred to as reliability, and from 2008 to 2010 Ofqual conducted the reliability of results programme: investigating the ‘repeatability’ of candidates’ results from one test to the next, in national tests, public examinations and other qualifications.

The reliability programme brought together a range of experts in public examinations, from research organisations and exam boards, to undertake a range of research activities in order to better understand examination reliability, and to help Ofqual develop its policy around regulation on reliability. This report provides summaries of each of the 20 major pieces of work undertaken in the project, including:

  • studies of examination reliability in national curriculum tests at key stage 2 (tests for 11 year olds in English, science and mathematics), GCSEs and A levels, as well as consideration of the reliability of teacher assessment and workplace qualifications
  • investigations into the different statistical methods which can be used to look at examination reliability
  • how information about examination reliability is currently provided to the public in the UK and other countries
  • investigations of the English public’s perception of unreliability in examinations
  • advice to Ofqual on whether and how it should regulate for reliability in examinations

0.2 How is examination reliability measured and reported?

Reliability calculations are made using statistical models developed and refined by statisticians and mathematicians over the course of the last 100 years or so. Modelling, in this sense, means using a mathematical representation (i.e. one or more equations) to describe (or model) assessment reliability. In doing so, each model makes assumptions about the nature of the assessments, the markers and the candidates (some models make bigger assumptions than others, and it is almost always impossible to establish whether the assumptions actually hold in practice).

The mathematical principles and formulae underpinning these approaches are fairly inaccessible to non-mathematicians (some of the introductory documents describe the principles), but the outcomes are generally comprehensible.
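
As an illustration of the kind of formula involved (this is textbook classical test theory, not a formula specific to the programme), the simplest models treat each observed score as a notional ‘true’ score plus random error, and define reliability as the proportion of observed-score variance that is not error:

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

Here X is the observed score, T the true score and E the error; a reliability of 1 would mean that scores contain no error at all.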

Research often considers the 2 areas of:

  1. internal reliability - looking at whether the test or tests are internally consistent (whether an individual candidate who does well on some questions tends to do similarly well on other questions covering the same topics) and whether the test content covers the domain fairly (a minimal calculation sketch follows this list).
  2. external reliability - looking at factors which would affect a candidate’s performance from one test occasion to the next (e.g. marker variability, different test papers from year to year).
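
For the internal-consistency strand, one widely used statistic is Cronbach’s alpha, which compares the variability of individual question marks with the variability of candidates’ total marks. The sketch below is purely illustrative (the marks are invented and this is not code from the programme):

```python
# Illustrative only: Cronbach's alpha from a small, invented table of item scores.

def cronbach_alpha(scores):
    """scores: list of candidates, each a list of per-question marks."""
    n_items = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Variance of each question's marks across candidates, and of the totals
    item_variances = [variance([cand[i] for cand in scores]) for i in range(n_items)]
    total_variance = variance([sum(cand) for cand in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Four candidates, three questions (invented marks)
marks = [
    [2, 3, 3],
    [1, 1, 2],
    [3, 3, 3],
    [0, 1, 1],
]
print(round(cronbach_alpha(marks), 2))  # between 0 and 1; higher = more internally consistent
```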

It is important to note that calculations of reliability are always estimates, and so care has to be taken when interpreting reliability analyses - researchers often use several models to analyse particular test features to increase confidence in the findings.

The reporting of reliability in assessment varies considerably around the world. Relatively few public examinations have reliability information published alongside results, perhaps because this requires careful handling to avoid public and media misinterpretation. Reliability reporting is most prevalent in the USA where it is provided more commonly for educational examinations than for professional or licensure examinations. Many US state school examinations provide parents with information about their child’s grade (or score band), the score itself and a measure of imprecision in the score, along with guidance on how to interpret this. Contrast this with the UK (for example), where GCSE candidates simply receive grade information.
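
Where a ‘measure of imprecision’ is reported, it often takes the form of a standard error of measurement (SEM) and a band around the reported score. As a textbook illustration (this is standard classical test theory, not taken from any particular US score report):

```latex
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}, \qquad
\text{reported band} \approx X \pm 1.96 \times \mathrm{SEM}
```

where the SEM is calculated from the spread of observed scores and the reliability coefficient.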

1. Research into the reliability of examinations in England

1.1 Key stage 2 national curriculum tests

In each year from 1995 to 2008, 11-year-olds took the national science test (along with national tests in English and mathematics), and each child was given a level for each subject showing their progress in the national curriculum.

For the science tests, as a trial in each year, a small number of children also took the test that was to be used in the subsequent year. This helped the test writers ensure that tests were of comparable difficulty each year, but it also allowed researchers to see whether children would have received the same level on both exams - a key measure of fairness. In 2004, 72% of children received the same level on the live test and the 2005 pre-test. In 2008, the figure had risen to 79% of children receiving the same result on the 2008 live test and the 2009 pre-test.
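
The headline figures above are simple agreement rates. A minimal sketch of the calculation (the level pairs below are invented, not the actual pre-test data):

```python
# Illustrative only: percentage of children awarded the same level on two tests.
live_levels    = [4, 5, 3, 4, 5, 4, 3, 5]   # level awarded on the live test
pretest_levels = [4, 4, 3, 4, 5, 5, 3, 5]   # level awarded on the following year's pre-test

same = sum(1 for a, b in zip(live_levels, pretest_levels) if a == b)
print(f"{100 * same / len(live_levels):.0f}% received the same level on both tests")
```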

Why wouldn’t all the children receive the same level from both tests? It is difficult in practice to produce tests which are absolutely ‘equal’ in this respect – writers have to choose topics from the curriculum to test on, and some topics may be harder than others (there are too many topics to cover them all and, if the same topics came up every year, students and teachers might not cover the topics that don’t come up). Markers try to mark tests consistently, but longer answers leave room for subjectivity, no matter how tight the mark scheme. Even the day and time of the examination could make a difference to how students perform.

For the 2009 pre-test, the researchers calculated that an estimated 11% of candidates were ‘misclassified’, i.e. given one level higher or lower than their ‘actual’ level. It is important to note here that ‘misclassified’ doesn’t mean that something has gone wrong in the marking process; it just means that the candidate would have got a different result on a test from a different year.

Separate work on the live 2009 and 2010 national tests in English, science and mathematics used six different statistical methods to calculate classification accuracy. These methods produced largely consistent results (the small differences relate to the different assumptions that each method makes). Estimates of classification accuracy were around 90% for mathematics, 87% for science and 85% for English. The classification accuracy for mathematics is probably higher than for English and science because mathematics answers tend to be simply right or wrong, reducing marker variation. The amount of estimated misclassification has fallen over the years of national testing, reflecting the fact that the assessment process becomes more reliable with experience (as, for example, mark schemes become clearer).
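
One way to understand what a classification accuracy figure means is a simple simulation (an illustrative sketch, not one of the six methods used in the studies; every number in it is invented): generate notional true scores, add random measurement error to produce observed scores, and count how often the observed score falls in the same level band as the true score.

```python
import random

random.seed(1)

def level(mark):
    # Invented level boundaries on a 0-100 mark scale
    if mark < 40:
        return 3
    if mark < 60:
        return 4
    return 5

def classification_accuracy(n_candidates=100_000, error_sd=5.0):
    correct = 0
    for _ in range(n_candidates):
        true_score = min(100, max(0, random.gauss(55, 15)))               # notional 'true' mark
        observed = min(100, max(0, random.gauss(true_score, error_sd)))   # mark actually awarded
        if level(observed) == level(true_score):
            correct += 1
    return correct / n_candidates

print(f"Estimated classification accuracy: {classification_accuracy():.0%}")
```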

1.2 GCSEs and A levels

GCSE and A level assessments are made up of units and components, with candidates taking several different examinations/assessment activities, from which their scores are aggregated to produce a final grade. Research into a range of qualifications from November 2008 to June 2009 showed that classification accuracy for a range of units (the proportion of the candidature placed at the correct grade) ranged from 50% to 70% (with 90%+ in either the correct grade or the adjacent grade above or below), but that classification accuracy for the qualification (i.e. when the various unit results are combined) would be substantially higher. The research also found that for units consisting of mostly short answer or structured response items with little room for marker interpretation, test-related unreliability was higher than marker-related unreliability. In other units, with longer answers and more complex mark schemes, marking unreliability (i.e. that one marker might award more or fewer marks for a candidate’s response) may be a greater factor.
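
The reason accuracy rises when unit results are combined can be shown with a simple simulation: random error on individual units partly cancels out in the aggregated total, so the total places candidates in the correct grade more often than any single unit does. The sketch below is purely illustrative (the grade widths, error size and candidate distribution are invented, and this is not the model used in the research):

```python
import random

random.seed(2)

GRADE_WIDTH = 10   # invented: each grade spans 10 marks per unit
ERROR_SD = 4.0     # invented per-unit measurement error (marking and test effects)
N_UNITS = 4        # invented number of units in the qualification

def grade(mark, n_units=1):
    # Grade bands are n_units times wider once unit marks are added together
    return int(mark // (GRADE_WIDTH * n_units))

def simulate(n_candidates=50_000):
    unit_correct = qual_correct = 0
    for _ in range(n_candidates):
        true_unit = random.uniform(10, 90)   # invented 'true' score on each unit
        observed = [random.gauss(true_unit, ERROR_SD) for _ in range(N_UNITS)]
        # Classify on a single unit's observed mark
        if grade(observed[0]) == grade(true_unit):
            unit_correct += 1
        # Classify on the aggregated qualification total
        if grade(sum(observed), N_UNITS) == grade(true_unit * N_UNITS, N_UNITS):
            qual_correct += 1
    return unit_correct / n_candidates, qual_correct / n_candidates

unit_acc, qual_acc = simulate()
print(f"Unit-level accuracy: {unit_acc:.0%}; qualification-level accuracy: {qual_acc:.0%}")
```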

1.3 Workplace-based qualifications

Workplace qualifications work differently from GCSEs and A levels. Candidates tend only to be entered when they are ready, and assessments tend to be based around ‘competency’ – the candidate is expected to complete almost all the assessment activities correctly, with the outcome being a pass or fail (not graded). Workplace assessment tends also to be based on either observing the candidate performing tasks, or looking at evidence of their performance, with no limit on the numbers of attempts, and encompassing a wide variety of settings and performance activities, all of which introduce potential unreliability of assessment.

The research looked at a small number of National Vocational Qualifications (NVQs), gathering additional data (over and above that normally produced for assessment) for the analysis, which showed that assessors have a very high level of agreement about candidates’ performance, but which also highlighted that much more data would need to be collected for vocational qualifications to allow these types of analyses to take place routinely.

2. Practicalities

There are unavoidable trade-offs in the design of an examination system. It may not be surprising, for example, that longer tests are shown to provide more reliable results, but awarding organisations have to make practical decisions both about fairness to candidates taking many exams and about the costs of setting and marking long papers. Similarly, the more grades that are available for an exam, the greater the unreliability associated with the grade, because each grade covers a smaller mark range and is therefore more susceptible to misclassification through marking error, for example.
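
The effect of test length on reliability is captured by the standard Spearman-Brown prophecy formula, given here as a textbook illustration rather than a formula used in the report: if a test with reliability ρ is made k times longer with comparable questions, its predicted reliability becomes

```latex
\rho_k = \frac{k\rho}{1 + (k - 1)\rho}
```

so, for example, doubling a test with reliability 0.80 would be predicted to raise it to about 0.89.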

Finally, a theme that occurs repeatedly throughout the reliability programme is the balance between reliability and authenticity (often called ‘validity’). For qualifications to be valued, they must test the skills and knowledge they claim to test, and these skills and knowledge must be valued by society. Multiple choice tests can be very ‘reliable’, in that tests can be statistically chosen to be balanced and marker variation is eliminated, but simply requiring a candidate to pick one of four options is not authentic for many situations – patients don’t turn up at clinics with a definitive list of possible ailments for the doctor to pick from. From a validity perspective, the experts working in the programme place great value on essay and long-answer questions (including, for example, worked problems in mathematics), particularly for assessments of academic subjects like GCSEs, A levels and national curriculum tests (the main subject of the Ofqual reliability programme) even though they present challenges for reliability.

3. Public perceptions of unreliability in examinations

The reliability programme looked at the public perception of unreliability in examinations, talking to teachers, parents and students through a series of workshops and surveys. It is clear that the public has a degree of understanding of unreliability, distinguishing, for example, between the various factors that can introduce measurement error into examinations. It is also clear that, although there is tolerance for inevitable variability in the process (for example, topic sampling and a degree of subjectivity in marking), there is little tolerance for ‘preventable errors’ such as markers not following the mark scheme, or adding up the marks wrongly, especially where these errors result in a student getting the wrong grade (not just the wrong score). Students and parents show a high degree of trust in the system - teachers less so, particularly where their involvement in examination appeals has shown them where assessment errors can occur.

Throughout the discussions, technical terminology presents problems - in everyday usage, ‘measurement error’ suggests a mistake has been made in the examination process, whereas the statistical meaning refers to the difference between the observed score and a notional ‘true’ score. Similarly, in everyday usage ‘reliability’ is perceived as an absolute (a test is either reliable or not), not a sliding scale of confidence as it is in statistics.

It is clear that providing information to the public about assessment reliability in an effective way is difficult. The concepts are hard to explain well, and unreliability can seem like an intrinsically bad news story with plenty of opportunity for misinterpretation. If assessment reliability information is to be published, it needs to be accompanied by resources to help the public understand what the information means.

4. Supporting the development of Ofqual’s policy on reliability

The reliability programme was created in recognition of the fact that there had been little sustained and systematic evaluation of the reliability of results from England’s assessment systems, and little understanding of the public’s knowledge of and attitudes towards unreliability in these results. The programme’s technical advisory group and policy advisory group made a number of recommendations which will be used as a basis to develop Ofqual’s policy on reliability:

  1. Ofqual should outline the primary purpose of each qualification and regulate against that purpose.
  2. Awarding organisations should publish their standard setting practices in order to make the regulation of reliability in standard setting more transparent.
  3. Awarding organisations should report on the reliability of assessments, using different measures for different types of assessments, but with consistency of approach between awarding organisations.

The programme has also made recommendations for further work:

  1. Assessment reliability information should be available in the public domain, provided both by awarding organisations as a routine part of their assessment monitoring and from investigative research by Ofqual. However, this information needs to be accompanied by public education activity to help with understanding the difficult concepts, and there needs to be capability within Ofqual to manage media coverage of the topic.
  2. Reliability information should be focused at the level that has impact for the public: reliability around qualification grades, for example, is more important than reliability around unit results or assessment scores.
  3. Ofqual should lead work to look at reliability measurement in non-examination assessment methods such as teacher assessment and workplace observation.