Research and analysis

Appendix 1: Ofqual’s reliability programme remit

Published 16 May 2013

Applies to England, Northern Ireland and Wales

Reliability, in educational assessment terms, can be defined as consistency. A high level of reliability means that broadly the same outcomes would arise were an assessment to be replicated. Given the general parameters and controls that govern the assessment process (including test/exam specification, administration conditions, approach to marking, standard setting methodology and so on), reliability concerns the impact of the factors that inevitably vary from one assessment to the next. These include:

  • the particular occasion (eg if assessed on another day, the student might have been less tired)
  • the particular test (eg if a different test/exam had been set, the student might not have been confused by the wording of an essay title)
  • the particular marker (eg if a different marker had been assigned, the student might have been marked down for using an unusual stylistic construction)
  • the particular standard setting panel (eg if a different team of people had been involved, different grade boundaries might have been set)
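The idea of reliability as consistency across replications can be illustrated with a small numerical sketch. The marks below are invented purely for illustration; the parallel-forms reliability coefficient is simply the correlation between the same students' scores on two versions of a test, computed here from first principles.

```python
# Illustrative only: invented marks for ten students on two
# parallel forms of the same test. The parallel-forms reliability
# coefficient is the Pearson correlation between the two sets of
# scores; a value near 1 indicates highly consistent outcomes.

def pearson(xs, ys):
    """Pearson correlation between two equal-length mark lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

form_a = [52, 61, 45, 70, 58, 66, 49, 74, 55, 63]
form_b = [50, 64, 47, 68, 60, 63, 52, 71, 53, 66]

reliability = pearson(form_a, form_b)
print(round(reliability, 2))  # prints 0.96
```

Even with a coefficient this high, individual students' marks differ between forms, which is why a single reported score never tells the whole story.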

In England, there has been little systematic, sustained effort to evaluate the reliability of results from national tests and examinations. The work that has been undertaken has been:

  • isolated (ie not part of routine monitoring)
  • partial (ie limited to certain sources of unreliability and to a small number of tests and examinations)
  • under-theorised (ie with little serious debate over the interpretation of evidence)
  • under-reported (ie not always published)
  • misunderstood by stakeholders, both inside and outside assessment agencies

A substantial programme of research into reliability will help to improve this situation. The project will consist of three strands:

  1. generating evidence of reliability
  2. interpreting evidence of reliability
  3. developing a policy on reliability

1. Strand 1: Generating evidence of reliability

1.1 Aim

  1. The aim of strand 1 will be to generate robust evidence of the overall reliability of results from a number of major national tests and/or examinations, estimating the degree of consistency associated with different aspects of the assessment process.

1.2 Methodology

  1. The precise methodology will be subject to discussion with assessment experts and agencies. Not all sources of inconsistency will necessarily be investigated, although there will be a particular focus on test-related and marker-related inconsistency. The primary focus of attention will be on reliability at the student level, although implications for reliability at the cohort level will also have to be considered given the widespread use of aggregate scores for comparative purposes at national, regional and local levels.

  2. Comprehensive estimates of reliability will require experimental simulation as well as the analysis of data which arise as a natural by-product of testing and examining. For example, to estimate the consistency of performance across test/exam forms, it may be necessary to administer alternative versions to the same students. To estimate the consistency of marking across scripts, it may be necessary to have batches of scripts marked by multiple markers. Ideally, these variables will be manipulated within a single experimental design.
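The multiple-marking design described above can be sketched numerically. The marks and the grade boundary below are invented for illustration only; two simple indicators of marker-related consistency are the average size of the mark differences and the proportion of scripts placed in the same grade by both markers.

```python
# Illustrative only: invented marks from two markers on the same
# batch of ten scripts, with a hypothetical pass/fail boundary of 50.
# Two simple consistency indicators: mean absolute mark difference,
# and the proportion of scripts awarded the same grade by both markers.

marker_1 = [48, 55, 62, 39, 71, 50, 44, 58, 66, 53]
marker_2 = [51, 54, 60, 41, 69, 47, 45, 60, 64, 55]
boundary = 50  # hypothetical grade boundary

n = len(marker_1)
mean_abs_diff = sum(abs(a - b) for a, b in zip(marker_1, marker_2)) / n

same_grade = sum(
    (a >= boundary) == (b >= boundary) for a, b in zip(marker_1, marker_2)
)
agreement = same_grade / n

print(mean_abs_diff)  # average size of marking differences: 2.0
print(agreement)      # share of scripts given the same grade: 0.8
```

Note how small mark differences flip the grade only for scripts near the boundary: marking consistency and grading consistency are related but distinct quantities.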

  3. It is desirable that, over time, such analyses will be undertaken across a range of subjects and for a range of tests, examinations and qualifications, covering both externally and internally assessed components. Reliability estimates inevitably differ across contexts, being sensitive to a range of factors, from the group of candidates entered to the design of the assessment process, so estimates for one instrument cannot necessarily be assumed to generalise to another. In the long term, this might imply the need for a monitoring programme, rather than occasional studies.

  4. In the short term, it would be wise to begin by focusing on a limited number of tests and/or examinations. Even starting with a small sample - perhaps English and mathematics tests at key stage 2 - the project will be substantial, complex and costly, due to the large number of variables to be manipulated experimentally.

2. Strand 2: Interpreting evidence of reliability

2.1 Aims

  1. The aims of strand 2 will be to stimulate, capture and synthesise technical debate on:

  • the interpretation of evidence from reliability studies
  • the communication of results from reliability studies

2.2 Methodology

  1. The interpretation and communication of evidence from reliability studies is a highly complex challenge which will require collaboration between assessment experts, agency representatives and communications specialists. It is likely that this strand will tackle the two aims sequentially, with assessment experts and agency representatives debating the interpretation of evidence from reliability studies before being joined by communications specialists to discuss the communication of results.

  2. It will be necessary to identify the comparators against which reliability evidence from England’s tests and examinations can be benchmarked. These might include alternative assessment models (ie different approaches to testing/examining or different approaches to teacher assessment), as well as test and examination systems from other countries which operate a similar approach to England’s.

  3. The debates will be undertaken during residential workshops, with participants provided with working papers in advance. Outcomes will be circulated for comment after each workshop, resulting in a series of published reports.

3. Strand 3: Developing a policy on reliability

3.1 Aims

  1. The aims of strand 3 will be to:

  • explore public understanding of, and attitudes towards, assessment inconsistency
  • stimulate national debate on the significance of the reliability evidence generated by the project
  • develop a policy position for Ofqual on reliability

3.2 Methodology

  1. Many myths are promoted (particularly within assessment circles) about how well the public understand assessment inconsistency, and how they will react to evidence of reliability, particularly when it is framed in terms of the percentage of students whose grades are likely to be incorrect. The reality is that we simply do not know what the public think and feel on this matter.

  2. This research will engage with members of the public - students, parents, employers and so on - through a series of surveys and focus groups, listening to their views and beliefs.

  3. The findings will be promoted more widely, through engagement with the national media and through discussion documents on the Ofqual website. These debates and discussions will inform the development of an Ofqual policy position on reliability. The policy is likely to address both how public and professional understanding of reliability can be improved, including the evidence that needs to be generated to inform this understanding, and a position with regard to how reliability affects the reporting of results.