Research and analysis

Summary of the final report of the Technical Advisory Group

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Defining the purpose of qualifications and assessments

There are several parties that have a claim to define the purpose of qualifications such as GCSE and GCE A levels; historically, however, none of them has done so, preferring to refer to a range of possible uses. The government sets frameworks for the structure, content and standards of the examinations to support its educational policy objectives. Some awarding organisations operate independently and have their own educational purposes, which they aim to promote through the qualifications they provide. Ofqual monitors the standards and quality of those qualifications. However, different views about what qualifications and assessments are supposed to measure have different implications for how their reliability is defined and measured, because they affect which of the various sources of variation in candidates’ results can properly be treated as error. Ofqual needs, therefore, to define the purpose against which it intends to measure the reliability of each qualification or assessment that it regulates.

2. Regulating reliability and publishing reliability information

It is important that qualifications and assessments provide measures that are both reliable and also consistent with their purpose (once this has been well defined), and it is reasonable to expect that intelligent and well-informed regulation of reliability will help to improve provision. Publication of information about the reliability of candidates’ qualification results is desirable both as a spur to improvement and also to enable users of the results to give them their proper weight in any decisions made with them.

3. Public confidence

At the same time, Ofqual also has a responsibility to promote public confidence in qualifications and assessments. Although the report supports the regulation of reliability and routine publication of reliability measures, it cautions that a judicious balance needs to be struck between publication and maintaining public confidence in the system. Media coverage is unlikely to acknowledge that less-than-perfect reliability is a fact of any assessment system, and could promote unrealistic public expectations of the system.

An important aspect of the publication of information about reliability is that the awarding bodies and Ofqual, separately and together, do not have complete control of the factors affecting it. Other agencies, including the Department for Education, are involved in decisions about qualification design which affect the reliability of those qualifications. Responsibility for reliability is therefore complex, and an explanation of how that responsibility is shared is essential background for any publication of reliability statistics if unfair conclusions about responsibility are to be avoided.

4. True scores and human judgement

In everyday terms, the reliability of a measurement is the extent to which it accurately represents the true value of whatever is being measured. In educational assessment, there is more than one way of defining candidates’ ‘true scores’. Classical test theory (CTT) and generalisability theory (G-theory) assume that a candidate’s true score is the average obtained over many replications of the assessment. Item response theory (IRT) assumes that attainment is a characteristic of the candidate (a ‘latent trait’, like intelligence or personality, which belongs to the candidate but cannot be observed directly), and that scores on that trait are therefore ‘true scores’. But none of these approaches enables us to directly observe a candidate’s true score, that is, the score that they really deserve. All qualification and assessment results depend crucially on examiners’ judgements about what questions should be asked and what value should be attached to different answers to them. With professional judgement at the heart of educational assessment, a certain amount of variability is inevitable.
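The classical test theory view described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a method taken from the report: it assumes each observed score is a true score plus independent, normally distributed error, and the names (`simulate_ctt`, `pearson`) and the chosen standard deviations are hypothetical. Under those assumptions, reliability is the share of observed-score variance due to true scores, and the correlation between two replications of the assessment estimates it.

```python
import random
import statistics


def simulate_ctt(n_candidates=2000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate two replications of an assessment under a simple CTT model.

    Assumption (illustrative only): observed = true + independent normal error.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_candidates)]
    form_a = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    form_b = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, form_a, form_b


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


true_scores, form_a, form_b = simulate_ctt()

# Reliability under the model: true-score variance / observed-score variance.
theoretical = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)

# The correlation between two replications estimates the same quantity.
estimated = pearson(form_a, form_b)

print(f"theoretical reliability: {theoretical:.3f}")
print(f"estimated from two replications: {estimated:.3f}")
```

In operational settings true scores are never available, which is why estimates of this kind have to rely on replications, parallel forms or internal structure rather than on a direct comparison with the score a candidate "really deserves".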

To be valid, educational assessments must cover a representative sample of the skills, knowledge and understanding which the curriculum is designed to help students learn, not simply those aspects which happen to be amenable to reliable assessment. Moreover, because the results of qualifications and other assessments have very important implications for students, teachers and schools, they strongly influence practice in classrooms, creating a ‘backwash’ from assessments to teaching and learning. This backwash means that tests and examinations must be designed, as far as possible, to promote effective learning. For example, machine-marked multiple-choice tests are highly reliable, but their ability to assess many aspects of the curriculum is weak – so their sole use as the basis for assessment would tend to promote teaching and learning of only some of the skills, knowledge and understanding which students need. Examinations and assessments therefore need to sample the curriculum in a reasonably comprehensive way if they are to be valid and avoid adverse educational backwash. In practice, this requires the use of assessment formats which involve human judgement and, inevitably therefore, compromises in terms of reliability.

5. The current state of knowledge about the reliability of qualifications and assessments in England

For validity reasons, therefore, we use a variety of assessment formats, some of which involve a substantial amount of inevitably fallible human judgement. This compromises the standards of reliability which can reasonably be expected for qualifications and assessments, to an extent which depends upon the particular formats used. Exactly what constitutes reasonable expectations has to be established empirically, through research into the maximum reliability which it is possible to achieve with different formats. The Ofqual reliability programme collected and generated a considerable amount of information about the reliability of existing qualifications and assessments in England. This included consideration of a range of different forms of reliability:

  • marking reliability - consistency between markers, and marker accuracy
  • internal reliability of the assessment - whether the test is consistent within itself
  • equivalent forms reliability - whether different assessments (e.g. GCSE mathematics from two different awarding organisations) provide consistent results
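As a rough illustration of two of the forms of reliability listed above, the sketch below computes Cronbach's alpha, a widely used index of internal consistency, and a simple exact-agreement rate between two markers. The report does not say which statistics the reliability programme used, so both choices are assumptions made for illustration, and the function names and sample marks are hypothetical.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of marks.

    item_scores: one row per candidate, one column per item (question).
    """
    n_items = len(item_scores[0])
    item_columns = list(zip(*item_scores))
    # Sample variance of each item across candidates.
    item_variances = [statistics.variance(col) for col in item_columns]
    # Sample variance of candidates' total scores.
    total_variance = statistics.variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def exact_agreement_rate(marks_a, marks_b):
    """Proportion of scripts on which two markers awarded the same mark."""
    agreed = sum(1 for a, b in zip(marks_a, marks_b) if a == b)
    return agreed / len(marks_a)


# Hypothetical marks: five candidates, three items each.
sample_marks = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
]
alpha = cronbach_alpha(sample_marks)
print(f"Cronbach's alpha: {alpha:.3f}")

# Hypothetical marks from two markers on the same four scripts.
agreement = exact_agreement_rate([3, 4, 5, 2], [3, 4, 4, 2])
print(f"Exact agreement between markers: {agreement:.2f}")
```

Neither statistic on its own captures equivalent forms reliability, which would need results from the same candidates on two different assessments.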

However, we still do not know enough about the factors affecting reliability statistics to be able to fully contextualise data from operational examinations. Clearly, to pursue an impossible level of reliability at great expense in terms of energy and resources would be as undesirable as allowing poor reliability to continue for lack of information about it.

Further work is needed to identify the levels of reliability which can be achieved with different assessment formats, including teacher assessment, and for different curriculum topics. On this basis, Ofqual should require greater consistency and control of assessment formats for new assessments, particularly those carried out in the workplace. Were this to happen, it would then be possible to contextualise measures of the reliability of operational qualifications and assessments and provide a sound technical basis to inform discussion of what should be assessed, and how it should be assessed, in the future.

6. Technical Advisory Group recommendations

The group makes a number of recommendations to Ofqual and awarding organisations:

  1. Ofqual should collect information on the reliability of assessments, including undertaking new studies for particular aspects of reliability. Where possible, published information should cover qualifications as a whole rather than their components. The reliability measurements should be published in a standard form.

  2. Ofqual should describe the primary purpose of each qualification and regulate against that purpose, allowing better judgements about reliability and fitness for purpose to be made.

  3. Ofqual should require awarding organisations to demonstrate aspects of reliability in their assessment design process. Greater consistency and control of assessments in workplace settings should be sought.

  4. Awarding organisations should publish their standard setting practices and information about the reliability of their teacher assessments to allow transparent regulation.