Research and analysis

Introduction to the concept of reliability

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Introduction

This is a non-technical introduction to the concept of reliability as used in educational assessment, including an explanation of how reliability is estimated and interpreted in the contexts of National Curriculum tests, General Qualifications (GCSEs and A levels), Vocational Qualifications and other assessments.

2. What do we mean by reliability?

Public examinations have to be fair - it is Ofqual’s job to make sure that candidates get the results they deserve, and that their qualifications are valued and understood in society. Ensuring examination reliability is a key part of this – making sure that candidates obtain a fair result, irrespective of who marks their paper, what types of questions are used (for example multiple choice or essay questions), which topics are set or chosen to be answered on a particular year’s paper, and when the examination is taken.

This consistency of examination results is referred to as reliability - the repeatability of results from one assessment to the next, be they assessments taken on different days, or from one year to the next.

In everyday use, “reliable” means “that which can be relied on”, but the technical definition in educational assessment is narrower: “the extent to which a candidate would get the same test result if the testing procedure were repeated”. In this technical sense reliability is a matter of degree rather than black or white, and it encourages us to consider how much candidates’ results would differ from one instance to the next.

3. What can cause unreliability in assessment results?

Awarding organisations have to consider a number of different sources of potential unreliability.

3.1 Inter-marker reliability

One exam marker might be more or less lenient on particular questions than the next (or even the same marker might be more or less lenient from one day to the next). Additionally, markers may make clerical mistakes – simply awarding the wrong mark or forgetting to include a mark in totalling the candidate’s score. Awarding organisations work hard to minimise this source of unreliability: they try to write very clear mark schemes, they train markers and check their work (a process called standardisation).

3.2 Variability in a candidate’s performance

A candidate’s performance on an exam might vary a little from one day to the next, particularly if the conditions of the exam change (morning or afternoon, who is administering the test, how well they slept the night before, whether the caretaker is mowing the lawn outside, whether the candidate has a headache, etc.). Awarding organisations can do little about this source of unreliability, although these factors do affect decisions about the length of exams, exam timetabling and the allowances made for candidates with special needs, for example giving candidates longer to write their papers if they have an injury that affects their writing.

3.3 Different examination papers

Different questions will appear from one exam paper to the next, and these might test different facets of the candidate’s understanding (tests usually sample from the curriculum because there is not enough time to test everything, and candidates may choose to revise one topic but not another). Exam setters try to minimise these effects by keeping question and option choices to a minimum and ensuring tests cover a balanced range of topics (although reducing question and option choices can restrict the coverage of a paper, which may be undesirable). Nevertheless we know that individual candidates’ performances are likely to vary somewhat depending upon which version of the examination they take.

3.4 Comparability of results from one year to the next

A Grade “A” Mathematics A level is often taken to mean the same thing for candidates completing in 2009 as it does for those completing in 2010. Why? As an example, grades are used to select candidates for entry to university, and in any one year students may be applying with results obtained in different years. So examination standards need to be maintained over time, and awarding organisations attempt to do this by examining exam scripts and using statistical evidence. However there are uncertainties in this process, particularly where there have been significant changes in the examination format from one year to the next (see the next section for an example of this).

3.5 Differences between examination specifications

Ensuring comparability over time can be challenging when there are changes to exam specifications and syllabuses. For example, with no previous evidence, how can examiners know whether a new topic or type of question is harder or easier than ones that were used before? Awarding organisations may need to rely here more on statistical data in order to ensure that the results are fair from one year to the next.

3.6 Different types of assessment activity

Many qualifications are made up of different types of assessment activity, for example most GCSEs include both examinations and controlled assessment (tasks of several hours undertaken in the classroom and marked by the teacher), and these different assessment methods present different types of assessment reliability challenges. Classroom assessments often have a wide range of available tasks so the mark schemes given to teachers must allow for easier and harder tasks. Awarding organisations also inspect teachers’ marking to ensure it is consistent from one teacher to the next.

3.7 Different types of questions

Candidates may perform differently depending on the type of questions they are given. For example, how can we assess someone’s historical knowledge accurately? A series of questions about dates of battles, names of kings and queens, etc. would assess their knowledge of facts, but what if a different set of questions were asked? Changing the type of question to ask students to reason about historical events – to explain why certain events happened the way they did – might well give a quite different picture of their attainment in history. What would be the effect of using multiple-choice questions instead of essay questions? Sometimes assessment questions inevitably measure skills that do not form part of what the assessment is intended to test – in a mathematics test, for example, candidates still have to read the question. This may not be a problem for a terse, pure mathematics question, but for a problem-solving question candidates have to read and understand the scenario before they can even start to apply their mathematical knowledge.

All these factors can have an effect on the consistency of results from one assessment to the next, and awarding organisations have to make sure that candidates’ results are fair and consistent, taking all these factors into account as far as possible.

4. Reliability is not the only factor in assessment quality

Awarding organisations must also consider the balance between examination reliability and authenticity (often called “validity”). For qualifications to be useful, as well as being reliable and fair, they must test the skills and knowledge they claim to test, and these skills and knowledge must be valued by society. Consider the history example above: a multiple choice test of history facts could be very “reliable” – avoiding issues of marker variability, and allowing a wide range of topics to be covered. But simply requiring a candidate to pick one of four options is not authentic for many situations. The examiner may want to test the candidate’s understanding of the historical context – why events are linked, themes that emerge, etc. Essay questions are much more suitable for this type of assessment. The same balance applies in vocational and professional assessment. Multiple choice tests may be convenient to set and mark, but patients don’t turn up at doctors’ clinics with a definitive list of possible ailments for the doctor to pick from!

The English examination system places great value on validity in assessment - essay and long answer questions, worked problems in mathematics, practical tests for skills, and a variety of assessment activities (exams, coursework, observation of team work, etc.) are frequently used in order to give the candidate the best chance to demonstrate what they can do, even though this approach can make the reliability challenge more difficult. Reliability is frequently viewed as an essential aspect of validity: unless the results are reliable, the test cannot be valid.

5. How can assessment reliability be measured?

Awarding organisations work hard to maximise the reliability of an assessment, but there is then a need to attempt to quantify the actual reliability achieved based on the assessment results.

Measurement of reliability involves repeating the same testing process on the same group of candidates, but this can be difficult in practice (alternative tests may not be available, candidates may not wish or be able to take two tests, it may be impossible to prevent learning taking place between the two tests) so reliability calculations are also made using models developed and refined by statisticians over the course of the last hundred years or so. Modelling, in this sense, means that they use equations to describe assessment reliability. In doing so, they make assumptions about the nature of the assessments, the markers and the candidates - these assumptions are essential to make the mathematics work, but they also mean that all measurement of reliability involves a degree of estimation. Researchers often use several different models to analyse particular test features, comparing findings to increase confidence in the conclusions.

The statistical modelling often describes candidates as having a notional “true score” – the score that represents their actual ability (the score they would get on average if they took lots of different tests), plus or minus an “error score” (different for each test instance), the total of the two being the “actual test score” (or “observed score”). The statistical approaches to reliability are complex, but we will consider three of the more important concepts here, with examples taken from real examinations.

First, though, it is important to note that the concept of “error” doesn’t mean that the test was managed badly with mistakes being made, it just means that the test results are different to what they might be on another test on another day – i.e. reflecting variability in observed performance.
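
As a sketch of this model (using the standard notation of classical test theory rather than anything specific to a particular awarding organisation), each observed score is written as a true score plus an error term, and reliability is the proportion of the variation in observed scores that reflects variation in true scores:

```latex
X = T + E,
\qquad
\text{reliability} \;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
\;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}
```

A reliability of 1 would mean that observed scores contain no measurement error at all; lower values mean that a larger share of the variation in observed scores is attributable to error.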

6. Internal reliability

Researchers often start by looking for internal reliability in a test: consider a test that is designed to measure a single educational concept, for example “fractions”, or “reading”. In a reliable test we would expect a good candidate to do better on ALL the questions than a weaker candidate. If, for a particular question, they don’t, then either the question is measuring something else, or the concept being tested is so broad that candidates can have differing profiles of skills for different aspects of the concept. A fractions question might be hard for a good mathematics student because it includes a lot of reading and they happen to be a poor reader. Another student might be good at adding and subtracting fractions, but poor at multiplying them.

It is relatively easy to look at how internally consistent tests are – for example it is possible to split the test questions into two halves and look at how the candidates have done on each half separately. This procedure can then be repeated for every possible combination of “halves” to give an average correlation between the two halves – a measure of the internal consistency of the test. As an example, in the case of the 2008 National Curriculum Reading pre-test at age 11, the internal consistency of 0.88 implies that 88% of the variation in pupils’ scores reflects variation in their true scores, and 12% is due to some form of measurement error.
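
As an illustration of the kind of calculation involved, the minimal sketch below computes Cronbach’s alpha, a widely used internal-consistency coefficient that can be thought of as an average over all possible split-halves. The item scores are invented, and this is not necessarily the exact statistic used for the test mentioned above.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency for a (candidates x items) matrix of item scores."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of candidates' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented scores: six candidates answering four items, each marked out of 5.
scores = np.array([
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 1],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

A coefficient close to 1 indicates that candidates who do well on some items tend to do well on the others.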

An internal consistency level of over 0.85 is normally considered an acceptable level of internal reliability. However there are unavoidable trade-offs in the design of an examination system. For example, longer tests generally provide more internally reliable results, which is perhaps not surprising, but awarding organisations have to make practical decisions both about the demands upon candidates taking many exams and about the costs of setting and marking long papers.
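
The relationship between test length and internal reliability can be quantified with the Spearman–Brown prophecy formula, a standard classical test theory result offered here as an illustration rather than as any awarding organisation’s working method:

```latex
\rho_{\text{new}} \;=\; \frac{n\,\rho}{1 + (n - 1)\,\rho}
```

where ρ is the reliability of the existing test and n is the factor by which its length is changed. For example, doubling a test with a reliability of 0.80 would, other things being equal, be expected to raise its reliability to about 0.89.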

7. Classification accuracy

For some assessments, results are given in the form of test scores, and sometimes results include a confidence interval, for example a result of “34 out of 50 +/-3”, meaning that the candidate’s actual score on the test was 34 and that their true score is 95% likely to lie in the range 31 to 37.
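
One common way such an interval can be derived (the figures below are illustrative rather than taken from any particular qualification) uses the standard error of measurement, which links the spread of test scores to the reliability coefficient:

```latex
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho},
\qquad
\text{95\% interval} \;\approx\; X_{\text{observed}} \pm 1.96 \times \mathrm{SEM}
```

For instance, a test with a score standard deviation of around 7 marks and a reliability of 0.95 would have an SEM of roughly 1.5 marks, giving an interval of about plus or minus 3 marks.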

However, qualifications in England tend to be classified, with candidates’ results reported as a level or grade covering a range of marks. For example, in the age 11 (Key Stage 2) National Curriculum tests mentioned above, candidates receive a level from 2 to 5, and in GCSEs candidates receive a grade from A* to G, or U.

Measurement error in scores is often considered to be more tolerable provided the impact doesn’t change the grade allocated to a candidate. Clearly, candidates with scores near the grade boundaries are going to be more at risk of misclassification than those with scores in the middle of the band. It has been estimated that around 85% of candidates overall were classified accurately on the 2010 age 11 National Curriculum English Test, and around 90% on the mathematics test, probably indicating that mathematics tests are easier to mark reliably.

It is interesting to note that the more grades that are available for an exam, the greater the unreliability associated with the grade, because each grade covers a smaller range of marks and so candidates’ results are more susceptible to misclassification. If there are fewer grades, the percentage of misclassification falls but, on the other hand, the size of the error when a candidate’s grade is misclassified is greater. A balance needs to be struck between these two aspects.
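
This trade-off can be illustrated with a small simulation. Everything in the sketch below is invented – scores out of 100, a measurement error of 3 marks, equal-width grade bands – and it is not modelled on any real qualification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: "true" scores on a 0-100 scale, with normally
# distributed measurement error of 3 marks added to give observed scores.
true_scores = rng.uniform(0, 100, size=100_000)
observed = np.clip(true_scores + rng.normal(0, 3, size=true_scores.size), 0, 100)

for n_grades in (3, 6, 9):
    # Equal-width grade bands across the mark range.
    boundaries = np.linspace(0, 100, n_grades + 1)[1:-1]
    true_grade = np.digitize(true_scores, boundaries)
    observed_grade = np.digitize(observed, boundaries)
    accuracy = np.mean(true_grade == observed_grade)
    print(f"{n_grades} grades: {accuracy:.1%} of candidates receive their 'true' grade")
```

With more grade bands, more candidates sit close to a boundary, so a larger proportion receive a grade other than their “true” grade; with fewer bands that proportion falls, but each misclassification spans a wider range of marks.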

8. Composite reliability

Many qualifications are awarded on the basis of scores from multiple components (one or two written exams, coursework or controlled assessment, oral exams, etc.) by combining scores on each of the components. The reliability of a composite score is related to the reliabilities of its components, and is also affected by the way the scores are combined, and the extent to which the components themselves are correlated with each other. This is where the statistics get quite complicated to follow, and is an area where assessment research is still developing.
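
One standard classical test theory expression for the reliability of a weighted composite, offered as a sketch of the general idea rather than the specific method used by awarding organisations, assumes that the errors on the different components are uncorrelated:

```latex
\rho_C \;=\; 1 \;-\; \frac{\sum_i w_i^{2}\,\sigma_i^{2}\,(1 - \rho_i)}{\sigma_C^{2}}
```

where w_i, σ_i² and ρ_i are the weight, score variance and reliability of component i, and σ_C² is the variance of the composite score (which includes the covariances between the components). The numerator is the error variance contributed by the components, so more reliable components, and components that correlate more strongly with one another (which increases σ_C²), both raise the reliability of the composite.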

Researchers may also consider other aspects of reliability: inter-rater reliability looks at the differences in marks awarded by different examiners (for example by double marking examination papers). In some cases it may be possible for candidates to take two tests (in trials for example), so that inter-test reliability can be measured. As noted earlier, there are many different potential sources of unreliability in assessment results. One focus of modern research activities is on statistical ways of isolating the contribution that each source is making to the overall measure of reliability so that assessment process improvements can be made.

9. Is unreliability in assessment results inevitable?

A degree of variability in results from one set of assessments to the next is inevitable – this is measurement error, and is distinct from operational errors (mistakes in exam setting or marking activities etc., which are not inevitable). The challenge for awarding organisations is to minimise unreliability both by seeking to eliminate operational mistakes and by designing assessment systems that maximise reliability more generally.

This raises the question: when is an assessment reliable enough? Clearly the answer depends on the purposes to which assessment results are being put. Measurement error in a particular candidate’s results would be of little consequence for assessments used to provide information about how the education system is performing (for example the National Curriculum Tests in England), because it is the reliability of the overall results which would matter; even comparatively unreliable individual measures can, in combination, provide a reliable overall measurement. Unreliability in results which the candidates themselves use (for example to apply to university) is much less desirable. Many assessments in England have multiple uses – the National Curriculum Tests are used for measuring the performance of the education system as a whole, and of individual schools, but individual results are also provided to parents for their children. This makes judgements both about how to measure reliability and about the acceptable level of unreliability much more difficult, and this is an area where Ofqual’s research and policy are still developing.