Research and analysis

Classification accuracy in results from KS2 National Curriculum Tests: Summary

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Overview

Overview by: John Winkley, AlphaPlus Consultancy Ltd.

The report describes the results of a seminar and preceding research work investigating the reliability of level classifications in England’s key stage 2 national tests (for 11 year olds) in science, mathematics and English taken in 2009 and 2010.

Students receive a level from 2 to 5 for each of these tests, corresponding to progress against National Curriculum objectives. The issue under investigation is whether the levels received by students reflect their actual levels of performance - in other words, are the test results reliable in terms of their classification accuracy?

The different statistical methods used to calculate the accuracy of classification show similar results in that classification accuracy is estimated at around 85% for English, 87% for science and 90% for mathematics, a substantial improvement compared with the tests at their introduction in the late 1990s.

2. Introduction

In 1995, National Curriculum tests in science, mathematics and English were introduced in England for children in the final year of primary school education (year 6, age 11). In mathematics and English, these tests continue to this day (science tests continue for a small sample of 11 year olds). The tests consist of multiple parts – for example, the mathematics test includes a calculator paper (40 marks), a non-calculator paper (40 marks) and a mental arithmetic test (20 marks) – with a candidate’s total score being the sum of scores on the various parts.

For each test, children are awarded a level from 2 to 5, with most children expected to achieve a level 4. Children’s scores in the test are converted to levels by using level boundaries that are adjusted each year for test difficulty, with the intention of maintaining standards over time. For example, in the mathematics test (marked out of 100), the range for level 4 was 46 to 76 in 2009 and 46 to 78 in 2010 - this adjustment process is called ‘test equating’ and relies on a sample of candidates in 2009 taking both the 2009 live test and the 2010 pre-test.
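As a concrete (and partly hypothetical) illustration of this conversion, the sketch below maps a raw mathematics mark to a level using one year’s boundaries. Only the level 4 range quoted above (46 to 76 in 2009, which implies level 5 from 77) is taken from the report; the level 2 and level 3 thresholds are invented placeholders.

```python
# Minimal sketch of converting a raw KS2 mathematics mark to a level using a
# year's level boundaries. Only the level 4 range (46 to 76 in 2009, implying
# level 5 from 77) comes from the figures quoted above; the level 2 and
# level 3 thresholds below are hypothetical placeholders.
BOUNDARIES_2009_MATHS = [
    (77, 5),  # level 5: 77 and above (implied by the level 4 upper bound of 76)
    (46, 4),  # level 4: 46 to 76 (from the text)
    (21, 3),  # hypothetical level 3 threshold
    (12, 2),  # hypothetical level 2 threshold
]


def mark_to_level(mark, boundaries=BOUNDARIES_2009_MATHS):
    """Return the level for a raw mark, or None if below the level 2 threshold."""
    for threshold, level in boundaries:
        if mark >= threshold:
            return level
    return None


print(mark_to_level(52))  # 4
print(mark_to_level(78))  # 5
```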

In 2001, research demonstrated that unreliability in the tests meant that up to 30% of children taking the tests could be misclassified, ie awarded a level that did not correspond to their actual level of achievement. Research for Ofqual’s reliability programme in 2010 shows that misclassification estimates have fallen to 10 to 15%, depending on the subject.

This research summary looks at the factors that contribute to examination unreliability and how unreliability and, in particular, misclassification can be measured.

3. What is classification accuracy and how is it calculated?

Classification accuracy is a measure of examination reliability which is used when the results of the examinations are reported as a grade or level (rather than a score). Reliability is a key component in examination fairness - it refers to the extent to which a group of candidates would obtain the same results for an assessment irrespective of who marks their papers, what types of question are used (for example, multiple-choice or essay questions), which topics are set or chosen to be answered on a particular year’s paper, or when the examination is taken. While reliability often considers the repeatability of scores, classification consistency looks at whether candidates would get the same grade on different test occasions - the public would expect that a KS2 candidate sitting several years’ test papers would receive the same level each time, whereas they might be less concerned if the candidate’s scores varied a little from paper to paper.

There are a number of different statistical methods for calculating classification accuracy, all of which involve making assumptions about the test data so the analysis results are necessarily estimates (some of these assumptions are discussed below). In this project, researchers used six different models and found that the results were very similar (within three or four percentage points of each other on classification accuracy).

All the approaches are based on the same fundamental steps, illustrated in the sketch after the list (the principles and outcomes are comprehensible to non-mathematicians even if the actual methods are not):

  • modelling and estimating all candidates’ ‘true scores’ (the true measure of their ability – the average score they would receive if they took many tests of the same subject at the same time – a theoretical concept really)
  • from this, estimating candidates’ ‘true levels’ based on the modelled ‘true scores’
  • comparing ‘true levels’ with the observed levels
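A simplified simulation of these steps, assuming normally distributed true scores, a constant standard error of measurement and invented level boundaries (none of which are taken from the report), might look like this:

```python
# A minimal simulation of the three steps above, under simplified assumptions:
# 'true scores' are drawn from a normal distribution, measurement error is
# uniform across the score range (normal, constant standard error), and the
# level boundaries are illustrative rather than real KS2 values.
import random

random.seed(1)

BOUNDARIES = [(77, 5), (46, 4), (21, 3), (12, 2)]  # illustrative cut scores


def to_level(score):
    for threshold, level in BOUNDARIES:
        if score >= threshold:
            return level
    return 1  # below level 2 in this toy model


def simulate_accuracy(n_candidates=100_000, sem=3.5):
    """Proportion of candidates whose observed level matches their 'true' level."""
    matches = 0
    for _ in range(n_candidates):
        true_score = random.gauss(55, 15)             # step 1: model the true score
        observed = true_score + random.gauss(0, sem)  # add measurement error
        # steps 2 and 3: classify both scores and compare the resulting levels
        if to_level(true_score) == to_level(observed):
            matches += 1
    return matches / n_candidates


print(f"Estimated classification accuracy: {simulate_accuracy():.2%}")
```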

The different methods used have different benefits: some try to allow for the fact that tests may be more reliable for some scores than others (e.g. the KS2 tests tend to classify candidates more reliably at level 5 than at level 2); others use a simple model of uniform error across the range of scores. One method (called Item Response Theory) seeks to create a scale of difficulty for items and a scale of ability for candidates. This allows statistical comparisons of candidates who took tests in one year with other candidates who took a different test in another year. ‘Separating out’ a candidate’s ability from a question’s difficulty is very attractive in looking at how candidates would perform on multiple tests (without them actually having to sit all the tests) but it relies on some big assumptions – for example that the test assesses a single trait (such as ‘a candidate’s ability in science’) and doesn’t allow for the fact that some students perform better than their peers in some topics and less well in others.
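The sketch below illustrates this separation of ability and difficulty using the simplest IRT model (the Rasch model); the report does not specify which IRT model was used, and all the ability and difficulty values here are invented.

```python
# A sketch of the core Item Response Theory idea described above, using the
# simplest (Rasch) model: the chance of answering an item correctly depends
# only on the gap between the candidate's ability and the item's difficulty.
import math


def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response to a single item."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability, item_difficulties):
    """Expected total score on a test made up of the given items."""
    return sum(p_correct(ability, d) for d in item_difficulties)


# Because ability and difficulty sit on the same scale, the same candidate can
# be 'placed' on two different years' tests without sitting both of them.
test_2009 = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical item difficulties
test_2010 = [-0.8, -0.2, 0.3, 0.9, 1.4]   # a slightly harder hypothetical test

for ability in (-1.0, 0.0, 1.0):
    print(ability,
          round(expected_score(ability, test_2009), 2),
          round(expected_score(ability, test_2010), 2))
```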

4. What factors affect reliability and classification accuracy?

Many things can affect reliability: marker inconsistency, differences in test papers from year to year in terms of content coverage, the types of questions used, the difficulty of the questions, and where the level boundaries are set.

Classification accuracy is also affected by how many levels there are in the test results, and the consequent mark range for each level. A test may provide people with more information by having more levels (consider, for example, the introduction of the A* grade at GCSE and A level), but this will reduce classification accuracy across all levels.
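A toy simulation (with invented scores, error and boundaries) illustrates the trade-off: holding measurement error fixed, classifying the same candidates against more cut scores gives a lower proportion of correct classifications.

```python
# Toy illustration of the trade-off described above: with the same measurement
# error, adding more grade boundaries (finer reporting) lowers the chance that
# a candidate's observed grade matches their 'true' grade. All numbers are
# invented for illustration.
import random

random.seed(2)


def accuracy(cuts, n=100_000, sem=3.5):
    hits = 0
    for _ in range(n):
        true_score = random.gauss(55, 15)
        observed = true_score + random.gauss(0, sem)
        # a score's grade is the number of cut scores it reaches or exceeds
        if sum(true_score >= c for c in cuts) == sum(observed >= c for c in cuts):
            hits += 1
    return hits / n


few_levels = [21, 46, 77]                   # 4 reporting categories
many_levels = [15, 25, 35, 45, 55, 65, 77]  # 8 reporting categories

print(f"Fewer levels: {accuracy(few_levels):.2%}")
print(f"More levels:  {accuracy(many_levels):.2%}")
```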

5. Classification accuracy results for Key Stage 2 tests in 2009 and 2010

The first analysis undertaken looks at the internal consistency of the test – if the test measures a single trait, then good candidates would be expected to do better than weaker candidates on all items. An item that isn’t consistent in this way is testing something other than the single trait, and undermines the internal reliability of the test (and one of the key assumptions for the statistics). KS2 tests in 2009 and 2010 have high levels of internal consistency, particularly when it is considered that the tests are marked by people and the scores therefore include a degree of variance due to marker subjectivity.
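As an illustration, one widely used internal-consistency statistic is Cronbach’s alpha; the report does not state which statistic was calculated, and the item scores below are invented.

```python
# Sketch of one common internal-consistency statistic (Cronbach's alpha);
# this is illustrative only. Rows are candidates, columns are item scores.


def cronbach_alpha(scores):
    """scores: list of candidates, each a list of item scores."""
    n_items = len(scores[0])
    totals = [sum(candidate) for candidate in scores]

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([c[i] for c in scores]) for i in range(n_items)]
    total_variance = variance(totals)
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


toy_data = [
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
]
print(round(cronbach_alpha(toy_data), 3))
```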

Classification accuracy for mathematics was around 90%. The classification accuracy for English was around 85%, and science around 87%. The six methods provided classification accuracies mostly within two percentage points of each other.

These classification accuracies are much higher than those reported in the early days of Key Stage 2 national tests. The increase in classification accuracy is likely to relate to improvements in the reliability of the tests and changes in their structure.

It should be noted that classification accuracy is an estimate based on a population as a whole – it cannot be used for an individual candidate. A candidate whose score is close to a boundary mark is more likely to be misclassified than one whose score is in the middle of the mark range for the level.
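This can be illustrated with a simple normal error model; the standard error of 3 marks and the true scores below are invented, and only the level 5 boundary of 77 is implied by the 2009 figures quoted earlier.

```python
# Rough illustration of the point above: under a normal measurement-error
# model, a candidate whose true score sits just below a boundary is far more
# likely to be awarded the level above it than a candidate whose true score
# sits in the middle of the level.
import math


def prob_at_or_above(boundary, true_score, sem=3.0):
    """P(observed score >= boundary) given the true score and error SD."""
    z = (boundary - true_score) / sem
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))


LEVEL_5_BOUNDARY = 77  # implied by the 2009 level 4 range of 46 to 76

for true_score in (60, 70, 75, 76):
    p = prob_at_or_above(LEVEL_5_BOUNDARY, true_score)
    print(f"true score {true_score}: {p:.1%} chance of being classified level 5")
```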

Finally, it is clear from the work on the Ofqual Reliability Programme that technical terminology presents problems in many areas of assessment reliability. In this context, ‘candidate level misclassification’ suggests in everyday usage that a mistake has been made in handling the exams process (for example, that a marking error has occurred). The statistical term instead refers to measurement error between the observed score and the notional ‘true’ score - not so much a mistake as a measure of the variance between the notional score (the average score the candidate would receive if they took many tests of the subject one after another) and the actual score they achieved on the particular paper they sat.