Research and analysis

Partial Estimate of Reliability: Parallel Form Reliability in the KS2 Science Tests: Summary

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Overview

Overview by: Andrew Watts, AlphaPlus Consultancy Ltd.

In this study, various methods were used to analyse the reliability of the key stage 2 (KS2, year 6) science tests, comparing the results obtained when the same pupils took parallel or equivalent papers in the annual pre-tests and live tests. The purpose of the research was to understand the different ways in which reliability in test results might be measured and to describe how much uncertainty could be expected in national test and examination results.

2. Introduction

The aim of this study was to quantify how likely it would be for a key stage 2 (KS2, year 6) pupil to be awarded a different national curriculum level if he or she took a different version of the same science test. Ofqual asked the National Foundation for Educational Research (NFER) to conduct the study after discussion in the press about the extent of misclassification (pupils getting the wrong level) in national tests. One academic had claimed that as many as 30% of pupils were likely to be awarded the incorrect level.

3. Parallel forms of a test

Having been responsible, for some years, for creating national science tests for 11-year-olds in England (KS2) under contract to the Qualifications and Curriculum Development Agency (QCDA), NFER was in a position to compare the reliability of tests from different years. Each year it used to pre-test the following year’s science test papers. This was done about 4 to 6 weeks before the actual tests were taken, with a representative sample of pupils. NFER would then compare the pupils’ results on the pre-test papers with their results on the actual KS2 tests that year. The pupils thus sat parallel forms of the science tests - that is, the questions in the tests, though different, assessed the same things in the same way. After the tests the pupils had both a final score, out of 80 marks (from two papers of 40 marks each), and a national curriculum level of 3, 4 or 5. During the pre-test the pupils also took an ‘anchor test’, parallel in format to one of the main papers, which was given to pre-test pupils each year. This was used to check whether the demands of the papers were the same from year to year.

The report acknowledges that there would be a number of reasons why pupils might have obtained different results in the two sets of papers. Some of these differences would be described as ‘random measurement errors’: some pupils might by chance have done better on one set of papers than on the other, or might have been affected by the way they felt on the day. Subtle variations in the test papers may also have affected the results; in other words, the different papers may have been testing slightly different things. There may also have been differences in the way the papers were marked, in this case because one set of papers was marked as part of the pre-test and the other as part of the ‘live’ (actual) national test marking. This ‘pre-test effect’ could also include the fact that the pupils took the pre-test less seriously, or were not as well prepared for it. All of this illustrates that it is not in fact possible to reproduce the exact circumstances in which tests are taken.

So, each pupil in the study had two results, one from the pre-test and one from the live test. The study aimed to find out how similar these two results were, and how reliable each result was. The more reliable they were, the more we would expect each test to give the same result.

4. Measuring reliability

When all the results had been collected, various methods were used to measure the reliability of the results. The first was a straightforward comparison of the levels each pupil obtained on the two tests: the pre-test and the live test. In 2005, 72% of pupils would have received the same level from the two tests, whereas in 2009 79% would have received the same level. Over the six years studied there was an upward trend, so that a greater proportion of pupils would have attained the same level in the later years. Almost all the remaining pupils were classified at an adjacent level; in four of the five years reported on, only 1% of pupils were awarded levels that differed by more than one.
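As a rough illustration of how such an agreement figure can be computed, the sketch below compares the levels implied by two sets of paired scores for the same pupils. The scores and the level cut-offs are invented for illustration only; they are not taken from the study.

```python
# Illustrative sketch (not the study's code): percentage of pupils awarded
# the same national curriculum level on two parallel tests.
import numpy as np

def level_from_score(score):
    """Map a raw score out of 80 to a level; these cut scores are invented."""
    if score >= 60:
        return 5
    if score >= 40:
        return 4
    return 3

# Hypothetical paired scores for the same pupils on pre-test and live test.
pretest_scores = np.array([35, 52, 61, 44, 70, 38, 55])
live_scores    = np.array([39, 49, 65, 41, 68, 42, 57])

pretest_levels = np.array([level_from_score(s) for s in pretest_scores])
live_levels    = np.array([level_from_score(s) for s in live_scores])

same_level = np.mean(pretest_levels == live_levels) * 100
adjacent   = np.mean(np.abs(pretest_levels - live_levels) == 1) * 100
print(f"classification consistency: {same_level:.0f}% same level, "
      f"{adjacent:.0f}% at an adjacent level")
```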

Reliability is also measured in terms of the internal consistency of the results obtained (in other words, were all the items within the test leading to the same result?). If the internal consistency of a test is the same as that of a parallel test, that is evidence that the two are equally reliable. A statistical measure called ‘Cronbach’s alpha’ was always reported for the KS2 national tests as a way of reporting the internal consistency of each year’s test. An internal consistency of 0.8 or higher is usually taken as an indication of a reliable test; for the KS2 science tests in the six years studied, the values were, in all but one case, in the range 0.81 to 0.89.
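Cronbach’s alpha is calculated from the variances of the individual item scores and of the total scores. A minimal sketch is given below; the item-response matrix is simulated, so the resulting value only illustrates the calculation and is not a figure from the study.

```python
# Illustrative sketch: Cronbach's alpha from a pupils-by-items score matrix.
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = pupils, columns = items."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# Invented data: 200 pupils, 40 dichotomous items of varying difficulty
# (each KS2 science paper carried 40 marks).
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 40))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
scores = (rng.random((200, 40)) < p_correct).astype(int)

print(f"alpha = {cronbach_alpha(scores):.2f}")
```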

Another way of comparing the reliability of the parallel tests is to look at the rank order of pupils’ scores in the two tests. If there were significant differences in the rank orders, it would suggest that the two tests were affecting different pupils in different ways. The rank order correlations reported in this study are fairly consistent across the years and generally exceed 0.8.
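A rank order correlation of this kind is typically a Spearman correlation between the paired scores; the sketch below assumes that choice of statistic and uses invented paired scores for the same pupils.

```python
# Illustrative sketch: rank-order (Spearman) correlation between two
# parallel tests, using invented paired scores.
from scipy.stats import spearmanr

pretest_scores = [35, 52, 61, 44, 70, 38, 55, 29, 66, 48]
live_scores    = [39, 49, 65, 41, 68, 42, 57, 33, 62, 50]

rho, p_value = spearmanr(pretest_scores, live_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```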

5. Reliability results

The results reported provide different measures of the reliability of the KS2 science tests in the years covered by the study. In the years 2007 to 2009, 79% of pupils gained the same level in both the pre-test and the live test papers. However, the report distinguishes between ‘classification consistency’ and ‘classification correctness’. The figures above refer to classification consistency: they tell us the extent to which the two tests agreed, but not that the levels awarded were necessarily correct. The level may have been incorrect in both tests. ‘Classification correctness’, on the other hand, is the probability that the level a pupil is awarded in a test is the correct one: the ‘true’ level, the one which accurately corresponds to what the pupil can really achieve in the subject. Measures of classification correctness will give higher results than measures of classification consistency, because only the results from one test are being looked at. Calculations of the classification correctness of the tests in this study were reported for 2005 to 2009; they show that between 83% and 88% of pupils would have been given the correct level when sitting the tests described in this study.
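The difference between the two measures can be illustrated with a simple simulation (a sketch of the general point, not the method used in the report): if random measurement error is added twice to a set of ‘true’ scores, the two observed results agree with each other less often than either of them agrees with the true level. All the numbers below are invented.

```python
# Illustrative simulation: classification consistency (agreement between two
# error-prone results) versus classification correctness (agreement of one
# observed result with the 'true' level).
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(48, 12, size=100_000).clip(0, 80)
obs_a = (true_scores + rng.normal(0, 4, true_scores.shape)).clip(0, 80)
obs_b = (true_scores + rng.normal(0, 4, true_scores.shape)).clip(0, 80)

def level(scores):
    # Invented cut scores of 40 and 60 marks, giving levels 3, 4 or 5.
    return np.digitize(scores, [40, 60]) + 3

consistency = np.mean(level(obs_a) == level(obs_b)) * 100
correctness = np.mean(level(obs_a) == level(true_scores)) * 100
print(f"consistency ~ {consistency:.0f}%, correctness ~ {correctness:.0f}%")
```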

The results of the KS2 science tests were compared with the results of a similar study of KS2 results in English, which gave classification consistencies of 73% and 67% for reading and writing respectively, for the one year that was analysed. The higher consistency figures obtained in science are likely to be the result of the more objective questions used in the science tests. The classification correctness study of the English tests produced a figure of 84%. For 2009 this report therefore gives a classification error rate of 12% for KS2 science and 16% for English, which compares favourably with the 30% of potential misclassification that was reported in the national press (see introduction).

6. Possible further studies of test reliability

The report illustrates how difficult it is to set two tests that are exactly parallel. The tests studied here had good internal consistency (Cronbach’s alpha) but, when the results were correlated, there was less similarity. This could suggest that the two parallel tests were not measuring exactly the same things: they were possibly not entirely parallel. The KS2 science tests, however, were fairly short (40 marks), and one way to increase reliability is to increase the number of questions in a test. (For these 11-year-old pupils, 45 minutes per paper was considered as long as they could be asked to concentrate.) A calculation can be made to estimate what would have happened if the tests had had double the number of marks available. The report shows that this could have resulted in projected alpha values of 0.93 to 0.95 for the papers used in 2005 to 2009. The report suggests that these results give additional reassurance that the papers were consistent in their reliability.
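A projection of this kind is commonly made with the Spearman-Brown prophecy formula; it is assumed here that this, or something equivalent, is the calculation referred to. The sketch below applies the formula to the end-points of the alpha range quoted above, so its output will not necessarily reproduce the report’s 0.93 to 0.95 exactly.

```python
# Spearman-Brown prophecy formula: projected reliability when a test is
# lengthened by a factor n (here n = 2, i.e. doubling the number of marks).
def spearman_brown(reliability, n=2):
    return n * reliability / (1 + (n - 1) * reliability)

for alpha in (0.81, 0.89):  # end-points of the observed range quoted above
    print(f"alpha {alpha:.2f} -> projected {spearman_brown(alpha):.2f}")
```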

The report also looks at the additional information that can be gained from applying Item Response Theory (IRT) to the analysis of the tests. This method gives measures of decision accuracy and decision consistency. The report suggests that these measures should be included in future test development reports, though they would be an additional form of internal test consistency rather than the parallel form reliability with which this report is concerned. The value of IRT analyses is that the measures of decision accuracy and consistency can be calculated from a single test, which would be a useful alternative to methods that require studies of parallel tests to be set up.
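As an indication of how decision accuracy can be estimated from a single test, the sketch below follows one common IRT-based approach (an assumption, not necessarily the report’s method): each pupil’s estimated ability and its standard error are used to estimate the probability that the pupil falls on the correct side of a level boundary. All the numbers are invented.

```python
# Illustrative sketch of an IRT-style decision accuracy estimate from a
# single test, assuming normally distributed error in the ability estimates.
import numpy as np
from scipy.stats import norm

theta_hat = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])    # invented ability estimates
se        = np.array([0.35, 0.30, 0.30, 0.32, 0.40])  # invented standard errors
theta_cut = 0.0                                        # invented level boundary

# Probability that each pupil's true ability lies on the same side of the
# cut score as the estimate.
p_same_side = np.where(theta_hat >= theta_cut,
                       1 - norm.cdf(theta_cut, loc=theta_hat, scale=se),
                       norm.cdf(theta_cut, loc=theta_hat, scale=se))
decision_accuracy = p_same_side.mean()
print(f"estimated decision accuracy: {decision_accuracy:.2f}")
```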

7. Conclusion

The report suggests a classification consistency in the KS2 science tests of more than 80%, which compares favourably with the prediction noted earlier. An area which the report recommends for further study is an investigation of the reasons for the differences in pupils’ performances on parallel forms of the same test. It makes clear that the types of question used in a test can lead to a greater variety of reliability outcomes than has been found in the annual KS2 science tests, in which the question formats have remained stable for at least five years. Questions which elicit more varied answers from pupils and which demand more subjective marking will show less consistency, as will tests and exams in which pupils are classified into a greater number of grades.

The writers hope that the report will help to build up a picture of the range of reliability measures that can be used, and that this could help to inform policy decisions about acceptable levels of classification correctness or classification consistency.