Research and analysis

International survey of results reporting: Summary

Published 16 May 2013

Applies to England, Northern Ireland and Wales

1. Introduction

The Ofqual reliability programme has explored ways in which uncertainties connected with assessment results can be estimated and communicated to those receiving the results. It was therefore a logical step in the programme to commission a survey of how this issue is dealt with in countries around the world. A small team from the National Foundation for Educational Research (NFER) took on this task and conducted a literature review on the reporting of assessment results. On the basis of that literature review, the team developed a taxonomy of the key characteristics of assessment systems, which they then used to analyse 26 assessment systems from many different countries.

2. Literature review

A crucial issue highlighted in the literature review is the need to match any analysis of the reporting of assessment results to the intended purpose of the test, assessment or examination. It clearly makes a difference whether the results are intended to compare students in one area of the curriculum and within a class, school or local area, or whether they are to be used as the basis for determining university entrance in a setting where there is much competition for a limited number of places. Other key issues are the extent to which complicated patterns of performance are summarised through limited systems of numerical or letter grading; whether results are standardised statistically, so that they all conform to a fixed distribution curve; and whether the assessment system purports to be norm-referenced or standards-referenced. In a norm-referenced system the proportion of students allocated each result is pre-determined (say, 10% get Grade A), whereas in a standards-referenced system students are assessed against a set of explicit standards and their result depends on whether they are judged to have met the standard.

Within each of these approaches there is a further issue over the length of any scale used to report individual assessments (e.g. a 5-grade scale such as A to E), which is linked to perceptions of how accurate and precise assessments are, because a long scale with many fine distinctions can be taken to imply that small distinctions are meaningful. Some assessments report raw scores as percentages, but it is much more common for results to be issued against points on a fixed scale. Typically such scales include anything from two points (e.g. pass and fail) to fifteen. Very short scales are thought to run the risk of discarding useful information about a student’s performance, whereas longer scales may increase the likelihood of students being misclassified.
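To make the distinction between norm-referenced and standards-referenced grading concrete, the following minimal Python sketch may help. It is not drawn from the report: the scores, the 10% share and the cut score of 70 are hypothetical values chosen only for illustration. It contrasts awarding Grade A to a fixed proportion of candidates with awarding it to every candidate who meets a fixed standard.

    # Illustrative sketch only: the grade labels, cut score and 10% share are
    # hypothetical and do not describe any system surveyed in the report.

    def norm_referenced_grades(scores, top_share=0.10):
        """Award Grade A to a pre-determined share of candidates (here the top 10%)."""
        ranked = sorted(scores, reverse=True)
        n_as = max(1, round(top_share * len(scores)))
        cut = ranked[n_as - 1]  # lowest score still inside the pre-determined share
        return {s: ("A" if s >= cut else "below A") for s in scores}

    def standards_referenced_grades(scores, standard=70):
        """Award Grade A to every candidate judged to have met a fixed, explicit standard."""
        return {s: ("A" if s >= standard else "below A") for s in scores}

    scores = [45, 52, 58, 63, 67, 71, 74, 78, 83, 90]
    print(norm_referenced_grades(scores))       # only the top 10% (one score here) gets an A
    print(standards_referenced_grades(scores))  # every score of 70 or more gets an A

The point of the contrast is that in the first case the share of Grade As is fixed in advance regardless of how well the cohort performs, whereas in the second the share of Grade As rises or falls with performance against the standard.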

Finally, the literature review indicated that there are very few assessment systems known to quantify the error or uncertainty associated with assessment results and report it to candidates. This stands in stark contrast to the extensive research literature that discusses the unreliability of educational assessments and examinations and the consequences, especially in high-stakes assessment systems, of acting upon individual results when there is a reasonable degree of uncertainty associated with them. Nevertheless, the literature review was used to develop a taxonomy of the key characteristics of assessment systems. This covered issues such as how results are reported, whether assessments are internal or external, the methods of assessment used and, crucially for the programme, whether the reporting of results included any reference to possible uncertainty or error.

3. The international survey of reporting assessment results

The main part of this report focuses on the results of the international survey of assessment systems, and covers 26 different examples. Of the systems studied, four are from different states in Australia, three cover different certificates in Ireland, and a further three cover different certificate systems in the Netherlands. Each system is analysed in relation to nine areas specified in the taxonomy. Overall, the taxonomy covered six descriptive elements about how assessments were carried out, two categories related to the type of results produced, and a final one indicating whether any reference was made to error or uncertainty in reporting the results.

Key distinctions in this analysis relate to whether different assessments are combined to form some kind of diploma award, lead to free-standing individual subject awards, or form part of a credit accumulation system. Another key issue is whether results are reported as grades, marks or an overall profile, or some combination of these. There is a great deal of variation across the 26 systems reviewed, and the report provides a very useful summary of the different approaches.

For the Reliability Programme the most important areas covered in this survey are the amount of detail given in results, and whether that detail is linked in any way to estimates of possible error or uncertainty. A pattern does emerge: the systems that report the greatest level of detail tend to be the ones most likely to refer to the uncertainty associated with specific results. Such systems are, however, in the minority; the majority offer definitive results on short scales, as has been the custom in the UK (e.g. A to E grades).

The two most interesting examples in this respect are the ACT and the SAT, both US entrance tests for higher education. The SAT results include raw scores, scaled scores, percentiles and means, and include the score range within which a student’s score falls. This essentially uses an estimate of the standard error of measurement to place each score within a band of scores (say, 520 to 560) within which a student’s score might vary if they were tested several times. This information is transmitted to college admissions tutors as well as to the individual students.

In contrast, the ACT produces composite scores, sub-scores, comparisons with national and state averages and guidance on benchmark scores for entry to higher education, along with the student’s high school grade point average. Interestingly, rather than attempt to estimate levels of uncertainty, the ACT simply addresses the issue through a health warning: ‘Your test results are only estimates, not precise measures, of your educational development’.
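The score band reported by the SAT can be read as the reported score plus or minus a multiple of the standard error of measurement. The minimal Python sketch below is illustrative only and is not taken from the report or from SAT documentation: the score of 540, the SEM of 20 and the use of one SEM either side are hypothetical values chosen simply to reproduce the 520 to 560 example quoted above.

    # Illustrative sketch only: 540 and an SEM of 20 are hypothetical values used
    # to reproduce the 520 to 560 band quoted above; they are not published figures.

    def score_band(reported_score, sem, multiplier=1):
        """Band within which a score might be expected to vary on repeated testing,
        taken here as the reported score plus or minus a multiple of the SEM."""
        half_width = multiplier * sem
        return reported_score - half_width, reported_score + half_width

    low, high = score_band(reported_score=540, sem=20)
    print(f"Reported score: 540, likely score band: {low} to {high}")  # 520 to 560

The choice of multiplier governs the trade-off noted earlier: a wider band (a larger multiple of the SEM) is more likely to contain the candidate’s ‘true’ score, but conveys a less precise result.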

Overall this international survey revealed little in terms of the very specific interests of the Ofqual reliability programme, but it did at least turn up a few examples of ways in which the notion of uncertainty can be brought into the reporting of assessment results. The general approach of reporting a score within tolerance limits is, of course, well-accepted practice in other fields where imprecise measurements are made.

4. Conclusions

The report ends with a few cautionary notes about how culture-specific educational assessment systems may be, and the implications this can have for the expectations of the general public and of specific users of assessment results. It warns about the high-stakes use of uncertain assessment results in settings where the number of places is strictly limited, for example entry to higher education courses. Finally, it notes the familiar tension that arises when reporting uncertainty in assessment results is proposed: the fear that this might lead to a disproportionate drop in confidence in the results and in those responsible for producing them.